EXPLOITING BERT FOR MALFORMED SEGMENTATION DETECTION TO IMPROVE SCIENTIFIC WRITINGS
Abdelrahman Halawa
ahalawa@azhar.edu.eg
Al-Azhar University (Egypt)
https://orcid.org/0009-0004-7107-1049
Shehab Gamalel-Din
(Egypt)
https://orcid.org/0000-0002-0696-6119
Abdurrahman Nasr
(Egypt)
Abstract
Writing well-structured scientific documents, such as articles and theses, is vital for conveying a document's argument and making its message understood. Structure also affects how much time and effort a reader needs to study the document. Proper document segmentation further improves the results of automated Natural Language Processing (NLP) algorithms, including summarization and other information retrieval and analysis functions. Unfortunately, inexperienced writers, such as young researchers and graduate students, often struggle to produce well-structured professional documents. Their writing frequently exhibits improper segmentation or lacks semantically coherent segments, a phenomenon referred to here as "mal-segmentation." Examples of mal-segmentation include improper paragraph or section divisions and abrupt transitions between sentences and paragraphs. This research addresses mal-segmentation in scientific writing by introducing an automated method for detecting it that uses Sentence Bidirectional Encoder Representations from Transformers (sBERT) as the encoding mechanism. The experimental results are promising for detecting mal-segmentation with sBERT.
Keywords:
NLP, text segmentation, mal-segmentation, BERT
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.