EXPLOITING BERT FOR MALFORMED SEGMENTATION DETECTION TO IMPROVE SCIENTIFIC WRITINGS

Abdelrahman Halawa

ahalawa@azhar.edu.eg
Al-Azhar University (Egypt)
https://orcid.org/0009-0004-7107-1049

Shehab Gamalel-Din


(Egypt)
https://orcid.org/0000-0002-0696-6119

Abdurrahman Nasr


(Egypt)

Abstract

Writing well-structured scientific documents, such as articles and theses, is vital for conveying a document's argumentation and making its messages understood. Structure also affects how efficiently, and how quickly, a reader can study the document. Proper document segmentation further yields better results when automated Natural Language Processing (NLP) algorithms are applied, including summarization and other information retrieval and analysis functions. Unfortunately, inexperienced writers, such as young researchers and graduate students, often struggle to produce well-structured professional documents. Their writing frequently exhibits improper segmentation or lacks semantically coherent segments, a phenomenon referred to as "mal-segmentation." Examples of mal-segmentation include improper paragraph or section divisions and abrupt transitions between sentences and paragraphs. This research addresses mal-segmentation in scientific writing by introducing an automated method for detecting it that uses Sentence Bidirectional Encoder Representations from Transformers (sBERT) as the encoding mechanism. The experimental results are promising for the detection of mal-segmentation using the sBERT technique.
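The abstract does not spell out the detection procedure, so the following is only a minimal sketch of one common way sentence embeddings can be used to spot abrupt transitions: compare each sentence's vector with the next one and flag pairs whose cosine similarity falls below a threshold. The function names, the threshold value, and the three-dimensional toy vectors (standing in for real sBERT embeddings) are all assumptions made for illustration, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def flag_weak_transitions(embeddings, threshold=0.5):
    """Return each index i where sentence i and sentence i + 1 are dissimilar.

    `embeddings` is a list of sentence vectors in document order.
    `threshold` is a hypothetical cut-off; a real system would tune it.
    """
    return [
        i
        for i in range(len(embeddings) - 1)
        if cosine(embeddings[i], embeddings[i + 1]) < threshold
    ]


# Toy vectors standing in for sBERT sentence embeddings (assumption).
toy_paragraph = [
    [1.0, 0.0, 0.0],  # sentence 0
    [0.9, 0.1, 0.0],  # sentence 1: close in meaning to sentence 0
    [0.0, 0.0, 1.0],  # sentence 2: unrelated to sentence 1
]

weak = flag_weak_transitions(toy_paragraph)
print(weak)
```

In this toy input only the transition from sentence 1 to sentence 2 is flagged. With real text, the vectors would come from a sentence encoder rather than being written by hand.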


Keywords:

NLP, text segmentation, mal-segmentation, BERT


Published
2023-06-30

Cited by

Halawa, A., Gamalel-Din, S., & Nasr, A. (2023). EXPLOITING BERT FOR MALFORMED SEGMENTATION DETECTION TO IMPROVE SCIENTIFIC WRITINGS. Applied Computer Science, 19(2), 126–141. https://doi.org/10.35784/acs-2023-20

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.

