Examination of text's lexis using a Polish dictionary

Roman Voitovych

roman.voitovych@pollub.edu.pl
Lublin University of Technology (Poland)

Edyta Łukasik


Lublin University of Technology (Poland)
https://orcid.org/0000-0003-3644-9769

Abstract

This paper presents an approach to comparing and classifying books written in Polish on the basis of their lexical fields. Books can be classified by features such as literature type, literary genre, style, or author. Using a preassembled dictionary and the Jaccard index, we were able to confirm a compactness hypothesis concerning similar books. Further analysis with the PAM (Partitioning Around Medoids) clustering algorithm revealed a lexical connection between books of the same type or by the same author. The generally stable similarity values within any particular field, together with some anomalous tendencies in other cases, suggest that other features could also be recognised. The method presented in this article allows conclusions to be drawn about the connection between arbitrary books based solely on their vocabulary.
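As an illustration of the vocabulary comparison described above, the following minimal sketch computes the Jaccard index of two texts after filtering their tokens through a dictionary. It is not the authors' implementation: the function names `vocabulary` and `jaccard`, the regular-expression tokenisation, the five-word toy dictionary, and the two sample sentences are all assumptions made for the example.

```python
import re


def vocabulary(text, dictionary):
    """Return the set of lowercased tokens in `text` that occur in `dictionary`.

    Tokenisation with a simple regular expression is an assumption of this
    sketch; it does no lemmatisation or inflection handling.
    """
    tokens = re.findall(r"\w+", text.lower())
    return {token for token in tokens if token in dictionary}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy data standing in for a real dictionary and real books.
dictionary = {"kot", "pies", "dom", "las", "rzeka"}
book_a = "Kot i pies. Dom, las!"
book_b = "Pies, dom i rzeka."

vocab_a = vocabulary(book_a, dictionary)
vocab_b = vocabulary(book_b, dictionary)
similarity = jaccard(vocab_a, vocab_b)
print(f"Jaccard similarity: {similarity:.2f}")
```

The value `1 - similarity` can serve as a pairwise distance, which is the form of input that a medoid-based clustering method such as PAM works with.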


Keywords:

natural language processing, lexis analysis, Jaccard similarity coefficient, Partitioning Around Medoids

Published
2021-12-30

Voitovych, R., & Łukasik, E. (2021). Examination of text’s lexis using a Polish dictionary. Journal of Computer Sciences Institute, 21, 316–323. https://doi.org/10.35784/jcsi.2731
