DIGITAL NEWS CLASSIFICATION AND PUNCTUATION USING MACHINE LEARNING AND TEXT MINING TECHNIQUES

Fernando Andrés CEVALLOS SALAS

fcevallosepn@gmail.com
Escuela Politécnica Nacional (Ecuador)
https://orcid.org/0009-0002-5222-2599

Abstract

The persistent growth of information in recent decades, along with the development of new information technologies for managing it, has made it essential to develop systems that can synthesize this massive volume of information, better known as big data. This article presents a feedback-based system for the massive processing of digital newspapers. The system synthesizes the most relevant information from news stories obtained from several sources. It is fed with information from the Internet using web scraping techniques, and all of this information is stored in a data lake implemented with NoSQL databases. Data processing is then performed, focusing on words, their relevance, and their correlation with other words from related content groups or headlines. To perform this grouping, machine learning techniques, namely a Large Language Model (LLM) and K-Nearest Neighbors (KNN), are used together with text mining techniques. New text mining algorithms are also developed to adjust thresholds during content aggregation and synthesis. Finally, a results visualization mechanism is presented that allows users to assign a score (punctuation) to the news stories. This user score acts as feedback to the system and is incorporated into the global score, which is the basis for displaying the results. The system can be useful for summarizing the information contained in news stories available on the Internet, giving users a fast way to stay informed.
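The grouping and feedback steps described in the abstract can be illustrated with a minimal Python sketch. It is not the system presented in the article: the word-count cosine similarity, the 1-nearest-neighbour grouping rule, the 0.5 threshold, the 0.3 feedback weight, and all function names (tokenize, cosine, group_headlines, global_score) are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a headline and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_headlines(headlines, threshold=0.5):
    """Give each headline the group of its most similar earlier headline
    (a 1-nearest-neighbour rule), or open a new group when the best
    similarity is below the threshold. The threshold is hypothetical."""
    vectors = [Counter(tokenize(h)) for h in headlines]
    labels = []
    for i, vec in enumerate(vectors):
        best_sim, best_j = 0.0, None
        for j in range(i):
            sim = cosine(vec, vectors[j])
            if sim > best_sim:
                best_sim, best_j = sim, j
        if best_j is not None and best_sim >= threshold:
            labels.append(labels[best_j])
        else:
            labels.append(max(labels, default=-1) + 1)
    return labels


def global_score(system_score, user_ratings, feedback_weight=0.3):
    """Blend a system-computed score with the mean user rating.
    The linear blend and its weight are assumptions, not the article's formula."""
    if not user_ratings:
        return system_score
    mean_rating = sum(user_ratings) / len(user_ratings)
    return (1 - feedback_weight) * system_score + feedback_weight * mean_rating


if __name__ == "__main__":
    headlines = [
        "Central bank raises interest rates",
        "Bank raises interest rates again",
        "Local team wins football final",
    ]
    print(group_headlines(headlines))
    print(global_score(0.6, [1.0, 0.5]))
```

In this sketch the first two headlines share most of their words and fall into the same group, while the third starts a new one; user ratings then shift the score that would drive the ordering of results.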


Keywords:

artificial intelligence, digital news, machine learning, text mining

Published
2024-06-30

CEVALLOS SALAS, F. A. (2024). DIGITAL NEWS CLASSIFICATION AND PUNCTUACTION USING MACHINE LEARNING AND TEXT MINING TECHNIQUES. Applied Computer Science, 20(2), 24–42. https://doi.org/10.35784/acs-2024-14

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.