DIGITAL NEWS CLASSIFICATION AND PUNCTUATION USING MACHINE LEARNING AND TEXT MINING TECHNIQUES
Fernando Andrés CEVALLOS SALAS
fcevallosepn@gmail.com
Escuela Politécnica Nacional (Ecuador)
https://orcid.org/0009-0002-5222-2599
Abstract
The persistent growth of information in recent decades, along with the development of new information technologies for managing it, has made it essential to build systems that can synthesize this massive volume of information, better known as big data. This article presents a feedback-based system for the massive processing of digital newspapers. The system synthesizes the most relevant information from news stories obtained from several sources. It is fed with information from the Internet using web scraping techniques, and all of this information is stored in a data lake implemented with NoSQL databases. Data processing then focuses on words, their relevance, and their correlation with other words from related content groups or headlines. To perform this grouping, machine learning techniques, namely a Large Language Model (LLM) and K-Nearest Neighbors (KNN), are used together with text mining techniques. New text mining algorithms are also developed to adjust thresholds during content aggregation and synthesis. Finally, a results visualization mechanism is presented that allows users to assign a score (punctuation) to the news stories. This user score acts as feedback to the system and is incorporated into the global score, which is the basis for displaying the results. The system can be useful for summarizing the information contained in news stories published on the Internet, giving users a fast way to stay informed.
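The grouping and feedback steps described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the bag-of-words vectors, the single-nearest-neighbor assignment, the `threshold` value, and the `feedback_weight` used to blend user scores into the global score are all assumptions chosen for illustration, and the LLM component is omitted.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a headline into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_headlines(headlines, threshold=0.5):
    """Assign each headline to the most similar existing group.

    Similarity is measured against the first headline of each group
    (a nearest-neighbor rule with k = 1). A headline whose best
    similarity is below `threshold` (hypothetical value) starts a new group.
    Returns a list of groups, each a list of headline indices.
    """
    vectors = [vectorize(h) for h in headlines]
    groups = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine(vec, vectors[group[0]])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            groups.append([i])
        else:
            best.append(i)
    return groups


def global_score(system_score, user_scores, feedback_weight=0.3):
    """Blend the system's score with the mean user score.

    `feedback_weight` is an assumed parameter, not taken from the article.
    With no user feedback, the system score is returned unchanged.
    """
    if not user_scores:
        return system_score
    feedback = sum(user_scores) / len(user_scores)
    return (1 - feedback_weight) * system_score + feedback_weight * feedback


headlines = [
    "central bank raises interest rates",
    "bank raises interest rates again",
    "local team wins football final",
]
groups = group_headlines(headlines)
```

Here the first two headlines share four of five words and fall into one group, while the third starts its own. Raising `threshold` makes groups stricter, which is the kind of adjustment the abstract attributes to its threshold-tuning algorithms.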
Keywords:
artificial intelligence, digital news, machine learning, text mining
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.