DIGITAL NEWS CLASSIFICATION AND PUNCTUATION USING MACHINE LEARNING AND TEXT MINING TECHNIQUES
Fernando Andrés CEVALLOS SALAS
fcevallosepn@gmail.com
Escuela Politécnica Nacional (Ecuador)
https://orcid.org/0009-0002-5222-2599
Abstract
The persistent growth of information in recent decades, along with the development of new information technologies for its management, has made it essential to develop systems that can synthesize this massive volume of information, better known as big data. This article presents a feedback-based system for the massive processing of digital newspapers. The system synthesizes the most relevant information from news stories obtained from several sources. It is fed with information from the Internet using web scraping techniques, and all of this information is stored in a data lake implemented with NoSQL databases. Next, data processing is performed, focusing on words, their relevance, and their correlation with other words from related content groups or headlines. To perform this grouping, a machine learning Large Language Model (LLM), K-Nearest Neighbors (KNN), and text mining techniques are used. New text mining algorithms are also developed to adjust thresholds during content aggregation and synthesis. Finally, a results visualization mechanism is presented that allows users to assign a score (punctuation) to the news stories. This user score acts as feedback for the system and is incorporated into the global score, which is the basis for displaying the results. The system can be useful for summarizing the information contained in news stories available on the Internet, giving users a fast way to stay informed.
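The grouping and feedback steps described in the abstract can be illustrated with a minimal sketch. It is not the system's actual algorithm: the word-count cosine similarity, the greedy nearest-neighbor grouping, the `threshold` and `feedback_weight` values, and all function names are assumptions chosen only to show the general idea of threshold-based aggregation and of blending user feedback into a global score.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a headline and keep alphabetic words longer than two letters."""
    return [w for w in text.lower().split() if w.isalpha() and len(w) > 2]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_headlines(headlines, threshold=0.3):
    """Greedy grouping (hypothetical): a headline joins the group that holds
    its most similar neighbor if that similarity reaches the threshold;
    otherwise it starts a new group. Returns lists of headline indices."""
    vectors = [Counter(tokenize(h)) for h in headlines]
    groups = []
    for i, vec in enumerate(vectors):
        best_group, best_sim = None, threshold
        for group in groups:
            sim = max(cosine(vec, vectors[j]) for j in group)
            if sim >= best_sim:
                best_group, best_sim = group, sim
        if best_group is None:
            groups.append([i])
        else:
            best_group.append(i)
    return groups


def global_score(system_score, user_ratings, feedback_weight=0.5):
    """Blend a system relevance score with the mean user rating.
    Both are assumed to lie on the same scale (for example 0 to 1)."""
    if not user_ratings:
        return system_score
    feedback = sum(user_ratings) / len(user_ratings)
    return (1 - feedback_weight) * system_score + feedback_weight * feedback


headlines = [
    "Central bank raises interest rates again",
    "Interest rates raised by central bank",
    "Local team wins football final",
]
groups = group_headlines(headlines)
blended = global_score(0.8, [1.0, 0.5])
```

Here the first two headlines share most of their words and end up in one group, while the third starts its own. Raising `threshold` produces more, smaller groups; lowering it merges more stories together.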
Keywords:
artificial intelligence, digital news, machine learning, text mining
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.