DIGITAL NEWS CLASSIFICATION AND PUNCTUATION USING MACHINE LEARNING AND TEXT MINING TECHNIQUES
Fernando Andrés CEVALLOS SALAS
fcevallosepn@gmail.com
Escuela Politécnica Nacional (Ecuador)
https://orcid.org/0009-0002-5222-2599
Abstract
The persistent growth of information in recent decades, along with the development of new information technologies for managing it, has made it essential to develop systems that can synthesize this massive volume of information, better known as big data. This article presents a feedback-based system for the large-scale processing of digital newspapers. The system synthesizes the most relevant information from news stories obtained from several sources. It is fed with information from the Internet using web scraping techniques, and all of this information is stored in a data lake implemented with NoSQL databases. Next, data processing is performed, focusing on words, their relevance, and their correlation with other words from related content groups or headlines. To perform this grouping, machine learning techniques, namely a Large Language Model (LLM) and K-Nearest Neighbors (KNN), are used together with text mining techniques. New text mining algorithms are also developed to adjust thresholds during content aggregation and synthesis. Finally, a results visualization mechanism is presented that allows users to score the news stories. This user score acts as feedback to the system and is incorporated into the global score (punctuation), which is the basis for displaying the results. The system can be useful for summarizing the information contained in news stories available on the Internet, giving users a fast way to stay informed.
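The grouping and feedback steps described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the bag-of-words cosine similarity, the greedy nearest-group assignment, the `threshold` value, and the `global_score` blending formula with its `weight` parameter are all assumptions chosen only to make the idea concrete.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a headline and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_headlines(headlines, threshold=0.4):
    """Greedy grouping (assumed scheme): attach each headline to the group
    whose first member is most similar to it, provided the similarity
    reaches the threshold; otherwise start a new group."""
    groups = []  # each group is a list of (headline, vector) pairs
    for headline in headlines:
        vec = Counter(tokenize(headline))
        best, best_sim = None, 0.0
        for group in groups:
            sim = cosine(vec, group[0][1])
            if sim > best_sim:
                best, best_sim = group, sim
        if best is not None and best_sim >= threshold:
            best.append((headline, vec))
        else:
            groups.append([(headline, vec)])
    return [[h for h, _ in group] for group in groups]


def global_score(system_score, user_ratings, weight=0.5):
    """Blend the system's score with the mean user rating.
    Hypothetical formula; the abstract does not specify how feedback
    is combined with the system score."""
    if not user_ratings:
        return system_score
    feedback = sum(user_ratings) / len(user_ratings)
    return (1 - weight) * system_score + weight * feedback


if __name__ == "__main__":
    sample = [
        "Central bank raises interest rates",
        "Bank raises interest rates again",
        "Local team wins championship final",
    ]
    print(group_headlines(sample))
    print(global_score(0.6, [1.0, 0.8]))
```

In this sketch, raising `threshold` produces more, smaller groups, which is one plausible reading of the threshold adjustment the abstract mentions for content aggregation.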
Keywords:
artificial intelligence, digital news, machine learning, text mining
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.