Classification Performance Comparison of BERT and IndoBERT on Self-Report of COVID-19 Status on Social Media
Issue Vol. 30 (2024)
Irwan Budiman, Mohammad Reza Faisal, Astina Faridhah, Andi Farmadi, Muhammad Itqan Mazdadi, Triando Hamonangan Saragih, Friska Abadi
Pages 61-67
Abstract
Messages shared on social media platforms such as X can be automatically categorized into two groups: those in which the author self-reports a COVID-19 status and those in which the author does not. It is essential to note, however, that these messages cannot serve as a reliable monitoring tool for tracking the spread of the COVID-19 pandemic. Social media messages can be categorized with classification algorithms. Many deep learning-based algorithms, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM), have been used for text classification. However, CNN has limitations in understanding global context, while LSTM focuses more on word-by-word sequences, and both require a large amount of training data. Bidirectional Encoder Representations from Transformers (BERT) is a more recent text classification algorithm that can cover these shortcomings, and many BERT variants now exist. The primary objective of this study was to compare the effectiveness of two classification models, BERT and IndoBERT, in identifying messages that self-report COVID-19 status. Both models were evaluated on raw and on preprocessed text data from X. The findings show that the IndoBERT model performed better, achieving an accuracy of 94%, whereas the BERT model achieved 82%.
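The comparison of raw and preprocessed text, scored by accuracy, can be illustrated with a minimal sketch. The abstract does not list the preprocessing steps used in the study, so the cleaning rules below (lowercasing, removing URLs and mentions, stripping punctuation) are assumptions chosen for illustration, and the function names `preprocess` and `accuracy` are hypothetical. The labels in the usage note are toy values, not data from the study.

```python
import re


def preprocess(text: str) -> str:
    """Clean a social media message (assumed steps, not the study's pipeline)."""
    text = text.lower()                          # normalize case
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"@\w+", " ", text)            # remove user mentions
    text = text.replace("#", "")                 # keep hashtag words, drop '#'
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # drop punctuation and emoji
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

For example, `preprocess("Saya POSITIF covid-19 :( @dokter https://t.co/abc #Isoman")` yields `"saya positif covid 19 isoman"`, and `accuracy([1, 0, 1, 1], [1, 0, 0, 1])` yields `0.75` for these toy labels. In a full experiment, each model would be fine-tuned and scored once on the raw messages and once on the output of `preprocess`.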
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
