Classification Performance Comparison of BERT and IndoBERT on SelfReport of COVID-19 Status on Social Media

Irwan Budiman


Lambung Mangkurat University (Indonesia)

Mohammad Reza Faisal

reza.faisal@ulm.ac.id
Lambung Mangkurat University (Indonesia)

Astina Faridhah


Lambung Mangkurat University (Indonesia)

Andi Farmadi


Lambung Mangkurat University (Indonesia)

Muhammad Itqan Mazdadi


Lambung Mangkurat University (Indonesia)

Triando Hamonangan Saragih


Lambung Mangkurat University (Indonesia)

Friska Abadi


Lambung Mangkurat University (Indonesia)

Abstract

Messages shared on social media platforms such as X are automatically categorized into two groups: those that self-report COVID-19 status and those that do not. However, it is essential to note that these messages cannot serve as a reliable monitoring tool for tracking the spread of the COVID-19 pandemic. Social media messages can be categorized with classification algorithms. Many deep learning-based algorithms, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM), have been used for text classification. However, CNN is limited in its ability to capture global context, while LSTM focuses on word-by-word sequences, and both require large amounts of training data. Bidirectional Encoder Representations from Transformers (BERT) is a more recent text classification algorithm that can cover these shortcomings, and many BERT variants now exist. The primary objective of this study was to compare the effectiveness of two classification models, BERT and IndoBERT, in identifying messages that self-report COVID-19 status. Both models were evaluated using raw and preprocessed text data from X. The findings showed that the IndoBERT model performed better, achieving an accuracy of 94%, whereas the BERT model achieved 82%.
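The comparison of raw and preprocessed input described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the cleaning steps or the evaluation code, so everything here is an assumption for illustration: the `preprocess` rules (lowercasing, removing URLs and mentions), the `evaluate` helper, and the toy keyword classifier that stands in for a fine-tuned BERT or IndoBERT model.

```python
import re

def preprocess(text: str) -> str:
    """Hypothetical cleaning step (not from the paper): lowercase the text,
    drop URLs and @mentions, and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def evaluate(classify, texts, labels) -> dict:
    """Score one classifier on the raw texts and on their preprocessed form.
    `classify` is any callable mapping a string to a label."""
    raw_preds = [classify(t) for t in texts]
    clean_preds = [classify(preprocess(t)) for t in texts]
    return {
        "raw": accuracy(labels, raw_preds),
        "preprocessed": accuracy(labels, clean_preds),
    }

# Toy stand-in for a fine-tuned model: label 1 means "self-report".
def toy_classifier(text: str) -> int:
    return 1 if "positif" in text else 0

texts = ["Saya POSITIF covid https://t.co/x @teman", "cuaca cerah hari ini"]
labels = [1, 0]
scores = evaluate(toy_classifier, texts, labels)
print(scores)
```

In a real experiment, `toy_classifier` would be replaced by the prediction function of each fine-tuned model, and the same `evaluate` call would be repeated for BERT and IndoBERT so that the four accuracy figures are directly comparable.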


Keywords:

Text Classification, COVID-19 symptoms, Twitter, BERT, IndoBERT


Published
2024-03-20

Budiman, I., Faisal, M. R., Faridhah, A., Farmadi, A., Mazdadi, M. I., Saragih, T. H., & Abadi, F. (2024). Classification Performance Comparison of BERT and IndoBERT on SelfReport of COVID-19 Status on Social Media. Journal of Computer Sciences Institute, 30, 61–67. https://doi.org/10.35784/jcsi.5564
