Classification Performance Comparison of BERT and IndoBERT on SelfReport of COVID-19 Status on Social Media

Irwan Budiman; Mohammad Reza Faisal; Astina Faridhah; Andi Farmadi; Muhammad Itqan Mazdadi; Triando Hamonangan Saragih; Friska Abadi

doi:10.35784/jcsi.5564

Published: Mar 20, 2024

Issue Vol. 30 (2024)

Authors

Irwan Budiman

irwan.budiman@ulm.ac.id

Lambung Mangkurat University, Indonesia

Mohammad Reza Faisal

reza.faisal@ulm.ac.id

Lambung Mangkurat University, Indonesia

Astina Faridhah

1911016120003@mhs.ulm.ac.id

Lambung Mangkurat University, Indonesia

Andi Farmadi

andifarmadi@ulm.ac.id

Lambung Mangkurat University, Indonesia

Muhammad Itqan Mazdadi

mazdadi@ulm.ac.id

Lambung Mangkurat University, Indonesia

Triando Hamonangan Saragih

triando.saragih@ulm.ac.id

Lambung Mangkurat University, Indonesia

Friska Abadi

friska.abadi@ulm.ac.id

Lambung Mangkurat University, Indonesia

Abstract

Messages shared on social media platforms like X can be automatically categorized into two groups: those that self-report COVID-19 status and those that do not. It is essential to note, however, that these messages cannot serve as a reliable monitoring tool for tracking the spread of the COVID-19 pandemic. Social media messages can be categorized with classification algorithms. Many deep learning-based algorithms, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM), have been used for text classification. However, CNN has limitations in understanding global context, while LSTM focuses more on understanding word-by-word sequences, and both require a large amount of training data. Bidirectional Encoder Representations from Transformers (BERT) is a more recent text classification algorithm that addresses the shortcomings of these earlier algorithms, and many BERT variants now exist. The primary objective of this study was to compare the effectiveness of two classification models, BERT and IndoBERT, in identifying self-report messages of COVID-19 status. Both models were evaluated on raw and preprocessed text data from X. The findings show that the IndoBERT model performed better, achieving an accuracy of 94%, whereas the BERT model achieved 82%.
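The comparison reported in the abstract rests on accuracy, the share of messages whose predicted class matches the annotated class. The sketch below shows how such a comparison could be computed for two classifiers on the same test set. It is a minimal illustration, not the authors' evaluation code: the function name `accuracy`, the label encoding (1 = self-report, 0 = not a self-report), and the toy labels are all assumptions made for the example and do not reproduce the reported 94% and 82%.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy data (assumption): 1 = self-report of COVID-19 status,
# 0 = not a self-report. These are NOT the study's data.
gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
model_a_pred = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 9 of 10 match gold
model_b_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]  # 7 of 10 match gold

if __name__ == "__main__":
    print(f"model A accuracy: {accuracy(gold, model_a_pred):.2f}")
    print(f"model B accuracy: {accuracy(gold, model_b_pred):.2f}")
```

In practice the predictions would come from each fine-tuned model applied to the same held-out messages, once with raw text and once with preprocessed text, so that the four accuracy values are directly comparable.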

Keywords:

Text Classification, Covid-19 symptoms, Twitter, BERT, IndoBERT

Budiman, I., Faisal, M. R., Faridhah, A., Farmadi, A., Mazdadi, M. I., Saragih, T. H., & Abadi, F. (2024). Classification Performance Comparison of BERT and IndoBERT on SelfReport of COVID-19 Status on Social Media. Journal of Computer Sciences Institute, 30, 61–67. https://doi.org/10.35784/jcsi.5564