Machine learning approach to detect GAI-disguised academic programming plagiarism
Issue Vol. 22 No. 2 (2026)
Oscar KARNALIM, Yehezkiel David SETIAWAN, Maresha Caroline WIJANTO, Rossevine Artha NATHASYA
Pages 208-224
Authors
oscar.karnalim@it.maranatha.edu
Abstract
Plagiarism is a common issue in programming education, and the emergence of Generative Artificial Intelligence (GAI) has exacerbated it. Acts of plagiarism can be disguised with GAI, which results in pervasive, consistent changes across the entire program. We present a programming plagiarism detector dedicated to GAI disguises. It relies not only on program similarities but also on GAI characteristics, since GAI has its own way of writing programs. The detector employs 23 features: five are structural (program similarities), while the remaining 18 capture GAI characteristics (the use of list comprehension, recursion, etc.). It offers seven machine learning models to choose from: Logistic Regression, Random Forest, XGBoost, LightGBM, CatBoost, Voting Classifier, and Stacking Classifier. In our evaluation on 6344 instances from the machine intelligence course, the Stacking Classifier achieves the highest performance, with 89.17% accuracy, 88.94% precision, 89.17% recall, and 88.77% F-score. It outperforms similarity-based plagiarism detectors, which serve as the baseline, by a factor of 2 in most metrics. Our machine learning models consider all structural features (program similarities) important, along with several GAI-characteristic features. The most prominent GAI characteristics are the use of list comprehension, recursion, and branching condition statements without parentheses.
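To make the idea of GAI-characteristic features more concrete, the following is a minimal sketch of how three of the characteristics named in the abstract (list comprehension, recursion, and branching conditions without parentheses) might be counted for a Python submission. It is not the authors' feature extractor: the function name `extract_style_features`, the feature names, and the choice of Python's standard `ast` module are all assumptions made for illustration, and only direct recursion (a function calling itself by name) is detected.

```python
import ast


def extract_style_features(source: str) -> dict:
    """Count a few stylistic traits in Python source (illustrative only)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    features = {
        "list_comprehensions": 0,
        "recursive_functions": 0,
        "bare_conditions": 0,           # e.g. `if x > 0:`
        "parenthesized_conditions": 0,  # e.g. `if (x > 0):`
    }
    for node in ast.walk(tree):
        if isinstance(node, ast.ListComp):
            features["list_comprehensions"] += 1
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Direct recursion: the function body calls its own name.
            calls_itself = any(
                isinstance(call, ast.Call)
                and isinstance(call.func, ast.Name)
                and call.func.id == node.name
                for call in ast.walk(node)
            )
            features["recursive_functions"] += int(calls_itself)
        elif isinstance(node, (ast.If, ast.While)):
            # The AST drops redundant parentheses, so inspect the source text
            # immediately before the condition (assumes ASCII source, because
            # col_offset is measured in UTF-8 bytes).
            test = node.test
            prefix = lines[test.lineno - 1][: test.col_offset].rstrip()
            if prefix.endswith("("):
                features["parenthesized_conditions"] += 1
            else:
                features["bare_conditions"] += 1
    return features


# Hypothetical submission used only to exercise the sketch.
SAMPLE = '''
def fact(n):
    if n <= 1:
        return 1
    return n * fact(n - 1)

def evens(xs):
    return [x for x in xs if x % 2 == 0]

def sign(x):
    if (x > 0):
        return 1
    elif x < 0:
        return -1
    return 0
'''

print(extract_style_features(SAMPLE))
```

Counts like these could be placed alongside similarity scores in a feature vector and passed to any of the classifiers listed in the abstract. How the paper actually encodes and scales its 23 features is not stated here.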
Keywords:
Sustainable Development Goals (SDG)
- 4 - Quality education
- 16 - Peace, justice and strong institutions
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.
