DETECTION OF SOURCE CODE IN INTERNET TEXTS USING AUTOMATICALLY GENERATED MACHINE LEARNING MODELS
Marcin BADUROWICZ
Lublin University of Technology, Faculty of Electrical Engineering and Computer Science, Department of Computer Science, Lublin (Poland)
Abstract
This paper presents the results of web scraping software that automatically classifies source code. The system was built for a discussion forum for software developers, to find fragments of source code that were published without being marked as code snippets. The analyzer uses a machine learning binary classification model to differentiate between programming language source code and highly technical text about software. The model was prepared using an AutoML subsystem, without human intervention or fine-tuning, and its accuracy on the described problem exceeds 95%. The analyzer based on the automatically generated model has been deployed, and after the first year of continuous operation its False Positive Rate is below 3%. A similar process may be introduced in document management within the software development process, where automatic tagging of and search for code or pseudo-code may be useful for archiving purposes.
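The two figures quoted above, accuracy and False Positive Rate, follow from the confusion counts of a binary classifier. The sketch below shows one way they could be computed, assuming that the positive class means "source code" and the negative class means "technical text". The function names and the toy labels are hypothetical illustrations and are not data or code from the paper.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, int]:
    """Count TP, FP, TN and FN, where True means 'source code' (assumed convention)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1
        elif not predicted and not actual:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(c: Dict[str, int]) -> float:
    """Share of all posts that were classified correctly."""
    total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
    return (c["tp"] + c["tn"]) / total if total else 0.0


def false_positive_rate(c: Dict[str, int]) -> float:
    """Share of plain technical text that was wrongly flagged as code."""
    negatives = c["fp"] + c["tn"]
    return c["fp"] / negatives if negatives else 0.0


# Hypothetical toy labels, for illustration only.
y_true = [True, True, False, False, False, True, False, False]
y_pred = [True, False, False, True, False, True, False, False]

counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(counts), false_positive_rate(counts))
```

For a bot that comments on forum posts, the False Positive Rate is the share of ordinary technical posts that would be wrongly flagged as unmarked code, which is why it is reported separately from accuracy.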
Keywords:
source code, binary classification, text classification, AutoML
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.