DETECTION OF SOURCE CODE IN INTERNET TEXTS USING AUTOMATICALLY GENERATED MACHINE LEARNING MODELS

Marcin BADUROWICZ

doi:10.35784/acs-2022-7

Published: Mar 30, 2022

Authors

Marcin BADUROWICZ

m.badurowicz@pollub.pl

Lublin University of Technology, Faculty of Electrical Engineering and Computer Science, Department of Computer Science, Lublin

Abstract

This paper presents the results of web-scraping software that automatically classifies source code. The system was built for a discussion forum for software developers, to find fragments of source code that were published without being marked as code snippets. The analyzer uses a machine learning binary classification model to distinguish programming-language source code from highly technical text about software. The model was generated by an AutoML subsystem without human intervention or fine-tuning, and its accuracy on the described problem exceeds 95%. The analyzer based on the automatically generated model has been deployed, and after its first year of continuous operation its False Positive Rate is below 3%. A similar process could be introduced in document management within the software development process, where automatic tagging of, and search for, code or pseudo-code may be useful for archiving purposes.
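The two figures quoted in the abstract, accuracy and False Positive Rate, are standard binary-classification metrics. The sketch below shows how they could be computed from labelled predictions. It is illustrative only: the function names and the convention that `True` means "post contains source code" are assumptions made here, not part of the paper or of the deployed analyzer.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn). True is assumed to mean 'contains source code'."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            tp += 1  # code correctly flagged as code
        elif predicted and not actual:
            fp += 1  # technical prose wrongly flagged as code
        elif not predicted and not actual:
            tn += 1  # technical prose correctly left alone
        else:
            fn += 1  # code that was missed
    return tp, fp, tn, fn


def accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Fraction of all posts that were classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no samples given")
    return (tp + tn) / total


def false_positive_rate(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Fraction of non-code posts that were wrongly flagged as code: FP / (FP + TN)."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    if fp + tn == 0:
        raise ValueError("no negative samples, false positive rate is undefined")
    return fp / (fp + tn)
```

For a forum bot that reminds users to mark their code, a false positive means nagging someone whose post contained no code at all, which is why the False Positive Rate is a natural figure to report alongside accuracy.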

Keywords:

source code, binary classification, text classification, AutoML

BADUROWICZ, M. (2022). DETECTION OF SOURCE CODE IN INTERNET TEXTS USING AUTOMATICALLY GENERATED MACHINE LEARNING MODELS. Applied Computer Science, 18(1), 89–98. https://doi.org/10.35784/acs-2022-7
