Machine learning in big data: A performance benchmarking study of Flink-ML and Spark MLlib
Issue Vol. 21 No. 2 (2025)
Messaoud MEZATI, Ines AOURIA, pp. 18-27
Authors
mezati.messaoud@univ-ouargla.dz
Abstract
Machine learning (ML) on big data frameworks plays a critical role in real-time analytics, decision making, and predictive modeling. Two of the most prominent ML libraries for large-scale data processing are Flink-ML, the machine learning extension of Apache Flink, and MLlib, the machine learning library of Apache Spark. This paper compares the two frameworks, evaluating their performance, scalability, streaming capabilities, iterative computation efficiency, and ease of integration with external deep learning frameworks. Flink-ML is designed for real-time, event-driven ML applications and natively supports streaming-based model training and inference. Spark MLlib, in contrast, is optimized for batch processing and micro-batch streaming, which makes it better suited to traditional machine learning workflows. In the experiments, training time is nearly identical for the two frameworks: 4006.4 seconds for Spark MLlib and 4003.2 seconds for Flink-ML, indicating comparable efficiency in batch training and streaming-based model updates. Flink-ML reaches slightly higher accuracy (74.9%) than Spark MLlib (74.7%), suggesting that continuous learning in Flink-ML may contribute to better generalization. Inference throughput is slightly higher for Spark MLlib (8.4 images/sec) than for Flink-ML (8.2 images/sec), suggesting that Spark's batch execution gives a small advantage in processing efficiency. Both frameworks show the same memory usage (30.2%), indicating that TensorFlow's deep learning operations, rather than architectural differences between Spark and Flink, dominate resource consumption. These results highlight the trade-offs between Flink-ML and Spark MLlib and can guide data scientists and engineers in selecting a framework based on their ML workflow requirements and scalability considerations.
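To put the reported figures on a common scale, the short Python sketch below expresses each Flink-ML value as a percentage difference from the corresponding Spark MLlib value. It is only an illustration built on the four pairs of numbers quoted in the abstract. The dictionary layout and the names `REPORTED`, `relative_gap_pct`, and `summarize` are assumptions made for this example and do not come from the paper or from either framework.

```python
# Figures copied from the abstract; the structure and names are illustrative only.
REPORTED = {
    "training_time_s": {"spark_mllib": 4006.4, "flink_ml": 4003.2},
    "accuracy_pct": {"spark_mllib": 74.7, "flink_ml": 74.9},
    "throughput_img_per_s": {"spark_mllib": 8.4, "flink_ml": 8.2},
    "memory_pct": {"spark_mllib": 30.2, "flink_ml": 30.2},
}


def relative_gap_pct(spark: float, flink: float) -> float:
    """Return (Flink-ML - Spark MLlib) as a percentage of the Spark MLlib value."""
    return (flink - spark) / spark * 100.0


def summarize(reported: dict = REPORTED) -> dict:
    """Map each metric name to its relative gap, rounded to two decimals."""
    return {
        metric: round(relative_gap_pct(v["spark_mllib"], v["flink_ml"]), 2)
        for metric, v in reported.items()
    }


if __name__ == "__main__":
    for metric, gap in summarize().items():
        # A negative gap means the Flink-ML value is lower than the Spark MLlib value.
        print(f"{metric}: {gap:+.2f}%")
```

Read this way, the training-time gap is under a tenth of a percent, while the throughput gap is a little over two percent. Whether gaps of that size are meaningful depends on run-to-run variance, which the abstract does not report.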
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.
