Machine learning in big data: A performance benchmarking study of Flink-ML and Spark MLlib

Messaoud MEZATI; Ines AOURIA

doi:10.35784/acs_7297

Published: Jun 28, 2025

Issue Vol. 21 No. 2 (2025)

Authors

Messaoud MEZATI

mezati.messaoud@univ-ouargla.dz

Kasdi Merbah University, Algeria

https://orcid.org/0009-0001-1996-5625

Ines AOURIA

aouria.ines@univ-ouargla.dz

Kasdi Merbah University, Algeria

https://orcid.org/0009-0004-2593-2979

Abstract

Machine learning (ML) in big data frameworks plays a critical role in real-time analytics, decision-making, and predictive modeling. Among the most prominent ML libraries for large-scale data processing are Flink-ML, the machine learning extension of Apache Flink, and MLlib, the machine learning library of Apache Spark. This paper provides a comparative analysis of these two frameworks, evaluating their performance, scalability, streaming capabilities, iterative computation efficiency, and ease of integration with external deep learning frameworks. Flink-ML is designed for real-time, event-driven ML applications and provides native support for streaming-based model training and inference. In contrast, Spark MLlib is optimized for batch processing and micro-batch streaming, making it more suitable for traditional machine learning workflows. Experimental results show that training time is nearly identical for the two frameworks, with Spark MLlib requiring 4006.4 seconds and Flink-ML 4003.2 seconds, demonstrating comparable efficiency in batch training and streaming-based model updates. Accuracy results show that Flink-ML (74.9%) slightly outperforms Spark MLlib (74.7%), suggesting that continuous learning in Flink-ML may contribute to better generalization. Inference throughput is slightly higher for Spark MLlib (8.4 images/sec) than for Flink-ML (8.2 images/sec), suggesting that Spark's batch execution provides a slight advantage in processing efficiency. Both frameworks use the same share of memory (30.2%), confirming that TensorFlow's deep learning operations, rather than architectural differences between Spark and Flink, dominate resource consumption. These results highlight the trade-offs between Flink-ML and Spark MLlib and can guide data scientists and engineers in selecting the appropriate framework based on specific ML workflow requirements and scalability considerations.
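
The size of the reported gaps is easier to judge as relative differences. The following is a minimal sketch in Python that only restates the figures given in the abstract and expresses each Flink-ML value relative to the Spark MLlib value. The dictionary layout and the helper names `relative_diff_percent` and `summarize` are illustrative assumptions and are not taken from the study.

```python
# Figures copied from the abstract; key names and layout are illustrative only.
REPORTED = {
    "training_time_s": {"spark_mllib": 4006.4, "flink_ml": 4003.2},
    "accuracy_pct": {"spark_mllib": 74.7, "flink_ml": 74.9},
    "throughput_img_per_s": {"spark_mllib": 8.4, "flink_ml": 8.2},
    "memory_pct": {"spark_mllib": 30.2, "flink_ml": 30.2},
}


def relative_diff_percent(flink: float, spark: float) -> float:
    """Difference of Flink-ML relative to Spark MLlib, in percent of the Spark value."""
    return (flink - spark) / spark * 100.0


def summarize(reported: dict) -> dict:
    """Return the relative difference (Flink-ML vs. Spark MLlib) for each metric."""
    return {
        metric: relative_diff_percent(values["flink_ml"], values["spark_mllib"])
        for metric, values in reported.items()
    }


if __name__ == "__main__":
    # Print one signed percentage per metric, e.g. for a quick comparison table.
    for metric, diff in summarize(REPORTED).items():
        print(f"{metric}: {diff:+.2f}%")
```

On these figures, every gap is below 3% of the Spark MLlib value, with inference throughput showing the largest difference, which is in line with the abstract's description of the two frameworks as comparable.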

Keywords:

Machine Learning, Apache Spark, Apache Flink, Flink-ML, MLlib

References

MEZATI, M., & AOURIA, I. (2025). Machine learning in big data: A performance benchmarking study of Flink-ML and Spark MLlib. Applied Computer Science, 21(2), 18–27. https://doi.org/10.35784/acs_7297
