Analysis of data processing efficiency with use of Apache Hive and Apache Pig in Hadoop environment
Mikołaj Skrzypczyński
mikolaj.skrzypczynski@pollub.edu.pl
Lublin University of Technology (Poland)
Piotr Muryjas
Lublin University of Technology (Poland)
Abstract
The aim of this paper is to analyse the efficiency of data processing with Apache Hive and Apache Pig in a Hadoop environment. The analysis compares the two tools on a large data set of 28 million records. The study used scripts and queries written for Apache Hive and Apache Pig, which were executed 10 times in an environment provided by a purpose-built virtual machine. These methods were applied to the same data sets 16 times, according to previously prepared research scenarios. The authors conclude that Apache Hive is a more efficient tool than Apache Pig.
Keywords:
data processing, Apache Hive, Apache Pig, Hadoop
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.