Analysis of data processing efficiency with use of Apache Hive and Apache Pig in Hadoop environment

Mikołaj Skrzypczyński

mikolaj.skrzypczynski@pollub.edu.pl
Lublin University of Technology (Poland)

Piotr Muryjas


Lublin University of Technology (Poland)

Abstract

The aim of this paper is to analyse the efficiency of data processing with Apache Hive and Apache Pig in a Hadoop environment. The analysis compares the two tools on a large data set of 28 million records. The research used scripts and queries written for Apache Hive and Apache Pig, each executed 10 times in an environment provided by a purpose-built virtual machine. These methods were applied to the same data sets 16 times, according to previously prepared research scenarios. The authors conclude that Apache Hive is a more efficient tool than Apache Pig.


Keywords:

data processing, Apache Hive, Apache Pig, Hadoop

Published
2024-03-20

Skrzypczyński, M., & Muryjas, P. (2024). Analysis of data processing efficiency with use of Apache Hive and Apache Pig in Hadoop environment. Journal of Computer Sciences Institute, 30, 1–8. https://doi.org/10.35784/jcsi.4060
