Analyze the effectiveness of ETL processes implemented using SQL and Apache HiveQL languages

Krzysztof Litka

s99174@pollub.edu.pl
Lublin University of Technology (Poland)

Abstract

In the era of digitization, data is collected in ever-increasing quantities and must be processed efficiently. This article analyzes the performance of ETL processes implemented in SQL and HiveQL across scenarios of varying complexity, focusing on the execution time of individual queries. The tools used in the study are also discussed. The results for each language are summarized and compared, highlighting their strengths and weaknesses and identifying possible areas of application.


Keywords:

ETL, SQL, HiveQL

Published
2023-09-30

Litka, K. (2023). Analyze the effectiveness of ETL processes implemented using SQL and Apache HiveQL languages. Journal of Computer Sciences Institute, 28, 204–209. https://doi.org/10.35784/jcsi.3674
