20 posts tagged with "Hive"

Optimizing Query Compilation in Hive 4 on MR3

· 7 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

In our previous article, we evaluated the performance of Hive 4 on MR3 1.11 and Trino 453 on the 10TB TPC-DS benchmark. The results can be summarized as follows:

  • In terms of the total running time, Hive 4 on MR3 runs slightly faster than Trino -- Hive 4 on MR3 5744 seconds vs Trino 5798 seconds.
  • In terms of the geometric mean of running times, Trino responds about 15 percent faster than Hive 4 on MR3 -- Trino 17.99 seconds vs Hive 4 on MR3 21.02 seconds.
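The two metrics above can rank the same pair of systems differently, because total running time is dominated by the longest queries while the geometric mean weights every query equally in relative terms. The following is a minimal sketch of that arithmetic. The totals and geometric means are the figures quoted above; the per-query lists `sample_a` and `sample_b` are hypothetical values chosen only to illustrate the effect and are not benchmark data.

```python
import math

def geometric_mean(times):
    """Geometric mean of positive running times, computed via logarithms."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

def percent_faster(fast, slow):
    """How much shorter `fast` is than `slow`, as a percentage of `slow`."""
    return (slow - fast) / slow * 100.0

# Figures quoted in the text (seconds).
hive_total, trino_total = 5744, 5798
hive_geo, trino_geo = 21.02, 17.99

total_gap = percent_faster(hive_total, trino_total)  # Hive 4 on MR3 ahead on total time
geo_gap = percent_faster(trino_geo, hive_geo)        # Trino ahead on geometric mean

# Hypothetical per-query times (NOT benchmark data): system A is fast on
# short queries but has one very long query; system B is more uniform.
sample_a = [1.0, 1.0, 100.0]
sample_b = [5.0, 5.0, 60.0]

print(f"total gap: {total_gap:.2f}%  geometric-mean gap: {geo_gap:.2f}%")
print(f"A: total={sum(sample_a):.1f}  geomean={geometric_mean(sample_a):.2f}")
print(f"B: total={sum(sample_b):.1f}  geomean={geometric_mean(sample_b):.2f}")
```

In the hypothetical samples, B finishes sooner in total while A has the lower geometric mean, which is the same kind of split as in the reported results.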

Performance Evaluation of Trino and Hive on MR3 using the TPC-DS Benchmark

· 5 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

In our previous article, we evaluated the performance of Trino 418 and Hive on MR3 1.7 using the TPC-DS benchmark with a scale factor of 10TB. The results can be summarized as follows:

  • In terms of the total running time, the two systems are comparable: Trino 7424 seconds vs Hive on MR3 7415 seconds.
  • In terms of the geometric mean of running times, Trino is faster than Hive on MR3: Trino 21.75 seconds vs Hive on MR3 27.68 seconds.
  • Trino returns wrong answers on query 23 after running for 1756 seconds.
  • Trino fails to complete query 72 after running for 156 seconds.

Performance Tuning for Single-table Queries

· 5 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

In our previous article, we showed that Hive on MR3 1.7 runs much faster than Spark 3.4.0 on the TPC-DS benchmark with a scale factor of 10TB (7415 seconds vs 19669 seconds). The performance gap is expected to widen further due to improvements in Hive on MR3 1.8 (6867 seconds vs 7415 seconds). Still, there is a category of queries on which Hive on MR3 seems noticeably slower than Spark: single-table queries with no joins.

Why you should run Hive on Kubernetes, even in a Hadoop cluster

· 9 min read
Sungwoo Park
MR3 Architect and Developer

Hive and Presto

Hive and Presto have developed a tortoise-and-hare story over the past 8 years. Initially conceived at Facebook and open sourced in August 2008, Hive was hailed as a breakthrough in SQL-on-Hadoop technology and generally regarded as the de facto standard. Then in 2012, Facebook started to develop Presto as a replacement for Hive, which was considered too slow for their daily workload. As Facebook was specific about its goal in developing Presto, the future of Hive did not look so bright.

Testing MR3 - Principle and Practice

· 28 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

As an execution engine for big data processing, MR3 is a distributed system consisting of a single master (called DAGAppMaster) and multiple workers (called ContainerWorkers) running across the network. The master orchestrates the execution of workers and implements all the features required of a distributed system. Workers receive commands from the master and communicate with each other in order to transfer intermediate data. In this way, MR3 tries to maximize the utilization of cluster resources.