Why you should run Hive on Kubernetes, even in a Hadoop cluster

· 9 min read
Sungwoo Park
MR3 Architect and Developer

Hive and Presto

Hive and Presto have developed a tortoise-and-hare story over the past 8 years. Initially conceived at Facebook and open sourced in August 2008, Hive was hailed as a breakthrough in SQL-on-Hadoop technology and generally regarded as the de facto standard. Then in 2012, Facebook started to develop Presto as a replacement for Hive, which was considered too slow for their daily workload. As Facebook was specific about its goal in developing Presto, the future of Hive did not look so bright.

Testing MR3 - Principle and Practice

· 28 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

As an execution engine for big data processing, MR3 is a distributed system consisting of a single master (called DAGAppMaster) and multiple workers (called ContainerWorkers) running across the network. The master orchestrates the execution of workers and implements all the features required of a distributed system. Workers receive commands from the master and communicate with each other in order to transfer intermediate data. In this way, MR3 tries to maximize the utilization of cluster resources.

Hive vs SparkSQL: Hive-LLAP, Hive on MR3, SparkSQL 2.3.2

· 5 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

In our previous article published in October 2018, we used the TPC-DS benchmark to compare the performance of Hive-LLAP and SparkSQL 2.3.1 included in HDP 3.0.1, along with Hive 3.1.0 on MR3 0.4. In this article, we update the results by testing SparkSQL 2.3.2 included in HDP 3.1.4. As in the previous experiment, we use the TPC-DS benchmark.

Hive Performance: Hive-LLAP in HDP 3.1.4 vs Hive 3/4 on MR3 0.10

· 9 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

In our previous article published in October 2018, we used the TPC-DS benchmark to compare the performance of Hive-LLAP in HDP 3.0.1 (as well as HDP 2.6.4) and Hive 3 on MR3 0.4. We showed that Hive 3 on MR3 yields consistently higher throughput than Hive-LLAP in concurrency tests, but since then, the performance of Hive-LLAP has improved considerably for concurrent queries. Thus we are interested in the question of how Hive on MR3 compares with Hive-LLAP in the latest release of HDP.

Presto vs Hive on MR3 (Presto 317 vs Hive on MR3 0.10)

· 8 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

In our previous article, we used the TPC-DS benchmark to compare the performance of three SQL-on-Hadoop systems: Impala 2.12.0+cdh5.15.2+0, Presto 0.217, and Hive 3.1.1 on MR3 0.6. That article uses sequential tests to draw the following conclusions:

  • Impala runs faster than Hive on MR3 on short-running queries that take less than 10 seconds.
  • For long-running queries, Hive on MR3 runs slightly faster than Impala.
  • For most queries, Hive on MR3 runs faster than Presto, sometimes an order of magnitude faster.

Performance Evaluation of Impala, Presto, and Hive on MR3

· 9 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

In our previous article, we used the TPC-DS benchmark to compare the performance of five SQL-on-Hadoop systems: Hive-LLAP, Presto, SparkSQL, Hive on Tez, and Hive on MR3. As it uses both sequential tests and concurrency tests across three separate clusters, we believe that the performance evaluation is thorough and comprehensive enough to closely reflect the current state of the SQL-on-Hadoop landscape.

Performance Evaluation of SQL-on-Hadoop Systems using the TPC-DS Benchmark

· 18 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

We often ask questions about the performance of SQL-on-Hadoop systems:

  • How fast is Hive-LLAP in comparison with Presto, SparkSQL, or Hive on Tez?
  • As it is an MPP-style system, does Presto run the fastest if it successfully executes a query?
  • As it stores intermediate data in memory, does SparkSQL run much faster than Hive on Tez in general?
  • What is the best system for running concurrent queries?
  • ...

Hive on MR3 0.2 vs Hive-LLAP

· 12 min read
Sungwoo Park
MR3 Architect and Developer

Introduction

Hive running on top of MR3 0.2, or Hive-MR3 henceforth, supports LLAP (Low Latency Analytical Processing) I/O. In conjunction with the ability to execute multiple TaskAttempts concurrently inside a single ContainerWorker, the support for LLAP I/O makes Hive-MR3 functionally equivalent to Hive-LLAP. Hence Hive-MR3 can now serve as a substitute for Hive-LLAP in typical use cases.