Hive on MR3 on Kubernetes suffers from a performance penalty due to the use of Docker containers. This performance penalty is the price we pay for all the benefits of using Kubernetes such as easier deployment and higher security. Below we describe how to minimize the performance penalty when running Hive on MR3 on Kubernetes.

This page is work in progress and may be expanded in the future.

Page cache for Docker containers

The most significant change when migrating from Hadoop to Kubernetes is that page cache is now local to each Docker container. The user should consider this property of Docker containers before choosing 1) the memory limit of ContainerWorker Pods and 2) the size of Java heap of ContainerWorker processes.

In the case of Hadoop, all ContainerWorkers running on the same node share page cache which is managed by the operating system. Hence the size of Java heap can be close to the size of the memory allocated to a ContainerWorker process, as long as page cache can grow sufficiently large. For example, it may be okay to set the configuration key hive.mr3.container.max.java.heap.fraction even to 0.9f in hive-site.xml.

This is no longer the case when ContainerWorkers run as Pods on Kubernetes. The operating system maintains page cache for each ContainerWorker Pod, but by consuming the memory allocated to the ContainerWorker Pod itself. Since separate ContainerWorker Pods do not read common files very often, we can think of each ContainerWorker Pod as managing its own page cache. Hence, depending on the characteristics of the workload, the configuration key hive.mr3.container.max.java.heap.fraction should be adjusted to a sufficiently small value, like 0.7f or even 0.6f, to guarantee sufficiently large page cache for each ContainerWorker Pod. Otherwise the performance of Hive on MR3 can sometimes be noticeably slow because of thrashing.

hive.k8s.memory.page.cache

Location hints when using HDFS

In a Kubernetes cluster collocated with HDFS, Hive on MR3 can take advantage of location hints provided by HDFS. Specifically, when scheduling a mapper that is about to read an HDFS block, MR3 can check the location hint provided by HDFS and try to assign it to a ContainerWorker Pod running on the same physical node where the HDFS block is stored.

In order to exploit this feature, the user should set two configuration keys in mr3-site.xml:

  1. set mr3.convert.container.address.host.name to true.
  2. set mr3.k8s.host.aliases to mappings from host names to IP addresses.

As an example, suppose that an HDFS DataNode and a ContainerWorker Pod (with logical IP address 10.1.1.1 assigned by Kubernetes) are running on a node foo with physical IP address 192.168.100.1.

hive.k8s.hdfs.location.hint

For every HDFS block stored on the node foo, its location hint includes host name foo, but does not include IP address 192.168.100.1. On the other hand, the ContainerWorker Pod can retrieve its physical IP address 192.168.100.1 from Kubernetes, but is not aware of the corresponding host name foo. We can reconcile the mismatch by setting mr3.convert.container.address.host.name to true and mr3.k8s.host.aliases to foo=192.168.100.1.

Running shuffle handlers in a separate process

Depending on resources allocated to each mapper, reducer, and ContainerWorker, it may help to run shuffle handlers in a separate process inside the ContainerWorker Pod. For more details, see MR3 Shuffle Handler and Using the MR3 Shuffle Handler.

Deleting Vertex-local directories

By default, ContainerWorkers delete intermediate data after the completion of the query. The default behavior is acceptable when running Hive on MR3 on Hadoop, but it may affect the stability of Hive on MR3 on Kubernetes. For example, a ContainerWorker Pod may not have enough space for local directories where intermediate data is written, especially when running a query that produces a huge amount of intermediate data. Since intermediate data produced by early stage Vertexes may linger on page cache without being accessed again, the default behavior may negatively affect the utilization of page cache as well.

In order to deal with the potential problem of the default behavior, MR3 allows the user to delete intermediate data produced by a Vertex as soon as all its consumer Vertexes complete. For example, Map 1 and Map2 in the following diagram can delete their intermediate data right after Reduce 1 completes.

hive.k8s.delete.vertex.local

For running Hive on MR3, the user can set the configuration key hive.mr3.delete.vertex.local.directory to true, either in Beeline before submitting a query or in hive-site.xml before starting HiveServer2. Then HiveServer2 internally sets the MR3 configuration key mr3.am.notify.destination.vertex.complete to true and ContainerWorkers are notified of the completion of all consumer Vertexes for each individual Vertex.

Note that if the configuration key hive.mr3.delete.vertex.local.directory is set to true, fetch-failures may give rise to a cascade of Vertex reruns all the way up to leaf Vertexs. In the above diagram, suppose that after Reduce 1 finishes all its Tasks, a TaskAttempt of Reduce 2 reports a fetch-failure. Then a Vertex rerun occurs and Reduce 1 creates a new TaskAttempt. The new TaskAttempt, however, finds that no input data is available because both Map 1 and Map 2 have already deleted their intermediate data. As a consequence, Map 1 and Map 2 re-executes all their Tasks, which in turn cause Vertex reruns in their ancestors.

Hence the user should set the configuration key hive.mr3.delete.vertex.local.directory to true with caution only if 1) a ContainerWorker Pod does not have enough space for local directories (e.g., on Amazon Fargate), or 2) fetch-failures rarely occur.