Memory Settings
Map joins and output buffers
The following configuration keys are arguably the most important for performance tuning, especially when running complex queries with heavy joins.
The configuration key hive.auto.convert.join.noconditionaltask.size
in hive-site.xml
specifies the threshold for triggering map joins in Hive. Unlike Apache Hive, which internally applies an additional formula to adjust the value specified in hive-site.xml,
Hive on MR3 uses the value with no adjustment. Hence the user is advised to choose a value much larger than recommended for Apache Hive.
The configuration keys tez.runtime.io.sort.mb
and tez.runtime.unordered.output.buffer.size-mb
in tez-site.xml
specify the sizes of output buffers in Tez.
The following example shows sample values for these configuration keys.
vi conf/hive-site.xml
<property>
<name>hive.auto.convert.join.noconditionaltask.size</name>
<value>4000000000</value>
</property>
vi conf/tez-site.xml
<property>
<name>tez.runtime.io.sort.mb</name>
<value>1040</value>
</property>
<property>
<name>tez.runtime.unordered.output.buffer.size-mb</name>
<value>307</value>
</property>
The default values can be overridden for each individual query inside Beeline connections, as shown below.
0: jdbc:hive2://192.168.10.1:9852/> set hive.auto.convert.join.noconditionaltask.size=2000000000;
0: jdbc:hive2://192.168.10.1:9852/> set tez.runtime.io.sort.mb=2000;
0: jdbc:hive2://192.168.10.1:9852/> set tez.runtime.unordered.output.buffer.size-mb=600;
0: jdbc:hive2://192.168.10.1:9852/> !run /home/hive/sample.sql
Setting tez.runtime.io.sort.mb
to a large value may result in an OutOfMemoryError
when the Java VM cannot allocate a contiguous memory segment for the output buffer.
Caused by: java.lang.OutOfMemoryError: Java heap space
...
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.allocateSpace(PipelinedSorter.java:269)
In such a case, the user can try a smaller value for tez.runtime.io.sort.mb.
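For example, the user can lower the value for an individual query inside a Beeline connection before re-running it. The value 1040 below (taken from the earlier tez-site.xml example) is only illustrative:
0: jdbc:hive2://192.168.10.1:9852/> set tez.runtime.io.sort.mb=1040;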
Using free memory to store shuffle input
Hive on MR3 can utilize memory to store shuffle input data
that has been transmitted from upstream mappers.
If a ContainerWorker runs multiple Tasks concurrently,
a Task may find free memory in the Java heap
even after exhausting the entire memory allocated to it.
The configuration key tez.runtime.use.free.memory.fetched.input
in tez-site.xml
(which is set to true in the MR3 release)
controls whether or not to use such free memory to store shuffle input data.
If there is not enough free memory available, shuffle input data is written to local disks.
Setting tez.runtime.use.free.memory.fetched.input
to true can significantly reduce the execution time of a query if:
- Multiple Tasks can run concurrently in a ContainerWorker (e.g., 18 Tasks in a single ContainerWorker).
- The query produces more shuffle data than can be accommodated in memory.
- Local disks use slow storage (such as HDDs).
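Although tez.runtime.use.free.memory.fetched.input is already set to true in the MR3 release, the following sketch shows how to set it explicitly in tez-site.xml (or disable it by using false):
vi conf/tez-site.xml
<property>
<name>tez.runtime.use.free.memory.fetched.input</name>
<value>true</value>
</property>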
The following log of HiveServer2 shows that during the execution of a query, 689,693,191 bytes of shuffle data are stored in memory, with no data being written to local disks. (63,312,923 bytes of data are read directly from local disks because mappers and reducers are collocated.)
2023-12-15T06:59:16,028 INFO [HiveServer2-Background-Pool: Thread-84] mr3.MR3Task: SHUFFLE_BYTES_TO_MEM: 689693191
2023-12-15T06:59:16,028 INFO [HiveServer2-Background-Pool: Thread-84] mr3.MR3Task: SHUFFLE_BYTES_TO_DISK: 0
2023-12-15T06:59:16,028 INFO [HiveServer2-Background-Pool: Thread-84] mr3.MR3Task: SHUFFLE_BYTES_DISK_DIRECT: 63312923
Depending on the amount of shuffle data and the size of memory allocated to individual Tasks, Hive on MR3 may write shuffle data to local disks. In the following example, 10,475,065,599 bytes of shuffle data are written to local disks.
2023-12-15T07:09:57,472 INFO [HiveServer2-Background-Pool: Thread-99] mr3.MR3Task: SHUFFLE_BYTES_TO_MEM: 142177894794
2023-12-15T07:09:57,472 INFO [HiveServer2-Background-Pool: Thread-99] mr3.MR3Task: SHUFFLE_BYTES_TO_DISK: 10475065599
2023-12-15T07:09:57,472 INFO [HiveServer2-Background-Pool: Thread-99] mr3.MR3Task: SHUFFLE_BYTES_DISK_DIRECT: 13894557846
Using free memory to store shuffle output
Hive on MR3 can also utilize free memory to store shuffle output data
that is to be transmitted to downstream reducers.
The configuration key tez.runtime.use.free.memory.writer.output
in tez-site.xml
(which is set to true in the MR3 release)
controls whether or not to use free memory to store shuffle output data.
If there is not enough free memory available, shuffle output data is spilled to local disks.
In comparison with storing shuffle input data in memory, storing shuffle output data in memory can be more effective in reducing execution time because the same output data is typically read multiple times by downstream reducers.
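Likewise, the following sketch shows how to set tez.runtime.use.free.memory.writer.output explicitly in tez-site.xml (it is already set to true in the MR3 release):
vi conf/tez-site.xml
<property>
<name>tez.runtime.use.free.memory.writer.output</name>
<value>true</value>
</property>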
When using free memory to store shuffle output data,
it is strongly recommended to set
the configuration key hive.mr3.delete.vertex.local.directory
to true in hive-site.xml.
By default, ContainerWorkers delete intermediate data only after the completion of each query.
This default behavior is acceptable if shuffle output data is written to local disks.
If shuffle output data is stored in memory, however,
that memory becomes wasted once all downstream reducers are completed.
By setting hive.mr3.delete.vertex.local.directory
to true,
MR3 deletes intermediate data produced by a Vertex
as soon as all its downstream Vertexes have completed.
For example,
Map 1 and Map 2 in the following diagram can delete their intermediate data
after Reduce 1 is completed.
Internally,
HiveServer2 sets the MR3 configuration key mr3.am.notify.destination.vertex.complete
to true
so that ContainerWorkers are notified of the completion of all downstream Vertexes for each individual Vertex.
The user can either set hive.mr3.delete.vertex.local.directory
to true
in hive-site.xml
before starting HiveServer2
or override its value in Beeline before submitting a query.
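For example, the key can be set in hive-site.xml as follows (a sketch; the Beeline prompt below is the one from the earlier example):
vi conf/hive-site.xml
<property>
<name>hive.mr3.delete.vertex.local.directory</name>
<value>true</value>
</property>
Or, inside a Beeline connection:
0: jdbc:hive2://192.168.10.1:9852/> set hive.mr3.delete.vertex.local.directory=true;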
Note that if the configuration key hive.mr3.delete.vertex.local.directory
is set to true,
fetch-failures may give rise to a cascade of Vertex reruns all the way up to leaf Vertexes.
In the above diagram,
suppose that after Reduce 1 finishes all its Tasks,
a TaskAttempt of Reduce 2 reports a fetch-failure.
Then a Vertex rerun occurs and Reduce 1 creates a new TaskAttempt.
The new TaskAttempt, however, finds that no input data is available
because both Map 1 and Map 2 have already deleted their intermediate data.
As a consequence, Map 1 and Map 2 re-execute all their Tasks,
which in turn causes Vertex reruns in their ancestors.
When executing long-running batch queries that may occasionally trigger fault tolerance,
the user should not use free memory to store shuffle output
(with hive.mr3.delete.vertex.local.directory set to true).
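For such queries, both settings can be turned off, for example inside a Beeline connection (assuming these keys can be overridden per query in the same way as the keys shown earlier):
0: jdbc:hive2://192.168.10.1:9852/> set tez.runtime.use.free.memory.writer.output=false;
0: jdbc:hive2://192.168.10.1:9852/> set hive.mr3.delete.vertex.local.directory=false;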
Soft references in Tez
Enabling soft references is recommended only for advanced users, and only after completing basic performance tuning in a production environment.
In a cluster with ample memory relative to the number of cores
(e.g., 12GB of memory per core),
using soft references for ByteBuffers allocated in PipelinedSorter in Tez can make a noticeable difference.
To be specific,
setting the configuration key tez.runtime.pipelined.sorter.use.soft.reference
to true in tez-site.xml
creates soft references for ByteBuffers allocated in PipelinedSorter
and allows these references to be reused across all TaskAttempts running in the same ContainerWorker,
thus relieving pressure on the garbage collector.
When the size of memory allocated to each ContainerWorker is small, however,
using soft references is less likely to improve the performance.
vi conf/tez-site.xml
<property>
<name>tez.runtime.pipelined.sorter.use.soft.reference</name>
<value>true</value>
</property>
In the case of using soft references,
the user should append the Java VM option -XX:SoftRefLRUPolicyMSPerMB
(in milliseconds) to the value of the configuration key mr3.container.launch.cmd-opts
in mr3-site.xml.
Otherwise, ContainerWorkers use the default value of 1000 for SoftRefLRUPolicyMSPerMB.
In the following example, we set SoftRefLRUPolicyMSPerMB
to 25 milliseconds:
vi conf/mr3-site.xml
<property>
<name>mr3.container.launch.cmd-opts</name>
<value>... -XX:SoftRefLRUPolicyMSPerMB=25</value>
</property>