Memory Settings
Map joins and output buffers
The following configuration keys are arguably the most important for performance tuning, especially when running complex queries with heavy joins.
The configuration key hive.auto.convert.join.noconditionaltask.size
in hive-site.xml
specifies the threshold for triggering map joins in Hive. Unlike Apache Hive, which internally applies an additional formula to adjust the value specified in hive-site.xml,
Hive on MR3 uses the value with no adjustment. Hence the user is advised to choose a value much larger than recommended for Apache Hive.
The configuration keys tez.runtime.io.sort.mb
and tez.runtime.unordered.output.buffer.size-mb
in tez-site.xml
specify the sizes of output buffers in Tez.
The following example shows sample values for these configuration keys.
vi conf/hive-site.xml
<property>
<name>hive.auto.convert.join.noconditionaltask.size</name>
<value>4000000000</value>
</property>
vi conf/tez-site.xml
<property>
<name>tez.runtime.io.sort.mb</name>
<value>1040</value>
</property>
<property>
<name>tez.runtime.unordered.output.buffer.size-mb</name>
<value>307</value>
</property>
The default values can be overridden for each individual query inside Beeline connections, as shown below.
0: jdbc:hive2://192.168.10.1:9852/> set hive.auto.convert.join.noconditionaltask.size=2000000000;
0: jdbc:hive2://192.168.10.1:9852/> set tez.runtime.io.sort.mb=2000;
0: jdbc:hive2://192.168.10.1:9852/> set tez.runtime.unordered.output.buffer.size-mb=600;
0: jdbc:hive2://192.168.10.1:9852/> !run /home/hive/sample.sql
Setting tez.runtime.io.sort.mb
to a large value may result in an OutOfMemoryError
when the Java VM cannot allocate a contiguous memory segment for the output buffer.
Caused by: java.lang.OutOfMemoryError: Java heap space
...
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.allocateSpace(PipelinedSorter.java:269)
In such a case, the user can try a smaller value for tez.runtime.io.sort.mb.
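For example, the user can lower the value for an individual query inside a Beeline connection before re-running it. The value 1040 below (taken from the earlier tez-site.xml example) is only illustrative:
0: jdbc:hive2://192.168.10.1:9852/> set tez.runtime.io.sort.mb=1040;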
Using free memory to store shuffle input
Hive on MR3 can utilize memory to store shuffle input data
that has been transmitted from upstream mappers.
If a ContainerWorker runs multiple Tasks concurrently,
a Task may find free memory in the Java heap
even after exhausting the entire memory allocated to it.
The configuration key tez.runtime.use.free.memory.fetched.input
in tez-site.xml
(which is set to true in the MR3 release)
controls whether or not to use such free memory to store shuffle input data.
If there is not enough free memory available, shuffle input data is written to local disks.
Setting tez.runtime.use.free.memory.fetched.input
to true can significantly reduce the execution time of a query if:
- Multiple Tasks can run concurrently in a ContainerWorker (e.g., 18 Tasks in a single ContainerWorker).
- The query produces more shuffle data than can be accommodated in memory.
- Local disks use slow storage (such as HDDs).
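Although tez.runtime.use.free.memory.fetched.input is already set to true in the MR3 release, the following sketch shows how to set it explicitly in tez-site.xml (or disable it by using false):
vi conf/tez-site.xml
<property>
<name>tez.runtime.use.free.memory.fetched.input</name>
<value>true</value>
</property>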
The following log of HiveServer2 shows that during the execution of a query, 689,693,191 bytes of shuffle data are stored in memory, with no data being written to local disks. (63,312,923 bytes of data are read directly from local disks because mappers and reducers are collocated.)
2023-12-15T06:59:16,028 INFO [HiveServer2-Background-Pool: Thread-84] mr3.MR3Task: SHUFFLE_BYTES_TO_MEM: 689693191
2023-12-15T06:59:16,028 INFO [HiveServer2-Background-Pool: Thread-84] mr3.MR3Task: SHUFFLE_BYTES_TO_DISK: 0
2023-12-15T06:59:16,028 INFO [HiveServer2-Background-Pool: Thread-84] mr3.MR3Task: SHUFFLE_BYTES_DISK_DIRECT: 63312923
Depending on the amount of shuffle data and the size of memory allocated to individual Tasks, Hive on MR3 may write shuffle data to local disks. In the following example, 10,475,065,599 bytes of shuffle data are written to local disks.
2023-12-15T07:09:57,472 INFO [HiveServer2-Background-Pool: Thread-99] mr3.MR3Task: SHUFFLE_BYTES_TO_MEM: 142177894794
2023-12-15T07:09:57,472 INFO [HiveServer2-Background-Pool: Thread-99] mr3.MR3Task: SHUFFLE_BYTES_TO_DISK: 10475065599
2023-12-15T07:09:57,472 INFO [HiveServer2-Background-Pool: Thread-99] mr3.MR3Task: SHUFFLE_BYTES_DISK_DIRECT: 13894557846
Using free memory to store shuffle output
Hive on MR3 can also utilize free memory to store shuffle output data
that is to be transmitted to downstream reducers.
The configuration key tez.runtime.use.free.memory.writer.output
in tez-site.xml
(which is set to true in the MR3 release)
controls whether or not to use free memory to store shuffle output data.
If there is not enough free memory available, shuffle output data is spilled to local disks.
In comparison with storing shuffle input data in memory, storing shuffle output data in memory can be more effective in reducing execution time because the same output data is typically read multiple times by downstream reducers.
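Likewise, the following sketch shows how to set tez.runtime.use.free.memory.writer.output explicitly in tez-site.xml (it is already set to true in the MR3 release):
vi conf/tez-site.xml
<property>
<name>tez.runtime.use.free.memory.writer.output</name>
<value>true</value>
</property>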
When using free memory to store shuffle output data,
it is strongly recommended to set
the configuration key hive.mr3.delete.vertex.local.directory
to true in hive-site.xml.
By default, ContainerWorkers delete intermediate data only after the completion of each query.
This default behavior is acceptable if shuffle output data is written to local disks.
If shuffle output data is stored in memory, however,
that memory becomes wasted once all downstream reducers are completed.
By setting hive.mr3.delete.vertex.local.directory
to true,
MR3 deletes intermediate data produced by a Vertex
as soon as all its downstream Vertexes have completed.
For example,
Map 1 and Map 2 in the following diagram can delete their intermediate data
after Reduce 1 is completed.
Internally,
HiveServer2 sets the MR3 configuration key mr3.am.notify.destination.vertex.complete
to true
so that ContainerWorkers are notified of the completion of all downstream Vertexes for each individual Vertex.
The user can either set hive.mr3.delete.vertex.local.directory
to true
in hive-site.xml
before starting HiveServer2
or override its value in Beeline before submitting a query.
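For example, the key can be set in hive-site.xml as follows (a sketch; the Beeline prompt below is the one from the earlier example):
vi conf/hive-site.xml
<property>
<name>hive.mr3.delete.vertex.local.directory</name>
<value>true</value>
</property>
Or, inside a Beeline connection:
0: jdbc:hive2://192.168.10.1:9852/> set hive.mr3.delete.vertex.local.directory=true;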
Note that if the configuration key hive.mr3.delete.vertex.local.directory
is set to true,
fetch-failures may give rise to a cascade of Vertex reruns all the way up to leaf Vertexes.
In the above diagram,
suppose that after Reduce 1 finishes all its Tasks,
a TaskAttempt of Reduce 2 reports a fetch-failure.
Then a Vertex rerun occurs and Reduce 1 creates a new TaskAttempt.
The new TaskAttempt, however, finds that no input data is available
because both Map 1 and Map 2 have already deleted their intermediate data.
As a consequence, Map 1 and Map 2 re-execute all their Tasks,
which in turn causes Vertex reruns in their ancestors.
When executing long-running batch queries that may occasionally trigger fault tolerance,
the user should not use free memory to store shuffle output
(with hive.mr3.delete.vertex.local.directory set to true).
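For such queries, both settings can be turned off, for example inside a Beeline connection (assuming these keys can be overridden per query in the same way as the keys shown earlier):
0: jdbc:hive2://192.168.10.1:9852/> set tez.runtime.use.free.memory.writer.output=false;
0: jdbc:hive2://192.168.10.1:9852/> set hive.mr3.delete.vertex.local.directory=false;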
Soft references in Tez
Enabling soft references is recommended only for advanced users, and only after completing basic performance tuning in a production environment.
In a cluster with ample memory relative to the number of cores
(e.g., 12GB of memory per core),
using soft references for ByteBuffers allocated in PipelinedSorter in Tez can make a noticeable difference.
To be specific,
setting the configuration key tez.runtime.pipelined.sorter.use.soft.reference
to true in tez-site.xml
creates soft references for ByteBuffers allocated in PipelinedSorter
and allows these references to be reused across all TaskAttempts running in the same ContainerWorker,
thus relieving pressure on the garbage collector.
When the size of memory allocated to each ContainerWorker is small, however,
using soft references is less likely to improve the performance.
vi conf/tez-site.xml
<property>
<name>tez.runtime.pipelined.sorter.use.soft.reference</name>
<value>true</value>
</property>
In the case of using soft references,
the user should append the Java VM option -XX:SoftRefLRUPolicyMSPerMB
(in milliseconds) to the value of the configuration key mr3.container.launch.cmd-opts
in mr3-site.xml.
Otherwise, ContainerWorkers use the default value of 1000 for SoftRefLRUPolicyMSPerMB.
In the following example, we set SoftRefLRUPolicyMSPerMB
to 25 milliseconds:
vi conf/mr3-site.xml
<property>
<name>mr3.container.launch.cmd-opts</name>
<value>... -XX:SoftRefLRUPolicyMSPerMB=25</value>
</property>