Compaction without MapReduce
Hive on MR3 supports compaction without relying on an external execution engine MapReduce. Currently Hive-LLAP and Hive on Tez send a compaction job triggered by Metastore to MapReduce. As a result, they can perform compaction only on Hadoop where MapReduce is running, thus limiting their portability. (Note that query-based compaction is supported in Hive 4.) In contrast, Hive on MR3 converts all compaction jobs, both major and minor, to MR3 DAGs, thus enabling itself to run in any environment.
Configuring compaction
The general behavior of Hive on MR3 when performing compaction is controlled by a configuration key hive.mr3.compaction.using.mr3
in hive-site.xml
.
- If
hive.mr3.compaction.using.mr3
is set to true, Metastore uses MR3 for compaction.- If high availability is enabled, Metastore retrieves the ApplicationID of the current MR3 DAGAppMaster, to which all MR3 DAGs for compaction are sent.
- If high availability is disabled, Metastore creates its own DAGAppMaster.
In this case, the user should set the configuration key
mr3.container.idle.timeout.ms
inmr3-site.xml
to a small value (e.g., 60000 for 1 minute) so that ContainerWorkers can be reclaimed soon after a compaction job is finished. Otherwise those ContainerWorkers created by the new DAGAppMaster may stay idle for a long time if compaction jobs are triggered infrequently.
- If
hive.mr3.compaction.using.mr3
is set to false, Metastore uses MapReduce for compaction.- If
mapreduce.framework.name
is set tolocal
, it performs compaction using MapReduce LocalJobRunner inside Metastore itself. - If
mapreduce.framework.name
is set toyarn
, it performs compaction using MapReduce. - In both cases,
hive.exec.orc.writer.llap.memory.manager.enabled
should be set to false.
- If
Here are a few remarks on performing compaction in Hive on MR3:
- If Metastore uses MR3 for compaction without enabling high availability, compaction does not start if no resources for DAGAppMaster and ContainerWorker are available in the cluster. Hence it is best to enable high availability if Metastore uses MR3 for compaction.
- Since Metastore may submit MR3 DAGs to DAGAppMaster, it reads
mr3-site.xml
as well. The user should set the configuration keymr3.master.mode
appropriately. - If the configuration key
hive.metastore.uris
is set to an empty string, HiveServer2 starts an embedded Metastore. Even with an embedded Metastore, however, compaction works okay. - HiveServer2 and Metastore should use the same Kerberos principal
(specified in environment variables
HIVE_METASTORE_KERBEROS_PRINCIPAL
andHIVE_SERVER2_KERBEROS_PRINCIPAL
inenv.sh
). Otherwise URISyntaxException can be generated at the end of compaction (fromcompactor.cleaner()
). If HiveServer2 and Metastore must use different Kerberos principals, do not use system property variables (such as${hive.metastore.host}
and${hive.metastore.port}
) for the configuration keyhive.metastore.uris
inhive-site.xml
(because insideUserGroupInformation.doAs()
,HiveConf
instances may not expand such variables to system property values).
Compaction on Kubernetes
For Hive on MR3 on Kubernetes, setting environment variable MR3_APPLICATION_ID_TIMESTAMP
implicitly enables high availability.
Thus the user can enable compaction on Kubernetes after check the following:
- The configuration key
hive.mr3.compaction.using.mr3
should be set to true inhive-site.xml
. If it is set to false,mapreduce.framework.name
must be set tolocal
so that Metastore can perform compaction using MapReduce LocalJobRunner. - The configuration key
mr3.master.mode
should be set tokubernetes
inmr3-site.xml
. - If environment variable
MR3_APPLICATION_ID_TIMESTAMP
is set, Metastore tries to reuse an existing DAGAppMaster Pod of the same timestamp for compaction. If it is not set, Metastore creates its own DAGAppMaster Pod for compaction. - Both configuration keys
hive.server2.support.dynamic.service.discovery
andhive.server2.active.passive.ha.enable
(which are irrelevant to Hive on MR3 on Kubernetes) should be set to false.
Configuring Metastore for automatic compaction
To perform compaction automatically, configure Metastore in hive-site.xml
.
- set
hive.compactor.initiator.on
to true. - set
hive.compactor.worker.threads
to an integer greater than 0.
Compaction in Hive 4 on MR3
In Hive 4 on MR3,
the configuration key hive.metastore.runworker.in
decides
whether compaction is initiated in Metastore (if set to metastore
)
or HiveServer2 (if set to hs2
).
We recommend setting hive.metastore.runworker.in
to hs2
,
in which case all compaction jobs are automatically sent to the MR3 DAGAppMaster
created by HiveServer2.
Hence Metastore never creates its own MR3 DAGAppMaster.
Troubleshooting
In Hive 4 on MR3, compaction jobs may get stuck in the state of ready for cleaning
.
This is usually caused by the corruption
in the table MIN_HISTORY_LEVEL
of the Metastore database.
The user can fix this problem in one of the following ways:
- set the configuration key
metastore.txn.use.minhistorywriteid
to true inhive-site.xml
- set the configuration key
metastore.txn.use.minhistorylevel
to false inhive-site.xml
- clear the table
MIN_HISTORY_LEVEL
of the Metastore database (recommended).