Compaction without MapReduce

Hive on MR3 supports compaction without relying on MapReduce as an external execution engine. Currently, Hive-LLAP and Hive on Tez send compaction jobs triggered by Metastore to MapReduce. As a result, they can perform compaction only on Hadoop clusters where MapReduce is running, which limits their portability. (Note that query-based compaction is supported in Hive 4.) In contrast, Hive on MR3 converts all compaction jobs, both major and minor, to MR3 DAGs, and can therefore perform compaction in any environment.

Configuring compaction

The general behavior of Hive on MR3 when performing compaction is controlled by the configuration key hive.mr3.compaction.using.mr3 in hive-site.xml.

  • If hive.mr3.compaction.using.mr3 is set to true, Metastore uses MR3 for compaction.
    • If high availability is enabled, Metastore retrieves the ApplicationID of the current MR3 DAGAppMaster, to which all MR3 DAGs for compaction are sent.
    • If high availability is disabled, Metastore creates its own DAGAppMaster. In this case, the user should set the configuration key mr3.container.idle.timeout.ms in mr3-site.xml to a small value (e.g., 60000 for 1 minute) so that ContainerWorkers can be reclaimed soon after a compaction job is finished. Otherwise those ContainerWorkers created by the new DAGAppMaster may stay idle for a long time if compaction jobs are triggered infrequently.
  • If hive.mr3.compaction.using.mr3 is set to false, Metastore uses MapReduce for compaction.
    • If mapreduce.framework.name is set to local, it performs compaction using MapReduce LocalJobRunner inside Metastore itself.
    • If mapreduce.framework.name is set to yarn, it performs compaction by submitting MapReduce jobs to Yarn.
    • In both cases, hive.exec.orc.writer.llap.memory.manager.enabled should be set to false.
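
As a minimal sketch, assuming Metastore uses MR3 for compaction with high availability disabled, the two settings discussed above might look as follows. The first fragment belongs in hive-site.xml and the second in mr3-site.xml. The value 60000 is the example from the text, not a tuned recommendation.

```xml
<!-- hive-site.xml: let Metastore use MR3 for compaction -->
<configuration>
  <property>
    <name>hive.mr3.compaction.using.mr3</name>
    <value>true</value>
  </property>
</configuration>
```

```xml
<!-- mr3-site.xml: reclaim idle ContainerWorkers after 1 minute -->
<configuration>
  <property>
    <name>mr3.container.idle.timeout.ms</name>
    <value>60000</value>
  </property>
</configuration>
```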

Here are a few remarks on performing compaction in Hive on MR3:

  • If Metastore uses MR3 for compaction without enabling high availability, compaction does not start if no resources for DAGAppMaster and ContainerWorker are available in the cluster. Hence it is best to enable high availability if Metastore uses MR3 for compaction.
  • Since Metastore may submit MR3 DAGs to DAGAppMaster, it reads mr3-site.xml as well. The user should set the configuration key mr3.master.mode appropriately.
  • If the configuration key hive.metastore.uris is set to an empty string, HiveServer2 starts an embedded Metastore. Compaction works normally even with an embedded Metastore.
  • HiveServer2 and Metastore should use the same Kerberos principal (specified in environment variables HIVE_METASTORE_KERBEROS_PRINCIPAL and HIVE_SERVER2_KERBEROS_PRINCIPAL in env.sh). Otherwise, a URISyntaxException may be thrown at the end of compaction (from compactor.cleaner()). If HiveServer2 and Metastore must use different Kerberos principals, do not use system property variables (such as ${hive.metastore.host} and ${hive.metastore.port}) for the configuration key hive.metastore.uris in hive-site.xml, because inside UserGroupInformation.doAs(), HiveConf instances may not expand such variables to system property values.
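
To illustrate the last remark, the following sketch sets hive.metastore.uris in hive-site.xml to a literal address instead of system property variables. The host name metastore.example.com is hypothetical, and 9083 is assumed only because it is a commonly used Metastore port; substitute the actual address of your Metastore.

```xml
<!-- hive-site.xml: literal Metastore address, no ${...} variables -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- hypothetical host and assumed port, for illustration only -->
    <value>thrift://metastore.example.com:9083</value>
  </property>
</configuration>
```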

Compaction on Kubernetes

For Hive on MR3 on Kubernetes, setting the environment variable MR3_APPLICATION_ID_TIMESTAMP implicitly enables high availability. Thus the user can enable compaction on Kubernetes after checking the following:

  • The configuration key hive.mr3.compaction.using.mr3 should be set to true in hive-site.xml. If it is set to false, mapreduce.framework.name must be set to local so that Metastore can perform compaction using MapReduce LocalJobRunner.
  • The configuration key mr3.master.mode should be set to kubernetes in mr3-site.xml.
  • If environment variable MR3_APPLICATION_ID_TIMESTAMP is set, Metastore tries to reuse an existing DAGAppMaster Pod of the same timestamp for compaction. If it is not set, Metastore creates its own DAGAppMaster Pod for compaction.
  • Both configuration keys hive.server2.support.dynamic.service.discovery and hive.server2.active.passive.ha.enable (which are irrelevant to Hive on MR3 on Kubernetes) should be set to false.
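
The checklist above can be sketched as the following pair of fragments, assuming Metastore uses MR3 (rather than MapReduce LocalJobRunner) for compaction. Only the keys named in the checklist are shown; any other settings your deployment needs are omitted.

```xml
<!-- hive-site.xml: compaction with MR3 on Kubernetes -->
<configuration>
  <property>
    <name>hive.mr3.compaction.using.mr3</name>
    <value>true</value>
  </property>
  <!-- irrelevant to Hive on MR3 on Kubernetes, so both are disabled -->
  <property>
    <name>hive.server2.support.dynamic.service.discovery</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.active.passive.ha.enable</name>
    <value>false</value>
  </property>
</configuration>
```

```xml
<!-- mr3-site.xml: run DAGAppMaster in Kubernetes mode -->
<configuration>
  <property>
    <name>mr3.master.mode</name>
    <value>kubernetes</value>
  </property>
</configuration>
```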

Configuring Metastore for automatic compaction

To perform compaction automatically, configure Metastore in hive-site.xml as follows:

  • Set hive.compactor.initiator.on to true.
  • Set hive.compactor.worker.threads to an integer greater than 0.
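
A minimal hive-site.xml fragment for these two settings might look like this. The value 1 for hive.compactor.worker.threads is an illustrative choice; any integer greater than 0 satisfies the requirement.

```xml
<!-- hive-site.xml: enable automatic compaction in Metastore -->
<configuration>
  <property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.compactor.worker.threads</name>
    <!-- illustrative value; must be greater than 0 -->
    <value>1</value>
  </property>
</configuration>
```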

Compaction in Hive 4 on MR3

In Hive 4 on MR3, the configuration key hive.metastore.runworker.in determines whether compaction is initiated in Metastore (if set to metastore) or in HiveServer2 (if set to hs2). We recommend setting hive.metastore.runworker.in to hs2, in which case all compaction jobs are automatically sent to the MR3 DAGAppMaster created by HiveServer2, and Metastore never creates its own MR3 DAGAppMaster.
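
The recommended setting can be sketched as follows, assuming the key is placed in hive-site.xml like the other Hive configuration keys in this document.

```xml
<!-- hive-site.xml (assumed location): initiate compaction in HiveServer2 -->
<configuration>
  <property>
    <name>hive.metastore.runworker.in</name>
    <value>hs2</value>
  </property>
</configuration>
```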

Troubleshooting

In Hive 4 on MR3, compaction jobs may get stuck in the state "ready for cleaning". This is usually caused by corruption in the table MIN_HISTORY_LEVEL of the Metastore database. The user can fix this problem in one of the following ways:

  • Set the configuration key metastore.txn.use.minhistorywriteid to true in hive-site.xml.
  • Set the configuration key metastore.txn.use.minhistorylevel to false in hive-site.xml.
  • Clear the table MIN_HISTORY_LEVEL of the Metastore database (recommended).
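
For reference, the second option above might be written in hive-site.xml as in the sketch below. It shows only one of the configuration-based alternatives; the recommended fix remains clearing the table MIN_HISTORY_LEVEL, which is done in the Metastore database rather than in a configuration file.

```xml
<!-- hive-site.xml: one configuration-based workaround (not the recommended fix) -->
<configuration>
  <property>
    <name>metastore.txn.use.minhistorylevel</name>
    <value>false</value>
  </property>
</configuration>
```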