Hive on MR3 supports compaction without relying on an external execution engine MapReduce. Currently Hive-LLAP and Hive on Tez send a compaction job triggered by Metastore to MapReduce. As a result, they can perform compaction only on Hadoop where MapReduce is running, thus limiting their portability. (Note that a query-based compactor will be available in Hive 4: https://issues.apache.org/jira/browse/HIVE-20699 and https://issues.apache.org/jira/browse/HIVE-20934.) In contrast, Hive on MR3 converts all compaction jobs, both major and minor, to MR3 DAGs, thus enabling itself to run in any environment.

The behavior of Hive on MR3 when performing compaction is controlled by a configuration key hive.mr3.compaction.using.mr3 in hive-site.xml.

  • If hive.mr3.compaction.using.mr3 is set to true, Metastore uses MR3 for compaction.
    • If high availability is enabled as explained in Enabling High Availability, Metastore retrieves the Yarn ApplicationID of the current MR3 DAGAppMaster, to which all MR3 DAGs for compaction are sent.
    • If high availability is disabled, Metastore creates its own DAGAppMaster. In this case, the user should set the configuration key mr3.container.idle.timeout.ms in mr3-site.xml to a small value (e.g., 60000 for 1 minute) so that ContainerWorkers can be reclaimed soon after a compaction job is finished. Otherwise those ContainerWorkers created by the new DAGAppMaster may stay idle for a long time if compaction jobs are triggered infrequently.
  • If hive.mr3.compaction.using.mr3 is set to false, Metastore uses MapReduce for compaction.
    • If mapreduce.framework.name is set to local, it performs compaction using MapReduce LocalJobRunner inside Metastore itself.
    • If mapreduce.framework.name is set to yarn, it performs compaction using MapReduce.
    • In both cases, hive.exec.orc.writer.llap.memory.manager.enabled should be set to false.

Here are a few remarks on performing compaction in Hive on MR3:

  • If Metastore uses MR3 for compaction without enabling high availability, compaction does not start if no resources for DAGAppMaster and ContainerWorker are available in the cluster. Hence it is best to enable high availability if Metastore uses MR3 for compaction.
  • Since Metastore may submit MR3 DAGs to DAGAppMaster, it reads mr3-site.xml as well. The user should set the configuration key mr3.master.mode appropriately, e.g., to yarn when high availability is enabled.
  • If the configuration key hive.metastore.uris is set to an empty string, HiveServer2 starts an embedded Metastore. Even with an embedded Metastore, however, compaction works okay.
  • HiveServer2 and Metastore should use the same Kerberos principal (specified in environment variables HIVE_METASTORE_KERBEROS_PRINCIPAL and HIVE_SERVER2_KERBEROS_PRINCIPAL in env.sh). Otherwise URISyntaxException can be generated at the end of compaction (from compactor.cleaner()). If HiveServer2 and Metastore must use different Kerberos principals, do not use system property variables (such as ${hive.metastore.host} and ${hive.metastore.port}) for the configuration key hive.metastore.uris in hive-site.xml (because inside UserGroupInformation.doAs(), HiveConf instances may not expand such variables to system property values).

Compaction using MR3 is supported only in Hive 3 and Hive 4 on MR3, and not in Hive 2 on MR3.