This page explains the additional configuration needed for compaction when running Hive on MR3 on Kubernetes with multiple nodes.

Configuring Metastore and HiveServer2

Check that the following two configuration keys are set to true in conf/hive-site.xml. Setting hive.mr3.compaction.using.mr3 to true is what makes compaction run on MR3.

$ vi conf/hive-site.xml

<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>

<property>
  <name>hive.mr3.compaction.using.mr3</name>
  <value>true</value>
</property>
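
The check above can also be scripted. The following is a minimal sketch, not part of the MR3 distribution: the function names hive_site_value and check_compaction_keys are made up for illustration, and the parsing assumes that each <name> line is followed by its <value> line, as in the snippet above.

```shell
# Print the <value> that follows a given <name> in a Hadoop-style XML file.
# Assumption: <name> and <value> appear on separate lines, name first.
hive_site_value() {
  awk -v key="$1" '
    index($0, "<name>" key "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
      print; exit
    }
  ' "$2"
}

# Report whether both compaction keys are set to true in the given file.
# Returns non-zero if either key is missing or not "true".
check_compaction_keys() {
  file="$1"
  status=0
  for key in hive.compactor.initiator.on hive.mr3.compaction.using.mr3; do
    value=$(hive_site_value "$key" "$file")
    if [ "$value" = "true" ]; then
      echo "OK       $key=true"
    else
      echo "NOT SET  $key (found: '$value')"
      status=1
    fi
  done
  return $status
}
```

Running check_compaction_keys conf/hive-site.xml prints one line per key and returns a non-zero status if either key is not true.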

If the environment variable MR3_APPLICATION_ID_TIMESTAMP is set, Metastore tries to reuse an existing DAGAppMaster Pod with the same timestamp for compaction. If it is not set, Metastore creates its own DAGAppMaster Pod for compaction. Hence Metastore and HiveServer2 should use the same value for MR3_APPLICATION_ID_TIMESTAMP so that they share a single DAGAppMaster. In addition, the environment variable CLIENT_TO_AM_TOKEN_KEY should be set to the same value for both, so that each can connect to the DAGAppMaster regardless of whether Metastore or HiveServer2 creates the DAGAppMaster Pod.

First run Metastore and note the values of the two environment variables in its output.

$ ./run-metastore.sh
...
CLIENT_TO_AM_TOKEN_KEY=bf4e0823-7f1c-4ac8-ac58-c7f7537642af
MR3_APPLICATION_ID_TIMESTAMP=9957
...

Before running HiveServer2, set the two environment variables.

$ export CLIENT_TO_AM_TOKEN_KEY=bf4e0823-7f1c-4ac8-ac58-c7f7537642af
$ export MR3_APPLICATION_ID_TIMESTAMP=9957

Then run HiveServer2. We see that HiveServer2 uses the same values for the environment variables.

$ ./run-hive.sh
...
CLIENT_TO_AM_TOKEN_KEY=bf4e0823-7f1c-4ac8-ac58-c7f7537642af
MR3_APPLICATION_ID_TIMESTAMP=9957
...
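
Copying the two values by hand is error-prone, so the hand-off can be sketched as a small filter. This is only an illustration: the function name mr3_exports and the file name metastore.out are hypothetical, and the filter assumes that the script prints each variable at the start of a line in NAME=value form, as in the output shown above.

```shell
# Read Metastore startup output on stdin and print one export statement
# for each of the two variables that HiveServer2 must share.
# Assumption: each variable is printed at the start of a line as NAME=value.
mr3_exports() {
  grep -E '^(CLIENT_TO_AM_TOKEN_KEY|MR3_APPLICATION_ID_TIMESTAMP)=' |
    sed 's/^/export /'
}
```

For example, after saving the Metastore output with ./run-metastore.sh | tee metastore.out, running eval "$(mr3_exports < metastore.out)" sets both variables in the current shell before ./run-hive.sh is started. Inspect the output of mr3_exports before passing it to eval.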

Example

As an example, create a table test_compaction and perform INSERT and UPDATE operations.

0: jdbc:hive2://orange1:9852/> DROP TABLE IF EXISTS test_compaction;
0: jdbc:hive2://orange1:9852/> CREATE TABLE test_compaction (key INT, value1 STRING, value2 STRING) CLUSTERED BY (key) INTO 3 BUCKETS STORED AS ORC TBLPROPERTIES ("transactional"="true");
0: jdbc:hive2://orange1:9852/> INSERT INTO TABLE test_compaction VALUES (0, 'sprite', '0000'), (1, 'water', '0001'), (2, 'coke', '0000'), (3, 'green tea', '0002');
0: jdbc:hive2://orange1:9852/> UPDATE test_compaction SET value2 = '0003' WHERE key < 2;
0: jdbc:hive2://orange1:9852/> UPDATE test_compaction SET value2 = '0004' WHERE mod(key, 2) == 0;
0: jdbc:hive2://orange1:9852/> select * from test_compaction;
...
+----------------------+-------------------------+-------------------------+
| test_compaction.key  | test_compaction.value1  | test_compaction.value2  |
+----------------------+-------------------------+-------------------------+
| 3                    | green tea               | 0002                    |
| 1                    | water                   | 0003                    |
| 0                    | sprite                  | 0004                    |
| 2                    | coke                    | 0004                    |
+----------------------+-------------------------+-------------------------+
4 rows selected (1.724 seconds)

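In the directory storing the table, we see several sub-directories: a delta directory for each of the three write operations, and a delete_delta directory for each of the two UPDATE operations.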

$ find .
.
./delta_0000002_0000002_0000
./delta_0000002_0000002_0000/bucket_00001
./delta_0000002_0000002_0000/_orc_acid_version
./delete_delta_0000002_0000002_0000
./delete_delta_0000002_0000002_0000/bucket_00001
./delete_delta_0000002_0000002_0000/_orc_acid_version
./delta_0000003_0000003_0000
./delta_0000003_0000003_0000/bucket_00001
./delta_0000003_0000003_0000/bucket_00002
./delta_0000003_0000003_0000/_orc_acid_version
./delete_delta_0000003_0000003_0000
./delete_delta_0000003_0000003_0000/bucket_00001
./delete_delta_0000003_0000003_0000/bucket_00002
./delete_delta_0000003_0000003_0000/_orc_acid_version
./delta_0000001_0000001_0000
./delta_0000001_0000001_0000/bucket_00001
./delta_0000001_0000001_0000/bucket_00002
./delta_0000001_0000001_0000/_orc_acid_version

Next perform a minor compaction.

0: jdbc:hive2://orange1:9852/> ALTER TABLE test_compaction COMPACT 'minor';
0: jdbc:hive2://orange1:9852/> show compactions;
...
| 8             | default   | test_compaction  |  ---       | MINOR  | succeeded  |  ---      | 1659457147000  | 3000          | MR3-compaction-8  |

We can check whether Metastore has sent a MapReduce job for compaction to the MR3 DAGAppMaster. First get the names of the Metastore Pod and the DAGAppMaster Pod.

$ kubectl get pods -n hivemr3
NAME                                   READY   STATUS    RESTARTS   AGE
hivemr3-hiveserver2-74bf7fdbdd-cqp5b   1/1     Running   0          12m
hivemr3-metastore-0                    1/1     Running   0          16m
mr3master-9957-0-75cc767845-nx78q      1/1     Running   0          11m
...

We see that Metastore sends a job called MR3-compaction-8 to DAGAppMaster.

$ kubectl logs -n hivemr3 hivemr3-metastore-0 | grep -a "Executing.*MR3-compaction-8"
Status: Running (Executing on MR3 DAGAppMaster): MR3-compaction-8
2022-08-02T07:19:07,504  INFO [hivemr3-metastore-0.metastore.hivemr3.svc.cluster.local-30] compactor.MR3CompactionHelper: Status: Running (Executing on MR3 DAGAppMaster): MR3-compaction-8

DAGAppMaster also receives the job.

$ kubectl logs -n hivemr3 mr3master-9957-0-75cc767845-nx78q | grep "submitDag.*MR3-compaction-8"
2022-08-02T07:19:07,447  INFO [IPC Server handler 23 on 8080] DAGAppMaster: submitDag() called by hive: MR3-compaction-8
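
The two log checks can be combined into one helper. This is a sketch only: the function name check_compaction_job is hypothetical, the grep patterns are the ones used above, and the default namespace and Metastore Pod name are taken from this example and will differ in other clusters.

```shell
# Check that a compaction job was submitted by Metastore and received by
# DAGAppMaster, using the same grep patterns as the commands above.
#   $1  job name, e.g. MR3-compaction-8
#   $2  DAGAppMaster Pod name, e.g. mr3master-9957-0-75cc767845-nx78q
#   $3  namespace (default: hivemr3, as in this example)
#   $4  Metastore Pod name (default: hivemr3-metastore-0, as in this example)
check_compaction_job() {
  job="$1"
  master_pod="$2"
  namespace="${3:-hivemr3}"
  metastore_pod="${4:-hivemr3-metastore-0}"

  kubectl logs -n "$namespace" "$metastore_pod" | grep -a -q "Executing.*$job" ||
    { echo "Metastore log: no submission of $job found"; return 1; }
  kubectl logs -n "$namespace" "$master_pod" | grep -a -q "submitDag.*$job" ||
    { echo "DAGAppMaster log: no submitDag() for $job found"; return 1; }
  echo "$job was sent by Metastore and received by DAGAppMaster"
}
```

For the run above, the call would be check_compaction_job MR3-compaction-8 mr3master-9957-0-75cc767845-nx78q. The job name is used as part of a regular expression, so MR3-compaction-8 also matches a later job such as MR3-compaction-80.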

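After compaction, we see fewer sub-directories: the three delta directories are merged into delta_0000001_0000003, and the two delete_delta directories into delete_delta_0000001_0000003.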

$ find .
.
./delete_delta_0000001_0000003
./delete_delta_0000001_0000003/bucket_00001
./delete_delta_0000001_0000003/bucket_00002
./delete_delta_0000001_0000003/_orc_acid_version
./delta_0000001_0000003
./delta_0000001_0000003/bucket_00001
./delta_0000001_0000003/bucket_00002
./delta_0000001_0000003/_orc_acid_version
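
To compare the directory layout before and after each compaction without reading the full listing, the sub-directories can be counted by kind. The following is a minimal sketch under two assumptions: the table directory is reachable as a local path (as with find above), and the directory names follow the base_, delta_, and delete_delta_ prefixes seen in this example. The function name acid_dir_summary is made up for illustration.

```shell
# Count base, delta, and delete_delta directories directly under a table
# (or partition) directory. Defaults to the current directory.
acid_dir_summary() {
  dir="${1:-.}"
  base=0; delta=0; delete_delta=0
  for d in "$dir"/*/; do
    [ -d "$d" ] || continue          # skip the unexpanded pattern if empty
    name=$(basename "$d")
    case "$name" in
      base_*)         base=$((base + 1)) ;;
      delete_delta_*) delete_delta=$((delete_delta + 1)) ;;
      delta_*)        delta=$((delta + 1)) ;;
    esac
  done
  echo "base=$base delta=$delta delete_delta=$delete_delta"
}
```

Run in the table directory, acid_dir_summary reports three delta and two delete_delta directories before the minor compaction above, and one of each after it.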

Perform additional DELETE and INSERT operations.

0: jdbc:hive2://orange1:9852/> DELETE FROM test_compaction WHERE value1 = 'water';

0: jdbc:hive2://orange1:9852/> INSERT INTO TABLE test_compaction values (4, 'milk', '0005');

We see new sub-directories.

$ find .
.
./delta_0000005_0000005_0000
./delta_0000005_0000005_0000/bucket_00002
./delta_0000005_0000005_0000/_orc_acid_version
./delete_delta_0000001_0000003
./delete_delta_0000001_0000003/bucket_00001
./delete_delta_0000001_0000003/bucket_00002
./delete_delta_0000001_0000003/_orc_acid_version
./delta_0000001_0000003
./delta_0000001_0000003/bucket_00001
./delta_0000001_0000003/bucket_00002
./delta_0000001_0000003/_orc_acid_version
./delete_delta_0000004_0000004_0000
./delete_delta_0000004_0000004_0000/bucket_00001
./delete_delta_0000004_0000004_0000/_orc_acid_version

Finally perform a major compaction.

0: jdbc:hive2://orange1:9852/> ALTER TABLE test_compaction COMPACT 'major';
0: jdbc:hive2://orange1:9852/> show compactions;
...
| 8             | default   | test_compaction  |  ---       | MINOR  | succeeded  |  ---      | 1659457147000  | 3000          | MR3-compaction-8  |
| 9             | default   | test_compaction  |  ---       | MAJOR  | succeeded  |  ---      | 1659457314000  | 2000          | MR3-compaction-9  |

We see that Metastore sends a job called MR3-compaction-9 to DAGAppMaster.

$ kubectl logs -n hivemr3 hivemr3-metastore-0 | grep -a "Executing.*MR3-compaction-9"
Status: Running (Executing on MR3 DAGAppMaster): MR3-compaction-9
2022-08-02T07:21:54,970  INFO [hivemr3-metastore-0.metastore.hivemr3.svc.cluster.local-30] compactor.MR3CompactionHelper: Status: Running (Executing on MR3 DAGAppMaster): MR3-compaction-9

DAGAppMaster also receives the job.

$ kubectl logs -n hivemr3 mr3master-9957-0-75cc767845-nx78q | grep "submitDag.*MR3-compaction-9"
2022-08-02T07:21:54,914  INFO [IPC Server handler 29 on 8080] DAGAppMaster: submitDag() called by hive: MR3-compaction-9

Now the directory storing the table has a single sub-directory.

$ find .
.
./base_0000005
./base_0000005/_metadata_acid
./base_0000005/bucket_00001
./base_0000005/bucket_00002
./base_0000005/_orc_acid_version