This page explains the additional step for configuring compaction when operating Hive on MR3 on Kubernetes with multiple nodes.
Configuring Metastore and HiveServer2
Check that the following two configuration keys are set to true in hive-site.xml. In particular, hive.mr3.compaction.using.mr3 should be set to true in order to use MR3 for compaction.
$ vi conf/hive-site.xml
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.mr3.compaction.using.mr3</name>
<value>true</value>
</property>
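The check can also be scripted. The following is a minimal sketch, not part of the MR3 distribution: the helper names get_hive_property and check_compaction_keys are hypothetical, and the parsing assumes that each <name> element is written without padding and that the <value> element sits on the same line or on a following line, as in the snippet above.

```shell
# Print the <value> of a property in a Hadoop-style XML configuration file.
# $1 = path to the file, $2 = property name (matched literally).
get_hive_property() {
  awk -v key="$2" '
    index($0, "<name>" key "</name>") { found = 1 }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Return non-zero if either compaction key is not set to true.
check_compaction_keys() {
  status=0
  for key in hive.compactor.initiator.on hive.mr3.compaction.using.mr3; do
    value=$(get_hive_property "$1" "$key")
    if [ "$value" != "true" ]; then
      echo "$key is '${value:-unset}', expected 'true'"
      status=1
    fi
  done
  return "$status"
}

# Usage (path as in the text above):
#   check_compaction_keys conf/hive-site.xml
```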
If the environment variable MR3_APPLICATION_ID_TIMESTAMP is set, Metastore tries to reuse an existing DAGAppMaster Pod with the same timestamp for compaction. If it is not set, Metastore creates its own DAGAppMaster Pod for compaction. Hence we want both Metastore and HiveServer2 to use the same value for MR3_APPLICATION_ID_TIMESTAMP so that they share the same DAGAppMaster.
In addition, the environment variable CLIENT_TO_AM_TOKEN_KEY should be set to the same value for both Metastore and HiveServer2 so that each can connect to DAGAppMaster, whichever of the two creates the DAGAppMaster Pod.
First run Metastore and get the values for the two environment variables.
$ ./run-metastore.sh
...
CLIENT_TO_AM_TOKEN_KEY=bf4e0823-7f1c-4ac8-ac58-c7f7537642af
MR3_APPLICATION_ID_TIMESTAMP=9957
...
Before running HiveServer2, set the two environment variables.
$ export CLIENT_TO_AM_TOKEN_KEY=bf4e0823-7f1c-4ac8-ac58-c7f7537642af
$ export MR3_APPLICATION_ID_TIMESTAMP=9957
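Copying the two values by hand is error-prone, so the export step can be scripted. The sketch below assumes that the output of run-metastore.sh has been saved to a file and that each variable appears at the start of a line in the form NAME=value, as in the output shown above. The function name export_mr3_env and the file name metastore.out are hypothetical.

```shell
# Export CLIENT_TO_AM_TOKEN_KEY and MR3_APPLICATION_ID_TIMESTAMP using the
# values found in a saved copy of the Metastore startup output.
# $1 = path to the saved output. Returns non-zero if a variable is missing.
export_mr3_env() {
  for name in CLIENT_TO_AM_TOKEN_KEY MR3_APPLICATION_ID_TIMESTAMP; do
    # Take the last matching line in case the variable is printed twice.
    value=$(sed -n "s/^${name}=//p" "$1" | tail -n 1)
    if [ -z "$value" ]; then
      echo "missing $name in $1" >&2
      return 1
    fi
    export "$name=$value"
  done
}

# Usage, assuming the script prints the variables to standard output:
#   ./run-metastore.sh 2>&1 | tee metastore.out
#   export_mr3_env metastore.out
```

The function must be called in the same shell that later starts HiveServer2. Otherwise the exported variables are not visible to run-hive.sh.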
Then run HiveServer2. We see that HiveServer2 uses the same values for the environment variables.
$ ./run-hive.sh
...
CLIENT_TO_AM_TOKEN_KEY=bf4e0823-7f1c-4ac8-ac58-c7f7537642af
MR3_APPLICATION_ID_TIMESTAMP=9957
...
Example
As an example, create a table test_compaction and perform INSERT and UPDATE operations.
0: jdbc:hive2://orange1:9852/> DROP TABLE IF EXISTS test_compaction;
0: jdbc:hive2://orange1:9852/> CREATE TABLE test_compaction (key INT, value1 STRING, value2 STRING) CLUSTERED BY (key) INTO 3 BUCKETS STORED AS ORC TBLPROPERTIES ("transactional"="true");
0: jdbc:hive2://orange1:9852/> INSERT INTO TABLE test_compaction VALUES (0, 'sprite', '0000'), (1, 'water', '0001'), (2, 'coke', '0000'), (3, 'green tea', '0002');
0: jdbc:hive2://orange1:9852/> UPDATE test_compaction SET value2 = '0003' WHERE key < 2;
0: jdbc:hive2://orange1:9852/> UPDATE test_compaction SET value2 = '0004' WHERE mod(key, 2) == 0;
0: jdbc:hive2://orange1:9852/> select * from test_compaction;
...
+----------------------+-------------------------+-------------------------+
| test_compaction.key  | test_compaction.value1  | test_compaction.value2  |
+----------------------+-------------------------+-------------------------+
| 3                    | green tea               | 0002                    |
| 1                    | water                   | 0003                    |
| 0                    | sprite                  | 0004                    |
| 2                    | coke                    | 0004                    |
+----------------------+-------------------------+-------------------------+
4 rows selected (1.724 seconds)
In the directory storing the table, we see several sub-directories.
$ find .
.
./delta_0000002_0000002_0000
./delta_0000002_0000002_0000/bucket_00001
./delta_0000002_0000002_0000/_orc_acid_version
./delete_delta_0000002_0000002_0000
./delete_delta_0000002_0000002_0000/bucket_00001
./delete_delta_0000002_0000002_0000/_orc_acid_version
./delta_0000003_0000003_0000
./delta_0000003_0000003_0000/bucket_00001
./delta_0000003_0000003_0000/bucket_00002
./delta_0000003_0000003_0000/_orc_acid_version
./delete_delta_0000003_0000003_0000
./delete_delta_0000003_0000003_0000/bucket_00001
./delete_delta_0000003_0000003_0000/bucket_00002
./delete_delta_0000003_0000003_0000/_orc_acid_version
./delta_0000001_0000001_0000
./delta_0000001_0000001_0000/bucket_00001
./delta_0000001_0000001_0000/bucket_00002
./delta_0000001_0000001_0000/_orc_acid_version
Next perform a minor compaction.
0: jdbc:hive2://orange1:9852/> ALTER TABLE test_compaction COMPACT 'minor';
0: jdbc:hive2://orange1:9852/> show compactions;
...
| 8 | default | test_compaction | --- | MINOR | succeeded | --- | 1659457147000 | 3000 | MR3-compaction-8 |
We can check whether Metastore has sent a MapReduce job to the MR3 DAGAppMaster. To do so, first get the names of the Metastore Pod and the DAGAppMaster Pod.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-74bf7fdbdd-cqp5b 1/1 Running 0 12m
hivemr3-metastore-0 1/1 Running 0 16m
mr3master-9957-0-75cc767845-nx78q 1/1 Running 0 11m
...
We see that Metastore sends a job called MR3-compaction-8 to DAGAppMaster.
$ kubectl logs -n hivemr3 hivemr3-metastore-0 | grep -a "Executing.*MR3-compaction-8"
Status: Running (Executing on MR3 DAGAppMaster): MR3-compaction-8
2022-08-02T07:19:07,504 INFO [hivemr3-metastore-0.metastore.hivemr3.svc.cluster.local-30] compactor.MR3CompactionHelper: Status: Running (Executing on MR3 DAGAppMaster): MR3-compaction-8
DAGAppMaster also receives the job.
$ kubectl logs -n hivemr3 mr3master-9957-0-75cc767845-nx78q | grep "submitDag.*MR3-compaction-8"
2022-08-02T07:19:07,447 INFO [IPC Server handler 23 on 8080] DAGAppMaster: submitDag() called by hive: MR3-compaction-8
After compaction, we see fewer sub-directories.
$ find .
.
./delete_delta_0000001_0000003
./delete_delta_0000001_0000003/bucket_00001
./delete_delta_0000001_0000003/bucket_00002
./delete_delta_0000001_0000003/_orc_acid_version
./delta_0000001_0000003
./delta_0000001_0000003/bucket_00001
./delta_0000001_0000003/bucket_00002
./delta_0000001_0000003/_orc_acid_version
Perform additional DELETE and INSERT operations.
0: jdbc:hive2://orange1:9852/> DELETE FROM test_compaction WHERE value1 = 'water';
0: jdbc:hive2://orange1:9852/> INSERT INTO TABLE test_compaction values (4, 'milk', '0005');
We see new sub-directories.
$ find .
.
./delta_0000005_0000005_0000
./delta_0000005_0000005_0000/bucket_00002
./delta_0000005_0000005_0000/_orc_acid_version
./delete_delta_0000001_0000003
./delete_delta_0000001_0000003/bucket_00001
./delete_delta_0000001_0000003/bucket_00002
./delete_delta_0000001_0000003/_orc_acid_version
./delta_0000001_0000003
./delta_0000001_0000003/bucket_00001
./delta_0000001_0000003/bucket_00002
./delta_0000001_0000003/_orc_acid_version
./delete_delta_0000004_0000004_0000
./delete_delta_0000004_0000004_0000/bucket_00001
./delete_delta_0000004_0000004_0000/_orc_acid_version
Finally perform a major compaction.
0: jdbc:hive2://orange1:9852/> ALTER TABLE test_compaction COMPACT 'major';
0: jdbc:hive2://orange1:9852/> show compactions;
...
| 8 | default | test_compaction | --- | MINOR | succeeded | --- | 1659457147000 | 3000 | MR3-compaction-8 |
| 9 | default | test_compaction | --- | MAJOR | succeeded | --- | 1659457314000 | 2000 | MR3-compaction-9 |
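When scripting this step, the state of a compaction can be read from the output of show compactions. The sketch below assumes the row layout shown above, in which the compaction ID is the first column and the state is the sixth. The function name compaction_state is hypothetical, and the column positions may differ in other Hive versions.

```shell
# Read 'show compactions' output on standard input and print the state
# (for example, succeeded) of the compaction with the given ID.
# $1 = compaction ID. Prints nothing if the ID is not found.
compaction_state() {
  awk -F'|' -v id="$1" '
    # With "|" as the separator, $1 is the empty text before the first bar,
    # so the first column is $2 and the sixth column is $7.
    { f = $2; gsub(/[ \t]/, "", f) }
    f == id { s = $7; gsub(/[ \t]/, "", s); print s; exit }
  '
}

# Usage, with the output of 'show compactions' saved to a file:
#   compaction_state 9 < compactions.out
```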
We see that Metastore sends a job called MR3-compaction-9 to DAGAppMaster.
$ kubectl logs -n hivemr3 hivemr3-metastore-0 | grep -a "Executing.*MR3-compaction-9"
Status: Running (Executing on MR3 DAGAppMaster): MR3-compaction-9
2022-08-02T07:21:54,970 INFO [hivemr3-metastore-0.metastore.hivemr3.svc.cluster.local-30] compactor.MR3CompactionHelper: Status: Running (Executing on MR3 DAGAppMaster): MR3-compaction-9
$ kubectl logs -n hivemr3 mr3master-9957-0-75cc767845-nx78q | grep "submitDag.*MR3-compaction-9"
2022-08-02T07:21:54,914 INFO [IPC Server handler 29 on 8080] DAGAppMaster: submitDag() called by hive: MR3-compaction-9
Now the directory storing the table has a single sub-directory.
$ find .
.
./base_0000005
./base_0000005/_metadata_acid
./base_0000005/bucket_00001
./base_0000005/bucket_00002
./base_0000005/_orc_acid_version
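The effect of each compaction can also be summarized by counting the top-level sub-directories of each kind. The sketch below assumes that the table directory is reachable through the local filesystem, as with the find commands above, and that find supports -mindepth and -maxdepth (GNU and BSD find do). The function name count_acid_dirs is hypothetical.

```shell
# Count base, delta, and delete_delta sub-directories directly under a
# table directory. $1 = path to the directory storing the table.
count_acid_dirs() {
  for prefix in base delta delete_delta; do
    # "delta_*" does not match "delete_delta_*" because -name matches the
    # whole directory name from its first character.
    n=$(find "$1" -mindepth 1 -maxdepth 1 -type d -name "${prefix}_*" | wc -l)
    echo "$prefix=$((n))"
  done
}

# Usage, from the directory storing the table:
#   count_acid_dirs .
```

For the listings above, this reports three delta and two delete_delta directories before the minor compaction, one of each after it, and a single base directory after the major compaction.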