In the previous approach, we created a PersistentVolume for storing transient data such as the results of running queries. With access to Amazon S3, the user can dispense with PersistentVolumes for Metastore and HiveServer2. In order to use S3, the user should skip or adjust those steps in the previous approach that deal with PersistentVolume workdir-pv and PersistentVolumeClaim workdir-pvc. For the PersistentVolumes for Timeline Server and Apache Ranger, see Using HDFS instead of PersistentVolumes.

kubernetes/conf/mr3-site.xml

By default, MR3 DAGAppMaster checks the ownership and permission of its staging directory (which is specified by the configuration key mr3.am.staging-dir in mr3-site.xml and set automatically by HiveServer2) for security purposes. Since S3 is an object store that only simulates directories without maintaining ownership and permission, we should set the configuration key mr3.am.staging.dir.check.ownership.permission to false so as to skip the check on the staging directory.

$ vi kubernetes/conf/mr3-site.xml

<property>
  <name>mr3.am.staging.dir.check.ownership.permission</name>
  <value>false</value>
</property>

kubernetes/conf/hive-site.xml

Set the configuration key hive.exec.scratchdir in hive-site.xml to point to the S3 bucket for the scratch directory of HiveServer2 (under which a staging directory for MR3 DAGAppMaster is created). Do not update the configuration key hive.downloaded.resources.dir because it should point to a directory on the local file system.

$ vi kubernetes/conf/hive-site.xml

<property>
  <name>hive.exec.scratchdir</name>
  <value>s3a://hivemr3-warehouse-dir/workdir/${user.name}</value>
</property>

If the query results cache is enabled by setting the configuration key hive.query.results.cache.enabled to true, the configuration key hive.query.results.cache.directory should point to another S3 bucket; otherwise the query results cache is never used.
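
For example (a sketch only; the bucket name hivemr3-results-dir is an assumption and should be replaced with your own S3 bucket):

$ vi kubernetes/conf/hive-site.xml

<property>
  <name>hive.query.results.cache.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.query.results.cache.directory</name>
  <value>s3a://hivemr3-results-dir/results-cache</value>
</property>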

Removing PersistentVolume workdir-pv and PersistentVolumeClaim workdir-pvc

Open kubernetes/env.sh and set the following two environment variables to empty values.

$ vi kubernetes/env.sh

WORK_DIR_PERSISTENT_VOLUME_CLAIM=
WORK_DIR_PERSISTENT_VOLUME_CLAIM_MOUNT_DIR=

Set METASTORE_USE_PERSISTENT_VOLUME to false in env.sh.

$ vi kubernetes/env.sh

METASTORE_USE_PERSISTENT_VOLUME=false

Open kubernetes/yaml/metastore.yaml and comment out the following lines:

$ vi kubernetes/yaml/metastore.yaml

# - name: work-dir-volume
#   mountPath: /opt/mr3-run/work-dir/

# - name: work-dir-volume
#   persistentVolumeClaim:
#     claimName: workdir-pvc

Open kubernetes/yaml/hive.yaml and comment out the following lines:

$ vi kubernetes/yaml/hive.yaml

# - name: work-dir-volume
#   mountPath: /opt/mr3-run/work-dir

# - name: work-dir-volume
#   persistentVolumeClaim:
#     claimName: workdir-pvc

Now the user can run Hive on MR3 on Kubernetes without using PersistentVolumes. If, however, the Docker image does not contain a MySQL connector jar file and Metastore/Ranger do not automatically download such a jar file, the user should use a hostPath volume to mount a directory containing such a jar file at /opt/mr3-run/host-lib inside the Metastore Pod, as sketched below. See Downloading a MySQL connector in Creating an EKS cluster for an example.
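
A minimal sketch of such a hostPath volume in kubernetes/yaml/metastore.yaml: add an entry like the first one below under volumeMounts of the Metastore container and the second under volumes of the Pod spec. The volume name host-lib-volume and the host directory /home/ec2-user/lib are assumptions; the host directory must contain the MySQL connector jar file on every node where the Metastore Pod may run.

$ vi kubernetes/yaml/metastore.yaml

- name: host-lib-volume
  mountPath: /opt/mr3-run/host-lib

- name: host-lib-volume
  hostPath:
    path: /home/ec2-user/lib
    type: Directory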

Using UDFs

The user can also use user defined functions (UDFs) by uploading jar files to S3.
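
For example, a UDF jar file can be copied to the S3 bucket with the AWS CLI before registering it in Beeline. In the sketch below, the file name temp1.jar and the bucket hivemr3-warehouse-dir match the session that follows, but the local path of the jar file is an assumption:

$ aws s3 cp temp1.jar s3://hivemr3-warehouse-dir/temp1.jar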

...
Connecting to jdbc:hive2://10.1.91.41:9852/;;
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://10.1.91.41:9852/> use tpcds_partitioned_10_orc_s3a;
0: jdbc:hive2://10.1.91.41:9852/> add jar s3a://hivemr3-warehouse-dir/temp1.jar;
INFO  : Added [/opt/mr3-run/work-dir/04e3e74b-6c98-4ef6-8706-d8466fb6223c_resources/temp1.jar] to class path
INFO  : Added resources: [s3a://hivemr3-warehouse-dir/temp1.jar]
No rows affected (0.132 seconds)
0: jdbc:hive2://10.1.91.41:9852/> create temporary function foo as 'test.simple.SimpleClass1';
0: jdbc:hive2://10.1.91.41:9852/> select foo(s_zip) from store limit 5;
+---------------------------+
|            _c0            |
+---------------------------+
| simple1-0713114453 53604  |
| simple1-0713114453 51904  |
| simple1-0713114453 31904  |
| simple1-0713114453 33604  |
| simple1-0713114453 59231  |
+---------------------------+
5 rows selected (1.253 seconds)
0: jdbc:hive2://10.1.91.41:9852/> select foo(cc_city) from call_center, store where foo(cc_city) = foo(s_city) limit 10;
...
10 rows selected (5.517 seconds)
0: jdbc:hive2://10.1.91.41:9852/> 

Note that the directories and files created under the scratch directory survive after HiveServer2 terminates. Hence the user should manually clean the scratch directory on S3 if necessary.
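
For example, the scratch directory can be removed with the AWS CLI (a sketch only; adjust the path to match the value of hive.exec.scratchdir, and make sure that HiveServer2 is not running):

$ aws s3 rm --recursive s3://hivemr3-warehouse-dir/workdir/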