The previous approach creates a PersistentVolume using EFS. Alternatively, the user can use S3 and dispense with PersistentVolumes altogether. Using S3 instead of PersistentVolumes has the following pros and cons:

  • Setting up Hive on MR3 is simpler because the user skips the step of creating a PersistentVolume using EFS.
  • The execution can be slower because the results of queries are now written to S3 (by ContainerWorkers) and read from S3 (by HiveServer2).
  • The AWS cost can be slightly higher because of continual operations on S3.

Using S3 instead of PersistentVolumes

In order to use S3, the user should skip or adjust those steps in the previous approach that deal with PersistentVolume workdir-pv and PersistentVolumeClaim workdir-pvc.

If the Docker image does not contain a MySQL connector jar file and Metastore/Ranger do not automatically download such a jar file, the user should use the preBootstrapCommands field in the specification of the mr3-master node group to download a MySQL connector jar file automatically, because no PersistentVolume is available for providing the jar file (see Downloading a MySQL connector in Creating an EKS cluster).
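For illustration, the relevant fragment of an eksctl configuration file might look like the following. This is only a sketch: the download URL and the target directory are placeholders, and the remaining fields of the node group specification are omitted.

nodeGroups:
  - name: mr3-master
    # ... other fields of the node group specification ...
    preBootstrapCommands:
      # download a MySQL connector jar file on every node before it joins
      # the cluster; the URL and the target directory below are placeholders
      - "wget https://example.com/mysql-connector-java-8.0.28.jar -P /home/ec2-user/lib"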

kubernetes/conf/mr3-site.xml

Set the configuration key mr3.am.staging.dir.check.ownership.permission to false so that DAGAppMaster skips checking the ownership and permission of its staging directory, which is not meaningful on S3. For details, see Using Amazon S3 instead of PersistentVolumes.

$ vi kubernetes/conf/mr3-site.xml

<property>
  <name>mr3.am.staging.dir.check.ownership.permission</name>
  <value>false</value>
</property>

kubernetes/conf/hive-site.xml

Set the configuration key hive.exec.scratchdir to point to the S3 bucket for the scratch directory of HiveServer2 (under which a staging directory for MR3 DAGAppMaster is created). Do not update the configuration key hive.downloaded.resources.dir because it should point to a directory on the local file system.

$ vi kubernetes/conf/hive-site.xml

<property>
  <name>hive.exec.scratchdir</name>
  <value>s3a://hivemr3-warehouse-dir/workdir/${user.name}</value>
</property>
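For reference, hive.downloaded.resources.dir should keep a value on the local file system, as in the following sketch. The path shown here is only illustrative; leave the value already present in hive-site.xml unchanged.

<property>
  <name>hive.downloaded.resources.dir</name>
  <!-- illustrative local path only; keep the existing value -->
  <value>/tmp/hive/${hive.session.id}_resources</value>
</property>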

If necessary, include an additional inline policy in the IAM roles for the mr3-master and mr3-worker node groups so that every Pod can access the S3 bucket for the scratch directory.
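As an example, a minimal inline policy might look like the following. The bucket name hivemr3-warehouse-dir is taken from the example above, and the list of actions is only a sketch (the Sid marks it as an example) that may need to be extended depending on the setup.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleScratchDirAccessSketch",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::hivemr3-warehouse-dir",
        "arn:aws:s3:::hivemr3-warehouse-dir/*"
      ]
    }
  ]
}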

Removing PersistentVolume workdir-pv and PersistentVolumeClaim workdir-pvc

Open kubernetes/env.sh and set the following two environment variables to empty values.

$ vi kubernetes/env.sh

WORK_DIR_PERSISTENT_VOLUME_CLAIM=
WORK_DIR_PERSISTENT_VOLUME_CLAIM_MOUNT_DIR=

Open kubernetes/yaml/metastore.yaml and comment out the following lines:

$ vi kubernetes/yaml/metastore.yaml

# - name: work-dir-volume
#   mountPath: /opt/mr3-run/work-dir/

# - name: work-dir-volume
#   persistentVolumeClaim:
#     claimName: workdir-pvc

Open kubernetes/yaml/hive.yaml and comment out the following lines:

$ vi kubernetes/yaml/hive.yaml

# - name: work-dir-volume
#   mountPath: /opt/mr3-run/work-dir

# - name: work-dir-volume
#   persistentVolumeClaim:
#     claimName: workdir-pvc

Now the user can run Hive on MR3 on Amazon EKS without using PersistentVolumes.
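Metastore and HiveServer2 can then be started in the same way as in the previous approach, for example (assuming the run scripts included in the MR3 release under the kubernetes directory):

$ ./run-metastore.sh   # start Metastore, as in the previous approach
$ ./run-hive.sh        # start HiveServer2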