In the previous approach, we created a PersistentVolume using EFS. Alternatively, the user can use S3 and dispense with PersistentVolumes altogether. Using S3 instead of PersistentVolumes has the following pros and cons:
- Setting up Hive on MR3 is simpler because the user skips the step of creating a PersistentVolume using EFS.
- The execution can be slower because the results of queries are now written to S3 (by ContainerWorkers) and read from S3 (by HiveServer2).
- The AWS cost can be slightly higher because of continual operations on S3.
Using S3 instead of PersistentVolumes
In order to use S3, the user should skip or adjust those steps in the previous approach that deal with PersistentVolume workdir-pv and PersistentVolumeClaim workdir-pvc.
If the Docker image does not contain a MySQL connector jar file and Metastore/Ranger do not automatically download such a jar file, the user should use the preBootstrapCommands field in the specification of the mr3-master node group to automatically download a MySQL connector jar file, because no PersistentVolume is available (see Downloading a MySQL connector in Creating an EKS cluster).
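For reference, here is a minimal sketch of such a node group specification for eksctl. The file name eks-cluster.yaml, the connector version, the download URL, and the target directory on the host are assumptions for illustration; the jar file is expected to be exposed to the Metastore Pod, e.g., via a hostPath volume.
$ vi eks-cluster.yaml
nodeGroups:
  - name: mr3-master
    preBootstrapCommands:
      # hypothetical commands: place a MySQL connector jar on every mr3-master node
      - "mkdir -p /home/ec2-user/lib"
      - "curl -L -o /home/ec2-user/lib/mysql-connector-java-8.0.28.jar https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar"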
kubernetes/conf/mr3-site.xml
Set the configuration key mr3.am.staging.dir.check.ownership.permission to false.
For details, see Using Amazon S3 instead of PersistentVolumes.
$ vi kubernetes/conf/mr3-site.xml
<property>
  <name>mr3.am.staging.dir.check.ownership.permission</name>
  <value>false</value>
</property>
kubernetes/conf/hive-site.xml
Set the configuration key hive.exec.scratchdir in hive-site.xml to point to the S3 bucket for the scratch directory of HiveServer2 (under which a staging directory for MR3 DAGAppMaster is created). Do not update the configuration key hive.downloaded.resources.dir because it should point to a directory on the local file system.
$ vi kubernetes/conf/hive-site.xml
<property>
  <name>hive.exec.scratchdir</name>
  <value>s3a://hivemr3-warehouse-dir/workdir/${user.name}</value>
</property>
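For comparison, hive.downloaded.resources.dir should keep a value on the local file system. The value below follows the Hive default pattern and is shown only as a hypothetical example; the exact path may differ in your installation.
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/tmp/${hive.session.id}_resources</value>
</property>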
If necessary, include an additional inline policy in the IAM roles for the mr3-master and mr3-worker node groups so that every Pod can access the S3 bucket for the scratch directory.
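Below is a minimal sketch of such an inline policy, assuming the bucket name hivemr3-warehouse-dir from the example above. The statement Sid and the file name are hypothetical, and the set of actions should be adjusted to your security requirements.
$ vi s3-scratch-dir-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "HypotheticalS3ScratchDirAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::hivemr3-warehouse-dir",
        "arn:aws:s3:::hivemr3-warehouse-dir/*"
      ]
    }
  ]
}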
Removing PersistentVolume workdir-pv and PersistentVolumeClaim workdir-pvc
Open kubernetes/env.sh and set the following two environment variables to empty values.
$ vi kubernetes/env.sh
WORK_DIR_PERSISTENT_VOLUME_CLAIM=
WORK_DIR_PERSISTENT_VOLUME_CLAIM_MOUNT_DIR=
Open kubernetes/yaml/metastore.yaml and comment out the following lines:
$ vi kubernetes/yaml/metastore.yaml
# - name: work-dir-volume
#   mountPath: /opt/mr3-run/work-dir/
# - name: work-dir-volume
#   persistentVolumeClaim:
#     claimName: workdir-pvc
Open kubernetes/yaml/hive.yaml and comment out the following lines:
$ vi kubernetes/yaml/hive.yaml
# - name: work-dir-volume
#   mountPath: /opt/mr3-run/work-dir
# - name: work-dir-volume
#   persistentVolumeClaim:
#     claimName: workdir-pvc
Now the user can run Hive on MR3 on Amazon EKS without using PersistentVolumes.
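To verify the setup, the user can check that the Pods start without mounting work-dir-volume and that workdir-pvc is no longer in use; the namespace hivemr3 below is an assumption and should match your configuration.
$ kubectl get pods -n hivemr3
$ kubectl get pvc -n hivemr3    # workdir-pvc should not be listed (or should be unused)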