Hive on MR3 needs a PersistentVolume for several purposes:

  • HiveServer2 distributes resources for user defined functions (UDFs) using the PersistentVolume.
  • MR3 DAGAppMaster uses the PersistentVolume for staging directories where DAG recovery files (dag.submitted, dag.summary, dag.reported) are kept.
  • MR3 ContainerWorkers use the PersistentVolume to send the results of queries back to HiveServer2.

On Amazon EKS, we usually create PersistentVolumes using EFS. See Using S3 instead of EFS to learn how to use S3 instead of PersistentVolumes.

There are two methods to create a PersistentVolume using EFS:

  1. With the Amazon EFS CSI driver
  2. With an external storage provisioner

The first methods works only with kubectl version 1.14 and later.

For running Hive on MR3, the second method is slightly simpler. The quick start guide On EKS with Autoscaling uses the second method.

1. With the Amazon EFS CSI driver

Deploy the Amazon EFS CSI driver to an EKS cluster. For the instruction, see AWS User Guide. For creating a Fargate-only cluster (see Creating a Fargate-only Cluster), deploy the driver with the following command.

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-efs-csi-driver/master/deploy/kubernetes/base/csidriver.yaml
csidriver.storage.k8s.io/efs.csi.aws.com created

Obtain the CIDR range of the VPC of the EKS cluster.

$ aws eks describe-cluster --name hive-mr3 --query "cluster.resourcesVpcConfig.vpcId" --output text
vpc-0286f0b6b501ada72
$ aws ec2 describe-vpcs --vpc-ids vpc-0286f0b6b501ada72 --query "Vpcs[].CidrBlock" --output text
192.168.0.0/16

Open the AWS console and create a new security group that allows inbound NFS traffic for EFS mount points. The security group should use the VPC of the EKS cluster and include an inbound rule: 1) of type NFS; 2) using the CIDR range obtained in the previous step.

Create EFS on the AWS Console. When creating EFS, choose the VPC of the EKS cluster. For every mount target, use the security group created in the previous step. Change the permission of EFS if necessary.

Retrieve your Amazon EFS file system ID.

$ aws efs describe-file-systems --query "FileSystems[*].FileSystemId" --output text
fs-21726c00

Set the file system ID in kubernetes/efs-csi/workdir-pv.yaml.

$ vi kubernetes/efs-csi/workdir-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: workdir-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-21726c00

Create a new storage class efs-sc.

$ kubectl create -f kubernetes/efs-csi/sc.yaml
storageclass.storage.k8s.io/efs-sc created

Create a new PersistentVolumeClaim workdir-pvc. There is no limit in the capacity of the PersistentVolumeClaim workdir-pvc, so the user can ignored its nominal capacity 1Mi. All Pods of Hive on MR3 will share the PersistentVolumeClaim workdir-pvc.

$ kubectl create -f kubernetes/efs-csi/workdir-pv.yaml 
persistentvolume/workdir-pv created
$ kubectl create -f kubernetes/efs-csi/workdir-pvc.yaml 
persistentvolumeclaim/workdir-pvc created

$ kubectl get pv -n hivemr3
NAME         CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS   REASON   AGE
workdir-pv   5Gi        RWX            Retain           Bound    hivemr3/workdir-pvc   efs-sc                  4m8s
$ kubectl get pvc -n hivemr3
NAME          STATUS   VOLUME       CAPACITY   ACCESS MODES   STORAGECLASS   AGE
workdir-pvc   Bound    workdir-pv   5Gi        RWX            efs-sc         3m25s

2. With an external storage provisioner

When using an external storage provisioner, the user should take the following steps: 1) create EFS for the EKS cluster; 2) if necessary, configure security groups so that the EKS cluster can access EFS; 3) create a StorageClass for EFS.

1) Creating EFS

The user can create EFS on the AWS Console. When creating EFS, choose the VPC of the EKS cluster. Make sure that a mount target is created for each Availability Zone. If the user can choose the security group for mount targets, use the security group for the EKS cluster.

For creating EFS using AWS CLI, see the quick start guide On EKS with Autoscaling.

2) Configuring security groups when necessary

If the user cannot choose the security group for mount targets, security groups should be configured manually. Identify two security groups: 1) the security group for the EC2 instances constituting the EKS cluster; 2) the security group associated with the EFS mount targets.

For the first security group (for the EC2 instances), add a rule to allow inbound access using Secure Shell (SSH) from any host so that EFS can access the EKS cluster, as shown below: eks.security.group.ssh

For the second security group, add a rule to allow inbound access using NFS from the first security group so that the EKS cluster can access EFS, as shown below: eks.security.group.nfs Here sg-096f3c3dff95ad6ae is the first security group.

In order to check if the EKS cluster can mount EFS to a local directory in its EC2 instances, get the file system ID of EFS (e.g., fs-0440de65) and the region ID (e.g., ap-northeast-2). Then log on to an EC2 instance and execute the following commands.

$ mkdir -p /home/ec2-user/efs
$ sudo mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-0440de65.efs.ap-northeast-2.amazonaws.com:/ /home/ec2-user/efs

If the security groups are properly configured, EFS is mounted to the directory /home/ec2-user/efs.

3) Creating a StorageClass

We use the external storage provisioner for AWS EFS (available at https://github.com/kubernetes-incubator/external-storage) to create a StorageClass for EFS. The external storage provisioner assists PersistentVolumeClaims asking for the StorageClass for EFS by creating PersistentVolumes on EFS.

For creating a StorageClass for EFS, the directory kubernetes/efs contains four YAML files for starting the storage provisioner for EFS and creating a PersistentVolumeClaim workdir-pvc. Set the file system ID of EFS, the region ID, and the NFS server address in kubernetes/efs/manifest.yaml.

$ vi kubernetes/efs/manifest.yaml

data:
  file.system.id: fs-0440de65
  aws.region: ap-northeast-2
  provisioner.name: example.com/aws-efs
  dns.name: ""
spec:
  template:
    spec:
      volumes:
      - name: pv-volume
        nfs:
          server: fs-0440de65.efs.ap-northeast-2.amazonaws.com
          path: /

Then execute the script kubernetes/mount-efs.sh which uses the namespace specified by the environment variable MR3_NAMESPACE in kubernetes/env.sh.

$ kubernetes/mount-efs.sh

The user can find a new StorageClass aws-efs, a new Pod (e.g., efs-provisioner-785b649d46-pt8rh) running the storage provisioner, and a new PersistentVolumeClaim workdir-pvc. There is no limit in the capacity of the PersistentVolumeClaim workdir-pvc, so the user can ignored its nominal capacity 1Mi. All Pods of Hive on MR3 will share the PersistentVolumeClaim workdir-pvc.

$ kubectl get sc 
NAME            PROVISIONER             AGE
aws-efs         example.com/aws-efs     90s
gp2 (default)   kubernetes.io/aws-ebs   48m
$ kubectl get pods -n hivemr3
NAME                               READY   STATUS    RESTARTS   AGE
efs-provisioner-785b649d46-pt8rh   1/1     Running   0          65s
$ kubectl get pvc -n hivemr3
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
workdir-pvc   Bound    pvc-4cf7cdeb-c41f-11e9-8776-0a0b4be6c9fa   1Mi        RWX            aws-efs        112s

The user can verify that a new directory for the PersistentVolumeClaim workdir-pvc has been created on EFS.

$ kubectl exec -n hivemr3 -it efs-provisioner-785b649d46-pt8rh -- apk add bash
$ kubectl exec -n hivemr3 -it efs-provisioner-785b649d46-pt8rh -- /bin/bash
$ ls /persistentvolumes   # inside the storage provisioner Pod
/persistentvolumes/workdir-pvc-pvc-4cf7cdeb-c41f-11e9-8776-0a0b4be6c9fa

Copying a MySQL connector jar file to EFS

Assume that the Docker image does not contain a MySQL connector jar file and that Metastore/Ranger do not automatically download such a jar file for the mr3-master node group. Then the user should manually copy a MySQL connector jar file to a subdirectory /lib of the directory on EFS so that a MySQL connector becomes available to every Pod.

Log on to an EC2 instance and mount EFS to a local directory.

$ mkdir -p /home/ec2-user/efs
$ sudo mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-0440de65.efs.ap-northeast-2.amazonaws.com:/ /home/ec2-user/efs

Then find the directory for the PersistentVolumeClaim workdir-pvc and create a subdirectory /lib.

$ ls /home/ec2-user/efs/
workdir-pvc-pvc-4cf7cdeb-c41f-11e9-8776-0a0b4be6c9fa
$ sudo mkdir -p /home/ec2-user/efs/workdir-pvc-pvc-4cf7cdeb-c41f-11e9-8776-0a0b4be6c9fa/lib
$ cd /home/ec2-user/efs/workdir-pvc-pvc-4cf7cdeb-c41f-11e9-8776-0a0b4be6c9fa/lib
$ sudo chmod 777 .

Alternatively the user can connect directly to the storage provisioner Pod and create a subdirectory /lib under the directory for the PersistentVolumeClaim workdir-pvc.

After creating a subdirectory /lib, copy the jar file (e.g., by downloading it from the internet).

$ wget http://your.server.address/mysql-connector-java-8.0.12.jar
$ ls 
mysql-connector-java-8.0.12.jar

In order for Metastore to use the MySQL connector provided by the user, the Metastore Pod should use the PersistentVolumeClaim workdir-pvc. Edit kubernetes/yaml/metastore.yaml and uncomment the following block:

$ vi kubernetes/yaml/metastore.yaml

spec:
  template:
    spec:
      containers:
        volumeMounts:
        - name: work-dir-volume
          mountPath: /opt/mr3-run/lib
          subPath: lib
      volumes:
      - name: work-dir-volume
        persistentVolumeClaim:
          claimName: workdir-pvc