We use the command eksctl to create an EKS cluster with two node groups: mr3-master and mr3-worker. The mr3-master node group is intended for those Pods that should always be running, i.e., HiveServer2, DAGAppMaster, Metastore, Ranger, and Timeline Server Pods. Hence on-demand instances are appropriate for the mr3-master node group so as to minimize the chance of temporary service disruption. The mr3-worker node group is intended for ContainerWorker Pods. The following two properties of ContainerWorker Pods make it desirable to create mr3-worker as a separate node group:

  • Thanks to the fault-tolerance optimization implemented in MR3, preempting ContainerWorker Pods does not incur considerable delays in completing queries. Hence spot instances are acceptable for running ContainerWorker Pods.
  • ContainerWorker Pods need local disks for writing intermediate data. Hence the user may choose to provision local disks to every node in the mr3-worker node group.

In our example, we allocate a single on-demand instance to the mr3-master node group and three spot instances to the mr3-worker node group. For the mr3-worker node group, we may choose instance types that are equipped with a single local disk. Here is a sample specification (kubernetes/eks/cluster.yaml) of an EKS cluster to be used as input to the command eksctl.

$ vi kubernetes/eks/cluster.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: hive-mr3
  region: ap-northeast-2

vpc:
  nat:
    gateway: Single

nodeGroups:
  - name: mr3-master
    instanceType: m5.large
    labels: { roles: masters }
    desiredCapacity: 1
    ssh:
      allow: true
  - name: mr3-worker
    instanceType: mixed
    labels: { roles: workers }
    desiredCapacity: 3
    minSize: 1
    maxSize: 4
    instancesDistribution:
      maxPrice: 0.068
      instanceTypes: ["m5.large", "m5.xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotInstancePools: 4

Using instance storage

ContainerWorkers of MR3 write intermediate data, such as output of TaskAttempts or input to TaskAttempts fetched through shuffle handlers, to local disks. In the case of running Hive on MR3 on Kubernetes, there are three ways to simulate local disks for ContainerWorker Pods:

  1. Use emptyDir volumes specified by the configuration key mr3.k8s.pod.worker.emptydirs. Each directory in the configuration value is mapped to an emptyDir volume.
  2. Use hostPath volumes specified by the configuration key mr3.k8s.pod.worker.hostpaths. Each directory in the configuration value (which should be ready on the host node) is mapped to a hostPath volume.
  3. Use persistentVolumeClaim volumes (mounting PersistentVolumes) specified by the configuration key mr3.k8s.worker.local.dir.persistentvolumes along with mr3.k8s.local.dir.persistentvolume.storageclass and mr3.k8s.local.dir.persistentvolume.storage. Each directory in the configuration value is mapped to a persistentVolumeClaim volume which is created dynamically according to the storage class specified by mr3.k8s.local.dir.persistentvolume.storageclass (e.g., gp2 for EBS) and the size specified by mr3.k8s.local.dir.persistentvolume.storage (e.g., 2Gi).

For Amazon EKS, the first option is okay only for running small queries with a low concurrency level because of the small size of the root partition (20GB by default). Alternatively, the user can increase the size of the root partition with the --node-volume-size flag of eksctl when creating an EKS cluster. The second option offers the best performance (at a slightly higher cost) if we use instance storage which is physically attached to the host node. The third option works only in limited cases: 1) the EKS cluster should be running in a single Availability Zone, and 2) Docker containers should run as a root user.
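Instead of passing --node-volume-size on the command line, the root volume size can also be fixed in cluster.yaml itself. The following fragment is a sketch assuming eksctl's per-node-group volumeSize field (in GiB); the value 100 is only an example.

```yaml
nodeGroups:
  - name: mr3-worker
    # Enlarge the root EBS volume of every node in this node group
    # (size in GiB); equivalent to the --node-volume-size flag.
    volumeSize: 100
```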

In order to use the second option, the user should create an EKS cluster whose mr3-worker node group uses an EC2 instance type equipped with instance storage. In addition, the preBootstrapCommands field in the specification of the mr3-worker node group should include commands for formatting and mounting the instance storage. For example, for the instance type m5d.xlarge, which has a single NVMe disk, we can extend the specification of the EKS cluster as follows.

$ vi kubernetes/eks/cluster.yaml

  - name: mr3-worker
    preBootstrapCommands:
      - "mkfs -t ext4 /dev/nvme1n1"
      - "mkdir -p /ephemeral1"
      - "mount /dev/nvme1n1 /ephemeral1"
      - "chown ec2-user:ec2-user /ephemeral1"

Here the sequence of commands in the preBootstrapCommands field formats the partition residing on the NVMe disk and mounts it to the directory /ephemeral1. (It does not help to extend /etc/eks/bootstrap.sh in the Amazon Machine Image (AMI) because eksctl does not read the file.) The following example illustrates how to initialize instance storage to mount multiple local disks.

$ vi kubernetes/eks/cluster.yaml

  - name: mr3-worker
    preBootstrapCommands:
      - "IDX=1"
      - "for DEV in /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_*-ns-1; do mkfs.xfs ${DEV}; mkdir -p /ephemeral${IDX}; echo ${DEV} /ephemeral${IDX} xfs defaults,noatime 1 2 >> /etc/fstab; IDX=$((${IDX} + 1)); done"
      - "mount -a"
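The per-disk loop above can be exercised locally against mock device files before baking it into preBootstrapCommands. The sketch below substitutes a temporary directory for /dev/disk/by-id and a scratch file for /etc/fstab; the mkfs and mount steps are left as comments since they require real devices.

```shell
#!/bin/sh
# Sketch of the disk-discovery loop, run against mock device files.
# All paths here are stand-ins for the real /dev and /etc locations.
set -e

MOCK=$(mktemp -d)
# Pretend the node has two instance-storage NVMe namespaces.
touch "${MOCK}/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1-ns-1"
touch "${MOCK}/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS2-ns-1"

FSTAB="${MOCK}/fstab"                      # stands in for /etc/fstab
IDX=1
for DEV in "${MOCK}"/nvme-Amazon_EC2_NVMe_Instance_Storage_*-ns-1; do
  # On a real node: mkfs.xfs ${DEV}
  mkdir -p "${MOCK}/ephemeral${IDX}"
  echo "${DEV} /ephemeral${IDX} xfs defaults,noatime 1 2" >> "${FSTAB}"
  IDX=$((IDX + 1))
done
# On a real node: mount -a

cat "${FSTAB}"
```

Each mock device receives its own /ephemeralN mount point, so the loop scales to any number of instance-storage disks without editing the command.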

Then the user can use /ephemeral1 for the configuration key mr3.k8s.pod.worker.hostpaths in kubernetes/conf/mr3-site.xml.

$ vi kubernetes/conf/mr3-site.xml

<property>
  <name>mr3.k8s.pod.worker.hostpaths</name>
  <value>/ephemeral1</value>
</property>
Downloading a MySQL connector

Running Metastore on Amazon EKS is essentially no different from running it on Kubernetes in general. As an additional step specific to EKS, however, the user should make a MySQL connector available in every Pod, such as a Metastore Pod, that requires connections to MySQL. If the Docker image does not contain such a jar file and Metastore does not download one automatically, there are two ways to make a MySQL connector available in a Pod:

  1. We use the preBootstrapCommands field in the specification of the mr3-master node group to automatically download a MySQL connector jar file.
  2. The user manually copies a MySQL connector jar file to the PersistentVolumeClaim workdir-pvc.

Here we explain the first approach. The second approach is explained in Creating a PersistentVolume using EFS.

In order to automatically download a MySQL connector jar file, we extend the specification of the EKS cluster (where your.server.address should be replaced appropriately).

$ vi kubernetes/eks/cluster.yaml

  - name: mr3-master
    preBootstrapCommands:
      - "wget http://your.server.address/mysql-connector-java-8.0.12.jar"
      - "mkdir -p /home/ec2-user/lib"
      - "mv mysql-connector-java-8.0.12.jar /home/ec2-user/lib"

Now every node instance in the mr3-master node group starts with mysql-connector-java-8.0.12.jar in the directory /home/ec2-user/lib. Next we extend kubernetes/yaml/metastore.yaml to create a hostPath volume and mount it under the directory /opt/mr3-run/host-lib.

$ vi kubernetes/yaml/metastore.yaml

        volumeMounts:
        - name: host-lib-volume
          mountPath: /opt/mr3-run/host-lib
      volumes:
      - name: host-lib-volume
        hostPath:
          path: /home/ec2-user/lib
          type: Directory

Since the directory /opt/mr3-run/host-lib is included in the classpath (see kubernetes/hive/hive/hive-setup.sh), the MySQL connector jar file is accessible to Metastore.
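The classpath mechanism can be illustrated with a simplified sketch: a setup script typically appends every jar file found under the mounted directory to the classpath. The snippet below mimics this in a hypothetical form (it is not the actual content of kubernetes/hive/hive/hive-setup.sh, and the temporary directory stands in for /opt/mr3-run/host-lib).

```shell
#!/bin/sh
# Simplified sketch: collect every jar under a mounted host-lib
# directory into a CLASSPATH variable, as a setup script might do.
set -e

HOST_LIB=$(mktemp -d)                      # stands in for /opt/mr3-run/host-lib
touch "${HOST_LIB}/mysql-connector-java-8.0.12.jar"

CLASSPATH=""
for JAR in "${HOST_LIB}"/*.jar; do
  # Append each jar, separating entries with ':'.
  CLASSPATH="${CLASSPATH:+${CLASSPATH}:}${JAR}"
done

echo "${CLASSPATH}"
```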

Data transfer cost between multiple Availability Zones

By default, the command eksctl uses all Availability Zones (AZs) from the region specified by the field metadata/region. As a result, ContainerWorker Pods usually spread across multiple AZs and may not be collocated with the HiveServer2 Pod. The use of multiple AZs, however, can have an unintended consequence because Amazon charges for data transfer between different AZs ($0.01/GB to $0.02/GB). Specifically, ContainerWorkers exchange intermediate data very often and in large quantities, and the data transfer cost can be surprisingly high, sometimes surpassing the total cost of EC2 instances.

Thus the user may want to restrict an EKS cluster to a single AZ to avoid high data transfer costs, which can be done by updating kubernetes/eks/cluster.yaml as follows:

$ vi kubernetes/eks/cluster.yaml

availabilityZones: ["ap-northeast-2a", "ap-northeast-2b", "ap-northeast-2c"]

nodeGroups:
  - name: mr3-master
    availabilityZones: ["ap-northeast-2a"]
  - name: mr3-worker
    availabilityZones: ["ap-northeast-2a"]

If eksctl does not accept this update, upgrade it to the latest version.

Enabling autoscaling

In order to enable autoscaling, create an IAM policy EKSAutoScalingWorkerPolicy.

$ vi EKSAutoScalingWorkerPolicy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeLaunchTemplateVersions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
$ aws iam create-policy --policy-name EKSAutoScalingWorkerPolicy --policy-document file://EKSAutoScalingWorkerPolicy.json
{
    "Policy": {
        ...
        "Arn": "arn:aws:iam::111111111111:policy/EKSAutoScalingWorkerPolicy",
        ...
    }
}

Then use the ARN (Amazon Resource Name) of the IAM policy in the iam/attachPolicyARNs field of both node groups. (Without using the ARN for mr3-master, the user cannot check the status of the Kubernetes Autoscaler.)

  - name: mr3-master
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::111111111111:policy/EKSAutoScalingWorkerPolicy
  - name: mr3-worker
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::111111111111:policy/EKSAutoScalingWorkerPolicy

Accessing S3

If an IAM policy for accessing S3 buckets is available, include its ARN in the iam/attachPolicyARNs field of both node groups (similarly to including the ARN of the IAM policy EKSAutoScalingWorkerPolicy). Then every Pod is allowed to access S3 buckets.
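For instance, assuming a pre-existing policy named s3-access-policy (a hypothetical name), each node group specification would gain one more entry in iam/attachPolicyARNs:

```yaml
  - name: mr3-worker
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        # Hypothetical policy granting access to the S3 buckets in use.
        - arn:aws:iam::111111111111:policy/s3-access-policy
```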

For accessing S3 buckets when an IAM policy is not available at the time of creating an EKS cluster, see Accessing S3 Buckets.

Executing the command eksctl

We create the EKS cluster as follows.

$ eksctl create cluster -f kubernetes/eks/cluster.yaml
[ℹ]  using region ap-northeast-2
[ℹ]  setting availability zones to [ap-northeast-2c ap-northeast-2b ap-northeast-2a]
[✔]  EKS cluster "hive-mr3" in "ap-northeast-2" region is ready

The user can verify that a total of four nodes are available in the EKS cluster.

$ kubectl get nodes
NAME                                                STATUS   ROLES    AGE   VERSION
ip-192-168-57-153.ap-northeast-2.compute.internal   Ready    <none>   15m   v1.13.7-eks-c57ff8
ip-192-168-59-250.ap-northeast-2.compute.internal   Ready    <none>   14m   v1.13.7-eks-c57ff8
ip-192-168-8-69.ap-northeast-2.compute.internal     Ready    <none>   14m   v1.13.7-eks-c57ff8
ip-192-168-94-214.ap-northeast-2.compute.internal   Ready    <none>   14m   v1.13.7-eks-c57ff8

Configuring Kubernetes Autoscaler

If autoscaling is enabled, open eks/cluster-autoscaler-autodiscover.yaml and change the configuration for autoscaling if necessary. By default, the Kubernetes Autoscaler removes nodes that stay idle for 1 minute (as specified by --scale-down-unneeded-time).

$ vi eks/cluster-autoscaler-autodiscover.yaml

            - --scale-down-delay-after-add=5m
            - --scale-down-unneeded-time=1m

Start the Kubernetes Autoscaler.

$ kubectl apply -f kubernetes/eks/cluster-autoscaler-autodiscover.yaml
serviceaccount/cluster-autoscaler created
clusterrole.rbac.authorization.k8s.io/cluster-autoscaler created
role.rbac.authorization.k8s.io/cluster-autoscaler created
clusterrolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
rolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
deployment.apps/cluster-autoscaler created

The user can check that the Kubernetes Autoscaler has started properly.

$ kubectl logs -f deployment/cluster-autoscaler -n kube-system
I0717 08:40:39.817460       1 utils.go:543] Skipping ip-192-168-10-97.ap-northeast-2.compute.internal - no node group config
I0717 08:40:39.817502       1 static_autoscaler.go:393] Scale down status: unneededOnly=true lastScaleUpTime=2020-07-17 08:40:29.816626868 +0000 UTC m=+2.704022951 lastScaleDownDeleteTime=2020-07-17 08:40:29.816626974 +0000 UTC m=+2.704023060 lastScaleDownFailTime=2020-07-17 08:40:29.816627087 +0000 UTC m=+2.704023171 scaleDownForbidden=false isDeleteInProgress=false

Since the Kubernetes Autoscaler is configured to remove nodes that remain idle for 1 minute for fast scale-in, mr3.auto.scale.in.grace.period.secs in conf/mr3-site.xml can be set to 90 seconds (60 seconds of idle time and extra 30 seconds to account for delays). If the user wants to increase the value of --scale-down-unneeded-time in eks/cluster-autoscaler-autodiscover.yaml, the configuration key mr3.auto.scale.in.grace.period.secs should be adjusted accordingly.
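For example, with --scale-down-unneeded-time=1m as above, the corresponding setting in conf/mr3-site.xml reads:

```xml
<property>
  <name>mr3.auto.scale.in.grace.period.secs</name>
  <!-- 60 seconds of idle time plus 30 seconds of slack -->
  <value>90</value>
</property>
```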

Deleting the EKS cluster

Because of the additional components configured manually, it takes a few extra steps to delete the EKS cluster. In order to delete the EKS cluster (created with eksctl), proceed in the following order.

  1. If necessary, remove the new inbound rule (using NFS) in the security group for EFS.
  2. Delete EFS, either on the AWS console or with the AWS CLI.
    $ aws efs delete-file-system --file-system-id $EFSID
  3. Remove additional policies from the IAM roles for the mr3-master and mr3-worker node groups.
  4. Execute the command eksctl delete cluster -f kubernetes/eks/cluster.yaml.
    $ eksctl delete cluster -f kubernetes/eks/cluster.yaml

If the last command fails, the user should delete the EKS cluster manually. Proceed in the following order on the AWS console.

  1. Delete security groups manually.
  2. Delete the NAT gateway created for the EKS cluster, delete the VPC, and then delete the Elastic IP address.
  3. Delete the LoadBalancer.
  4. Delete IAM roles.
  5. Delete the CloudFormation stacks.