We use the command eksctl to create an EKS cluster with two node groups: mr3-master and mr3-worker.
The mr3-master node group is intended for those Pods that should always be running, i.e., HiveServer2, DAGAppMaster, Metastore, Ranger, and Timeline Server Pods. Hence on-demand instances are appropriate for the mr3-master node group so as to minimize the chance of temporary service disruption.
The mr3-worker node group is intended for ContainerWorker Pods. The following two properties of ContainerWorker Pods make it desirable to create mr3-worker as a separate node group:
- Thanks to the fault-tolerance optimization implemented in MR3, preempting ContainerWorker Pods does not incur considerable delays in completing queries. Hence spot instances are acceptable for running ContainerWorker Pods.
- ContainerWorker Pods need local disks for writing intermediate data. Hence the user may choose to provision local disks on every node in the mr3-worker node group.
In our example, we allocate a single on-demand instance to the mr3-master node group and three spot instances to the mr3-worker node group. For the mr3-worker node group, we may choose instance types that are equipped with a single local disk.
Here is a sample specification (kubernetes/eks/cluster.yaml) of an EKS cluster to be used as input to the command eksctl.
$ vi kubernetes/eks/cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: hive-mr3
  region: ap-northeast-2

vpc:
  nat:
    gateway: Single

nodeGroups:
  - name: mr3-master
    instanceType: m5.large
    labels: { roles: masters }
    desiredCapacity: 1
    ssh:
      allow: true
  - name: mr3-worker
    instanceType: mixed
    labels: { roles: workers }
    desiredCapacity: 3
    minSize: 1
    maxSize: 4
    instancesDistribution:
      maxPrice: 0.068
      instanceTypes: ["m5.large", "m5.xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotInstancePools: 4
Using instance storage
ContainerWorkers of MR3 write intermediate data, such as output of TaskAttempts or input to TaskAttempts fetched through shuffle handlers, to local disks. In the case of running Hive on MR3 on Kubernetes, there are three ways to simulate local disks for ContainerWorker Pods:
- Use emptyDir volumes specified by the configuration key mr3.k8s.pod.worker.emptydirs. Each directory in the configuration value is mapped to an emptyDir volume.
- Use hostPath volumes specified by the configuration key mr3.k8s.pod.worker.hostpaths. Each directory in the configuration value (which should be ready on the host node) is mapped to a hostPath volume.
- Use persistentVolumeClaim volumes (mounting PersistentVolumes) specified by the configuration key mr3.k8s.worker.local.dir.persistentvolumes along with mr3.k8s.local.dir.persistentvolume.storageclass and mr3.k8s.local.dir.persistentvolume.storage. Each directory in the configuration value is mapped to a persistentVolumeClaim volume which is created dynamically according to the storage class specified by mr3.k8s.local.dir.persistentvolume.storageclass (e.g., gp2 for EBS) and the size specified by mr3.k8s.local.dir.persistentvolume.storage (e.g., 2Gi). A sample configuration for this option is sketched after this list.
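The following is a minimal sketch of the third option in kubernetes/conf/mr3-site.xml. The directory /data1 is a hypothetical mount point, and gp2 and 2Gi are just the example values mentioned above.
<property>
  <name>mr3.k8s.worker.local.dir.persistentvolumes</name>
  <!-- /data1 is a hypothetical directory used only for illustration -->
  <value>/data1</value>
</property>
<property>
  <name>mr3.k8s.local.dir.persistentvolume.storageclass</name>
  <value>gp2</value>
</property>
<property>
  <name>mr3.k8s.local.dir.persistentvolume.storage</name>
  <value>2Gi</value>
</property>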
For Amazon EKS, the first option is okay only for running small queries with a low concurrency level because of the small size of the root partition (20GB by default). Alternatively, the user can increase the size of the root partition with the --node-volume-size flag of eksctl when creating an EKS cluster, as sketched below.
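For example, when the cluster is created from a configuration file, the volumeSize field of a node group plays the role of the --node-volume-size flag; the value 100 below is only an illustration.
nodeGroups:
  - name: mr3-worker
    # root volume size in GiB (equivalent to --node-volume-size); 100 is an example value
    volumeSize: 100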
The second option offers the best performance (at a slightly higher cost) if we use instance storage which is physically attached to the host node.
The third option works only in limited cases: 1) the EKS cluster should be running in a single Availability Zone, and 2) Docker containers should run as a root user.
In order to use the second option, the user should create an EKS cluster whose mr3-worker node group uses an EC2 instance type that is equipped with instance storage. In addition, the preBootstrapCommands field in the specification of the mr3-worker node group should include commands for formatting and mounting the instance storage. For example, for the instance type m5d.xlarge, which has a single NVMe disk, we can extend the specification of the EKS cluster as follows.
$ vi kubernetes/eks/cluster.yaml
nodeGroups:
  - name: mr3-worker
    preBootstrapCommands:
      - "mkfs -t ext4 /dev/nvme1n1"
      - "mkdir -p /ephemeral1"
      - "mount /dev/nvme1n1 /ephemeral1"
      - "chown ec2-user:ec2-user /ephemeral1"
Here the sequence of commands in the preBootstrapCommands field formats the partition residing on the NVMe disk and mounts it to the directory /ephemeral1. (It does not help to extend /etc/eks/bootstrap.sh in the Amazon Machine Image (AMI) because eksctl does not read the file.)
The following example illustrates how to initialize instance storage to mount multiple local disks.
$ vi kubernetes/eks/cluster.yaml
nodeGroups:
  - name: mr3-worker
    preBootstrapCommands:
      - IDX=1
      - for DEV in /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_*-ns-1; do mkfs.xfs ${DEV}; mkdir -p /ephemeral${IDX}; echo ${DEV} /ephemeral${IDX} xfs defaults,noatime 1 2 >> /etc/fstab; IDX=$((${IDX} + 1)); done
      - mount -a
Then the user can use /ephemeral1 for the configuration key mr3.k8s.pod.worker.hostpaths in kubernetes/conf/mr3-site.xml.
$ vi kubernetes/conf/mr3-site.xml
<property>
  <name>mr3.k8s.pod.worker.hostpaths</name>
  <value>/ephemeral1</value>
</property>
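If several local disks are mounted (e.g., /ephemeral1 and /ephemeral2 from the multi-disk example above), all of them can be listed in the configuration value; the comma-separated form below is an assumption about the value format, not taken from the example above.
<property>
  <name>mr3.k8s.pod.worker.hostpaths</name>
  <!-- comma-separated list of directories; assumes /ephemeral2 is also mounted -->
  <value>/ephemeral1,/ephemeral2</value>
</property>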
Downloading a MySQL connector
Running Metastore on Amazon EKS is essentially no different from running it on Kubernetes in general. As an additional step specific to EKS, however, the user should make a MySQL connector available in every Pod, such as a Metastore Pod, that requires connections to MySQL. If the Docker image does not contain such a jar file and Metastore does not automatically download such a jar file, there are two ways to make a MySQL connector available in a Pod:
- We use the preBootstrapCommands field in the specification of the mr3-master node group to automatically download a MySQL connector jar file.
- The user manually copies a MySQL connector jar file to the PersistentVolumeClaim workdir-pvc.
Here we explain the first approach. The second approach is explained in Creating a PersistentVolume using EFS.
In order to automatically download a MySQL connector jar file, we extend the specification of the EKS cluster (where your.server.address should be replaced appropriately).
$ vi kubernetes/eks/cluster.yaml
nodeGroups:
  - name: mr3-master
    preBootstrapCommands:
      - "wget http://your.server.address/mysql-connector-java-8.0.12.jar"
      - "mkdir -p /home/ec2-user/lib"
      - "mv mysql-connector-java-8.0.12.jar /home/ec2-user/lib"
Now every node instance in the mr3-master node group starts with mysql-connector-java-8.0.12.jar in the directory /home/ec2-user/lib.
Next we extend kubernetes/yaml/metastore.yaml to create a hostPath volume and mount it under the directory /opt/mr3-run/host-lib.
$ vi kubernetes/yaml/metastore.yaml
spec:
  template:
    spec:
      containers:
        volumeMounts:
        - name: host-lib-volume
          mountPath: /opt/mr3-run/host-lib
      volumes:
      - name: host-lib-volume
        hostPath:
          path: /home/ec2-user/lib
          type: Directory
Since the directory /opt/mr3-run/host-lib is included in the classpath (see kubernetes/hive/hive/hive-setup.sh), the MySQL connector jar file is accessible to Metastore.
Data transfer cost between multiple Availability Zones
By default, the command eksctl uses all Availability Zones (AZs) from the region specified by the field metadata/region. As a result, ContainerWorker Pods usually spread across multiple AZs and may not be collocated with the HiveServer2 Pod. The use of multiple AZs, however, can have an unintended consequence because Amazon charges for data transfer between different AZs ($0.01/GB to $0.02/GB). Specifically, ContainerWorkers exchange intermediate data very often and in large quantities, and the data transfer cost can be surprisingly high, sometimes surpassing the total cost of EC2 instances. Thus the user may want to restrict an EKS cluster to a single AZ and avoid high data transfer costs.
The user can use a single AZ by updating eks/cluster.yaml as follows:
$ vi kubernetes/eks/cluster.yaml
availabilityZones: ["ap-northeast-2a", "ap-northeast-2b", "ap-northeast-2c"]
nodeGroups:
  - name: mr3-master
    availabilityZones: ["ap-northeast-2a"]
  - name: mr3-worker
    availabilityZones: ["ap-northeast-2a"]
If eksctl does not accept this update, upgrade it to the latest version.
Enabling autoscaling
In order to enable autoscaling, create an IAM policy EKSAutoScalingWorkerPolicy.
$ vi EKSAutoScalingWorkerPolicy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
$ aws iam create-policy --policy-name EKSAutoScalingWorkerPolicy --policy-document file://EKSAutoScalingWorkerPolicy.json
{
    "Policy": {
        ...
        "Arn": "arn:aws:iam::111111111111:policy/EKSAutoScalingWorkerPolicy",
        ...
Then use the ARN (Amazon Resource Name) of the IAM policy in the iam/attachPolicyARNs field of both node groups. (Without using the ARN for mr3-master, the user cannot check the status of the Kubernetes Autoscaler.)
nodeGroups:
  - name: mr3-master
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::111111111111:policy/EKSAutoScalingWorkerPolicy
  - name: mr3-worker
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::111111111111:policy/EKSAutoScalingWorkerPolicy
Accessing S3
If an IAM policy for accessing S3 buckets is available, include its ARN in the iam/attachPolicyARNs field of both node groups (similarly to including the ARN of the IAM policy EKSAutoScalingWorkerPolicy). Then every Pod is allowed to access S3 buckets.
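For example, assuming the AWS managed policy AmazonS3ReadOnlyAccess (or a custom policy with the necessary permissions) is sufficient, the node groups might be specified as follows; replace the last ARN with the ARN of your own policy.
nodeGroups:
  - name: mr3-master
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        # example ARN for S3 access; substitute the ARN of your own IAM policy
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
  - name: mr3-worker
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess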
For accessing S3 buckets when an IAM policy is not available at the time of creating an EKS cluster, see Accessing S3 Buckets.
Executing the command eksctl
We create the EKS cluster as follows.
$ eksctl create cluster -f kubernetes/eks/cluster.yaml
[ℹ] using region ap-northeast-2
[ℹ] setting availability zones to [ap-northeast-2c ap-northeast-2b ap-northeast-2a]
...
[✔] EKS cluster "hive-mr3" in "ap-northeast-2" region is ready
The user can verify that a total of four nodes are available in the EKS cluster.
$ kubectl get nodes
NAME                                                STATUS   ROLES    AGE   VERSION
ip-192-168-57-153.ap-northeast-2.compute.internal   Ready    <none>   15m   v1.13.7-eks-c57ff8
ip-192-168-59-250.ap-northeast-2.compute.internal   Ready    <none>   14m   v1.13.7-eks-c57ff8
ip-192-168-8-69.ap-northeast-2.compute.internal     Ready    <none>   14m   v1.13.7-eks-c57ff8
ip-192-168-94-214.ap-northeast-2.compute.internal   Ready    <none>   14m   v1.13.7-eks-c57ff8
Configuring Kubernetes Autoscaler
If autoscaling is enabled, open eks/cluster-autoscaler-autodiscover.yaml and change the configuration for autoscaling if necessary. By default, the Kubernetes Autoscaler removes nodes that stay idle for 1 minute (as specified by --scale-down-unneeded-time).
$ vi eks/cluster-autoscaler-autodiscover.yaml
spec:
  template:
    spec:
      containers:
        command:
          - --scale-down-delay-after-add=5m
          - --scale-down-unneeded-time=1m
Start the Kubernetes Autoscaler.
$ kubectl apply -f kubernetes/eks/cluster-autoscaler-autodiscover.yaml
serviceaccount/cluster-autoscaler created
clusterrole.rbac.authorization.k8s.io/cluster-autoscaler created
role.rbac.authorization.k8s.io/cluster-autoscaler created
clusterrolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
rolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
deployment.apps/cluster-autoscaler created
The user can check that the Kubernetes Autoscaler has started properly.
$ kubectl logs -f deployment/cluster-autoscaler -n kube-system
...
I0717 08:40:39.817460 1 utils.go:543] Skipping ip-192-168-10-97.ap-northeast-2.compute.internal - no node group config
I0717 08:40:39.817502 1 static_autoscaler.go:393] Scale down status: unneededOnly=true lastScaleUpTime=2020-07-17 08:40:29.816626868 +0000 UTC m=+2.704022951 lastScaleDownDeleteTime=2020-07-17 08:40:29.816626974 +0000 UTC m=+2.704023060 lastScaleDownFailTime=2020-07-17 08:40:29.816627087 +0000 UTC m=+2.704023171 scaleDownForbidden=false isDeleteInProgress=false
Since the Kubernetes Autoscaler is configured to remove nodes that remain idle for 1 minute for fast scale-in, mr3.auto.scale.in.grace.period.secs in conf/mr3-site.xml can be set to 90 seconds (60 seconds of idle time and an extra 30 seconds to account for delays). If the user wants to increase the value of --scale-down-unneeded-time in eks/cluster-autoscaler-autodiscover.yaml, the configuration key mr3.auto.scale.in.grace.period.secs should be adjusted accordingly.
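For instance, the following property in conf/mr3-site.xml matches the default setting of 1 minute above.
<property>
  <name>mr3.auto.scale.in.grace.period.secs</name>
  <!-- 60 seconds of idle time plus 30 seconds of slack -->
  <value>90</value>
</property>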
Deleting the EKS cluster
Because of the additional components configured manually, it takes a few extra steps to delete the EKS cluster. In order to delete the EKS cluster (created with eksctl), proceed in the following order.
- If necessary, remove the new inbound rule (using NFS) in the security group for EFS.
- Delete EFS on the AWS console.
  $ aws efs delete-file-system --file-system-id $EFSID
- Remove additional policies from the IAM roles for the mr3-master and mr3-worker node groups (a sketch follows this list).
- Execute the command eksctl delete cluster -f kubernetes/eks/cluster.yaml.
  $ eksctl delete cluster -f kubernetes/eks/cluster.yaml
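The policies can be detached on the AWS console or with the AWS CLI. The sketch below assumes a hypothetical role name; look up the actual names of the node group roles created by eksctl first.
# the role name below is hypothetical; repeat for the mr3-worker node group role
$ aws iam detach-role-policy \
    --role-name eksctl-hive-mr3-nodegroup-mr3-master-NodeInstanceRole-XXXX \
    --policy-arn arn:aws:iam::111111111111:policy/EKSAutoScalingWorkerPolicy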
If the last command fails, the user should delete the EKS cluster manually. Proceed in the following order on the AWS console.
- Delete security groups manually.
- Delete the NAT gateway created for the EKS cluster, delete the VPC, and then delete the Elastic IP address.
- Delete the LoadBalancer.
- Delete IAM roles.
- Delete the CloudFormation stacks.