Prerequisites

Using MR3 Cloud on Amazon EKS has the following prerequisites:

  1. The user can create IAM policies.
  2. The user has access to an S3 bucket storing the warehouse and all S3 buckets containing datasets.
  3. The user can create an EKS cluster with the command eksctl.
  4. The user can configure LoadBalancers.
  5. The user can create EFS.
  6. A database server for Metastore is ready and accessible from the EKS cluster.
  7. A database server for Ranger is ready and accessible from the EKS cluster. The same database server may be used for both Metastore and Ranger.
  8. The user can run Beeline to connect to HiveServer2 running at a given address.

The user may create new resources (such as IAM policies) either on the AWS console or with the AWS CLI.
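
Before proceeding, it is worth checking that the command-line tools used throughout this guide are installed. For example (a quick sanity check; the reported versions will differ):

$ aws --version
$ eksctl version
$ kubectl version --client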

Load

After loading a configuration, the menu bar at the top shows several menus, each colored either red or grey.

topmenu

A red menu indicates that some parameters are wrong or missing, and a grey menu indicates that the page has not been visited yet. After the corresponding YAML file is downloaded, the menu turns green.

topmenu

The menu Connect turns green when all the other menus are green.

Download

When all the input fields are valid, press the Download button to download a YAML file. Then the user can execute the command eksctl or kubectl with the downloaded file.

download

The quick start guide Using MR3 Cloud on EKS contains more details.

EKS page

1. IAM policy for autoscaling

Create an IAM policy for autoscaling as shown below. Get the ARN (Amazon Resource Name) of the IAM policy. In our example, we create an IAM policy called EKSAutoScalingWorkerPolicy.

$ vi EKSAutoScalingWorkerPolicy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": ["*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeInstanceTypes",
        "eks:DescribeNodegroup"
      ],
      "Resource": ["*"]
    }
  ]
}

$ aws iam create-policy --policy-name EKSAutoScalingWorkerPolicy --policy-document file://EKSAutoScalingWorkerPolicy.json
{
    "Policy": {
...
        "Arn": "arn:aws:iam::111111111111:policy/EKSAutoScalingWorkerPolicy",
...

Use the ARN in the field Autoscaling Policy in the section IAM Policy.
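
If the ARN is not recorded at creation time, it can be retrieved later with the AWS CLI. For example, the following query (using the policy name EKSAutoScalingWorkerPolicy from above) prints the ARN:

$ aws iam list-policies --query "Policies[?PolicyName=='EKSAutoScalingWorkerPolicy'].Arn" --output text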

2. IAM policy for accessing S3 buckets

Create an IAM policy that allows every Pod to access the S3 buckets storing the warehouse and containing datasets. Adjust the Action field to restrict the set of operations permitted to Pods. Get the ARN of the IAM policy. In our example, we create an IAM policy called MR3AccessS3.

$ vi MR3AccessS3.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::hive-warehouse-dir",
                "arn:aws:s3:::hive-warehouse-dir/*",
                "arn:aws:s3:::hive-partitioned-1000-orc",
                "arn:aws:s3:::hive-partitioned-1000-orc/*"
            ]
        }
    ]
}

$ aws iam create-policy --policy-name MR3AccessS3 --policy-document file://MR3AccessS3.json
{
    "Policy": {
...
        "Arn": "arn:aws:iam::111111111111:policy/MR3AccessS3",
...

Use the ARN in the field S3 Access Policy in the section IAM Policy.
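
It is also worth confirming that the buckets listed in the policy are reachable with the current AWS credentials. For example, with the example bucket names above:

$ aws s3 ls s3://hive-warehouse-dir/
$ aws s3 ls s3://hive-partitioned-1000-orc/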

3. Creating an EKS cluster

Fill in all the input fields. Choose a unique name for an EKS cluster and use it in the field Name. Download a YAML file eks-cluster.yaml and execute the command eksctl.

$ eksctl create cluster -f eks-cluster.yaml
2022-05-24 17:23:31 [ℹ]  eksctl version 0.86.0
2022-05-24 17:23:31 [ℹ]  using region ap-northeast-2
2022-05-24 17:23:31 [ℹ]  setting availability zones to [ap-northeast-2c ap-northeast-2a ap-northeast-2d]
..
2022-05-24 17:39:49 [✔]  EKS cluster "hive-mr3" in "ap-northeast-2" region is ready

The user can verify that only a master node is available in the EKS cluster.

$ kubectl get nodes
NAME                                                STATUS   ROLES    AGE   VERSION
ip-192-168-55-210.ap-northeast-2.compute.internal   Ready    <none>   59s   v1.21.5-eks-9017834

Get the public IP address of the master node, which we may need when checking access to the database servers for Metastore and Ranger on the Applications page.

$ kubectl describe node ip-192-168-55-210.ap-northeast-2.compute.internal | grep -e InternalIP -e ExternalIP
  InternalIP:   192.168.55.210
  ExternalIP:   3.34.187.73
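
Alternatively, the external IP address can be extracted with a single jsonpath query (a convenience only; the output of kubectl describe node above is sufficient):

$ kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="ExternalIP")].address}'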

Autoscaler page

Download a YAML file autoscaler.yaml and execute the command kubectl to start the Kubernetes Autoscaler.

$ kubectl apply -f autoscaler.yaml
serviceaccount/cluster-autoscaler created
deployment.apps/cluster-autoscaler created
clusterrole.rbac.authorization.k8s.io/cluster-autoscaler created
clusterrolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
role.rbac.authorization.k8s.io/cluster-autoscaler created
rolebinding.rbac.authorization.k8s.io/cluster-autoscaler created

The user can check that the Kubernetes Autoscaler has started properly.

$ kubectl get pods -n kube-system | grep autoscaler
NAME                                READY   STATUS    RESTARTS   AGE
cluster-autoscaler-cbd5c6cf-msbpx   1/1     Running   0          18s
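
If the Pod does not reach the Running state, its logs usually explain why. For example (assuming the Deployment carries the label app=cluster-autoscaler, as in the standard Autoscaler manifest):

$ kubectl logs -n kube-system -l app=cluster-autoscaler --tail=20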

Services page

Choose a namespace and use it in the field Namespace. In order to use HTTPS when connecting to the Apache server, the user should provide an SSL certificate created with AWS Certificate Manager. Use its ARN in the field SSL Certificate ARN.
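
If the ARN of the certificate is not at hand, the certificates managed by AWS Certificate Manager can be listed with the AWS CLI. For example:

$ aws acm list-certificates --query "CertificateSummaryList[*].[DomainName,CertificateArn]" --output table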

Download a YAML file service.yaml and execute the command kubectl to create two LoadBalancer services. Later the LoadBalancer with LoadBalancerPort 8080 is connected to an Apache server, while the LoadBalancer with LoadBalancerPort 10001 is connected to HiveServer2.

$ kubectl create -f service.yaml
namespace/hivemr3 created
service/apache created
service/hiveserver2 created

$ aws elb describe-load-balancers
...
            "CanonicalHostedZoneName": "a75c6382cd52746b4bc0370f0495d3c8-1372639668.ap-northeast-2.elb.amazonaws.com",
            "CanonicalHostedZoneNameID": "ZWKZPGTI48KDX",
            "ListenerDescriptions": [
                {
                    "Listener": {
                        "Protocol": "TCP",
                        "LoadBalancerPort": 8080,
...
            "CanonicalHostedZoneName": "ac004dea1e4224b249b0ea88183d96c8-2015438666.ap-northeast-2.elb.amazonaws.com",
            "CanonicalHostedZoneNameID": "ZWKZPGTI48KDX",
            "ListenerDescriptions": [
                {
                    "Listener": {
                        "Protocol": "TCP",
                        "LoadBalancerPort": 10001,
...

Get the address and hostname of each service (which we use on the Applications page).

$ nslookup a75c6382cd52746b4bc0370f0495d3c8-1372639668.ap-northeast-2.elb.amazonaws.com
...
Name:	a75c6382cd52746b4bc0370f0495d3c8-1372639668.ap-northeast-2.elb.amazonaws.com
Address: 3.39.33.1
Name:	a75c6382cd52746b4bc0370f0495d3c8-1372639668.ap-northeast-2.elb.amazonaws.com
Address: 13.209.124.90

$ nslookup ac004dea1e4224b249b0ea88183d96c8-2015438666.ap-northeast-2.elb.amazonaws.com
...
Name:	ac004dea1e4224b249b0ea88183d96c8-2015438666.ap-northeast-2.elb.amazonaws.com
Address: 3.38.63.119
Name:	ac004dea1e4224b249b0ea88183d96c8-2015438666.ap-northeast-2.elb.amazonaws.com
Address: 13.209.183.4
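
Alternatively, the hostnames of the two LoadBalancer services can be read directly from Kubernetes (assuming the namespace hivemr3 from the example above); the EXTERNAL-IP column shows the hostname of each LoadBalancer:

$ kubectl get svc -n hivemr3 apache hiveserver2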

EFS page

1. Subnet ID and security group ID

Assuming that the name of the EKS cluster is hive-mr3 (specified in the field Name on the EKS page), get the VPC ID of the CloudFormation stack eksctl-hive-mr3-cluster.

$ aws ec2 describe-vpcs --filter Name=tag:aws:cloudformation:stack-name,Values=eksctl-hive-mr3-cluster --query "Vpcs[*].[VpcId]"
[
    [
        "vpc-06e54d3ea607cc43b"
    ]
]

$ VPCID=vpc-06e54d3ea607cc43b

Get the public subnet ID of the CloudFormation stack eksctl-hive-mr3-cluster.

$ aws ec2 describe-subnets --filter Name=vpc-id,Values=$VPCID Name=availability-zone,Values=ap-northeast-2a Name=tag:aws:cloudformation:stack-name,Values=eksctl-hive-mr3-cluster Name=tag:Name,Values="*Public*" --query "Subnets[*].[SubnetId]"
[
    [
        "subnet-07c676d17a301e4af"
    ]
]

$ SUBNETID=subnet-07c676d17a301e4af

Get the ID of the security group for the EKS cluster that matches the pattern eksctl-hive-mr3-cluster-ClusterSharedNodeSecurityGroup-*.

$ aws ec2 describe-security-groups --filters Name=vpc-id,Values=$VPCID Name=group-name,Values="eksctl-hive-mr3-cluster-ClusterSharedNodeSecurityGroup-*" --query "SecurityGroups[*].[GroupName,GroupId]"
[
    [
        "eksctl-hive-mr3-cluster-ClusterSharedNodeSecurityGroup-156X37EGV080",
        "sg-0280692de7b048468"
    ]
]

$ SGROUPALL=sg-0280692de7b048468

2. Creating and mounting EFS

Create EFS in the Availability Zone specified in the section General on the EKS page. Get the file system ID of EFS.

$ aws efs create-file-system --performance-mode generalPurpose --throughput-mode bursting --availability-zone-name ap-northeast-2a
...
    "FileSystemId": "fs-0226705cce380a0cd",
...

$ EFSID=fs-0226705cce380a0cd

Create a mount target using the subnet ID of the CloudFormation stack eksctl-hive-mr3-cluster and the security group ID for the EKS cluster. Get the mount target ID, which is necessary when deleting the EKS cluster.

$ aws efs create-mount-target --file-system-id $EFSID --subnet-id $SUBNETID --security-groups $SGROUPALL
...
    "MountTargetId": "fsmt-0cd125aee66e5d71a",
...

$ MOUNTID=fsmt-0cd125aee66e5d71a
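
The mount target takes a moment to become usable. Before proceeding, the user may confirm that its LifeCycleState is available:

$ aws efs describe-mount-targets --file-system-id $EFSID --query "MountTargets[*].[MountTargetId,LifeCycleState]"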

3. Creating a StorageClass

Use the EFS ID (not the mount target ID) in the field EFS ID. Download a YAML file efs.yaml and execute the command kubectl.

$ kubectl create -f efs.yaml
serviceaccount/efs-provisioner created
configmap/efs-provisioner created
deployment.apps/efs-provisioner created
storageclass.storage.k8s.io/aws-efs created
clusterrole.rbac.authorization.k8s.io/efs-provisioner-runner created
clusterrolebinding.rbac.authorization.k8s.io/run-efs-provisioner created
role.rbac.authorization.k8s.io/leader-locking-efs-provisioner created
rolebinding.rbac.authorization.k8s.io/leader-locking-efs-provisioner created

The user can find a new StorageClass aws-efs and a new Pod in the namespace specified on the Services page.

$ kubectl get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
aws-efs         example.com/aws-efs     Delete          Immediate              false                  28s
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  22m

$ kubectl get pods -n hivemr3
NAME                               READY   STATUS    RESTARTS   AGE
efs-provisioner-749fcdf477-v5jb5   1/1     Running   0          47s

Applications page

1. Access to the database servers for Metastore and Ranger

Check that the database servers for Metastore and Ranger are accessible from the master node. If a database server is running on AWS, the user may have to update its security group or VPC configuration.
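
One way to test connectivity from inside the EKS cluster is to run a temporary Pod. The following is only a sketch: the hostname mydatabase.example.com and port 3306 are placeholders for the actual database server, and it assumes a busybox image whose nc applet supports the -z option.

$ kubectl run dbtest -it --rm --restart=Never --image=busybox -- nc -zv -w 5 mydatabase.example.com 3306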

2. Start all the components

Fill in all the necessary input fields. Download a YAML file apps.yaml and execute the command kubectl.

$ kubectl create -f apps.yaml

The user can find that a PersistentVolumeClaim workdir-pvc is in use.

$ kubectl get pvc -n hivemr3
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
workdir-pvc   Bound    pvc-668a010b-05e8-408a-b726-98c3d1dc0fc3   100Gi      RWX            aws-efs        95s

Before any query is executed, a total of 9 Pods are running in the namespace specified on the Services page.

$ kubectl get pods -n hivemr3
NAME                                            READY   STATUS    RESTARTS   AGE
efs-provisioner-749fcdf477-v5jb5                1/1     Running   0          97m
hivemr3-apache-0                                1/1     Running   0          2m3s
hivemr3-hiveserver2-789bb49978-tb497            1/1     Running   0          2m4s
hivemr3-hiveserver2-internal-548d4454c4-wtx57   1/1     Running   0          2m4s
hivemr3-metastore-0                             1/1     Running   0          2m4s
hivemr3-ranger-0                                2/2     Running   0          2m4s
hivemr3-superset-0                              1/1     Running   0          2m4s
hivemr3-timeline-0                              4/4     Running   0          2m3s
mr3master-5848-0-8669b5564d-jbnzm               1/1     Running   0          102s
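
If any Pod stays in the Pending or CrashLoopBackOff state, describing it and inspecting its logs usually reveals the cause. For example (the Pod name is taken from the listing above):

$ kubectl describe pod -n hivemr3 hivemr3-metastore-0
$ kubectl logs -n hivemr3 hivemr3-metastore-0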

Connect page

cloud.connect

On the Connect page, the user can view the current configuration and save it to a JSON file under a given name.

In the left column, the user can find HiveServer2 connection URLs and links to Ranger, MR3-UI, Grafana, and Superset. Note that all the links point to the address of the Apache server or its subpaths.
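
As a final check, the user can run Beeline against one of the HiveServer2 connection URLs shown on this page. The following is only a sketch: it uses the hostname of the LoadBalancer for HiveServer2 from the Services page and assumes the user name hive; the exact URL (including any SSL options) should be taken from the Connect page.

$ beeline -u "jdbc:hive2://ac004dea1e4224b249b0ea88183d96c8-2015438666.ap-northeast-2.elb.amazonaws.com:10001/" -n hive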

Deleting the EKS cluster

Because of the additional components configured manually, it takes a few extra steps to delete the EKS cluster. Proceed in the following order.

  1. Delete all the components.
    $ kubectl delete -f apps.yaml
    
  2. Delete resources created automatically by Hive on MR3.
    $ kubectl -n hivemr3 delete configmap mr3conf-configmap-master mr3conf-configmap-worker
    $ kubectl -n hivemr3 delete svc service-master-5848-0 service-worker
    $ kubectl -n hivemr3 delete deployment --all
    $ kubectl -n hivemr3 delete pods --all
    
  3. Delete the resources for EFS.
    $ kubectl delete -f efs.yaml
    
  4. Delete the services.
    $ kubectl delete -f service.yaml
    
  5. Remove the mount target for EFS.
    $ aws efs delete-mount-target --mount-target-id $MOUNTID
    
  6. Delete EFS if necessary. Note that the same EFS can be reused for the next installation of Hive on MR3.
    $ aws efs delete-file-system --file-system-id $EFSID
    
  7. Stop the Kubernetes Autoscaler.
    $ kubectl delete -f autoscaler.yaml
    
  8. Delete the EKS cluster with eksctl.
    $ eksctl delete cluster -f eks-cluster.yaml 
    

If the last command fails, the user should delete the EKS cluster manually. Proceed in the following order on the AWS console.

  1. Delete the security groups.
  2. Delete the NAT gateway created for the EKS cluster, delete the VPC, and then release the Elastic IP address.
  3. Delete the LoadBalancers.
  4. Delete the CloudFormation stacks.
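
After the deletion completes (by either method), the user can confirm that no EKS cluster remains (the region ap-northeast-2 is taken from the example above):

$ eksctl get cluster --region ap-northeast-2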