Hive on MR3 supports four different ways to access S3 buckets within an EKS cluster.

  1. Use environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
  2. Update IAM (Identity and Access Management) roles for node groups in the EKS cluster
  3. Use IAM roles for ServiceAccounts
  4. Use IAM roles for ServiceAccounts created by eksctl (e.g., on EKS)

Accessing S3 buckets with environment variables works the same way inside and outside AWS, so the user can follow the instructions in Accessing Amazon S3. The remaining three ways rely on IAM roles to manage access to S3.

2. Update IAM roles for node groups in the EKS cluster

If an IAM policy for accessing S3 buckets is available before creating an EKS cluster, the user can include its ARN in the iam/attachPolicyARNs field of node groups mr3-master and mr3-worker in kubernetes/eks/cluster.yaml. Then every Pod is allowed to access S3 buckets.

If an EKS cluster is created without using an IAM policy for accessing S3 buckets, find the IAM roles for the mr3-master and mr3-worker node groups (which typically look like eksctl-hive-mr3-nodegroup-mr3-mas-NodeInstanceRole-448MRIYIQ3F8 and eksctl-hive-mr3-nodegroup-mr3-wor-NodeInstanceRole-E19NHT8X0UJ7). For both IAM roles, add the following inline policy or its variant so that every Pod can access the target S3 bucket. Adjust the Action field to restrict the set of operations permitted to Pods.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::mr3-tpcds-partitioned-2-orc",
                "arn:aws:s3:::mr3-tpcds-partitioned-2-orc/*"
            ]
        }
    ]
}

Now all Pods can access the target S3 bucket.
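
As a minimal sketch of narrowing the Action field, the following shell function prints a more restrictive inline policy for a given bucket. The function name make_s3_policy, the file name s3-policy.json, the policy name mr3-s3-access, and the particular choice of actions are assumptions made for illustration and are not taken from Hive on MR3; the actions that a given workload needs may differ, so treat the list as a starting point.

```shell
# Hypothetical helper: print an inline policy that allows listing the
# bucket and reading, writing, and deleting its objects (instead of s3:*).
make_s3_policy() {
  bucket="$1"
  cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": ["arn:aws:s3:::${bucket}"]
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:AbortMultipartUpload"],
            "Resource": ["arn:aws:s3:::${bucket}/*"]
        }
    ]
}
EOF
}

# Write the policy for the bucket used in the example above.
make_s3_policy mr3-tpcds-partitioned-2-orc > s3-policy.json

# To attach the file as an inline policy, run the following once for each
# node group role (ROLE_NAME is a placeholder for the actual role name):
#   aws iam put-role-policy --role-name ROLE_NAME \
#       --policy-name mr3-s3-access --policy-document file://s3-policy.json
```

Bucket-level actions and object-level actions are placed in separate statements because they apply to different resources (the bucket ARN and the ARN ending with /*).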

Depending on the ownership of the target S3 bucket, the user may also have to create a bucket policy. If the target S3 bucket is owned by the same user creating the EKS cluster, a bucket policy is unnecessary.

3. Use IAM roles for ServiceAccounts

By default, Hive on MR3 creates three ServiceAccounts specified by hive-service-account.yaml, master-service-account.yaml, and worker-service-account.yaml in the directory kubernetes/yaml.

  • ServiceAccount hive-service-account for Metastore and HiveServer2 Pods
  • ServiceAccount master-service-account for DAGAppMaster Pod
  • ServiceAccount worker-service-account for ContainerWorker Pods

If the EKS cluster has enabled IAM roles for ServiceAccounts, the user can create an IAM role with a policy for accessing the target S3 bucket and associate it with these ServiceAccounts. Then every Pod can access the target S3 bucket.

  • Enable IAM roles for ServiceAccounts by creating an OIDC identity provider. For more information, see the AWS User Guide.
    $ eksctl utils associate-iam-oidc-provider --cluster hive-mr3 --approve
    [ℹ]  eksctl version 0.27.0
    [ℹ]  using region ap-northeast-1
    [ℹ]  will create IAM Open ID Connect provider for cluster "hive-mr3" in "ap-northeast-1"
    [✔]  created IAM Open ID Connect provider for cluster "hive-mr3" in "ap-northeast-1"
    
  • Create an IAM role with a policy for accessing the target S3 bucket. The user may follow the instructions in the AWS User Guide, but should not manually create a new ServiceAccount with eksctl because Hive on MR3 creates the ServiceAccounts itself.
  • Associate the IAM role with ServiceAccounts by adding an annotation. The following example shows how to add an annotation in hive-service-account.yaml where NEW_IAM_ROLE_NAME is the name of the IAM role created in the previous step.
    $ vi kubernetes/yaml/hive-service-account.yaml
    
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      namespace: hivemr3
      name: hive-service-account
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/NEW_IAM_ROLE_NAME
    
  • Set the configuration key fs.s3a.aws.credentials.provider to com.amazonaws.auth.InstanceProfileCredentialsProvider in kubernetes/conf/core-site.xml.
    $ vi kubernetes/conf/core-site.xml
    
    <property>
      <name>fs.s3a.aws.credentials.provider</name>
      <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
    </property>
    
  • If necessary (on Kubernetes 1.18 and earlier), rebuild the Docker image so that all containers run as user root.
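
The annotation example in the steps above covers only hive-service-account.yaml. Assuming that the same IAM role should be associated with all three ServiceAccounts, a quick check such as the following can report any manifest that still lacks the annotation. This is a sketch only, and the function name missing_role_annotation is hypothetical.

```shell
# Hypothetical helper: print every ServiceAccount manifest in the given
# directory that has no eks.amazonaws.com/role-arn annotation.
# A manifest that does not exist is reported as well.
missing_role_annotation() {
  dir="$1"
  for name in hive master worker; do
    file="${dir}/${name}-service-account.yaml"
    if ! grep -qF 'eks.amazonaws.com/role-arn:' "$file" 2>/dev/null; then
      echo "$file"
    fi
  done
}

# Usage from the top-level directory of the installation:
#   missing_role_annotation kubernetes/yaml
```

The function prints nothing when all three manifests carry the annotation.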

4. Use IAM roles for ServiceAccounts created by eksctl

Alternatively, the user can create ServiceAccounts with eksctl and use WebIdentityTokenCredentialsProvider instead of InstanceProfileCredentialsProvider. On EKS, we recommend the use of WebIdentityTokenCredentialsProvider.

  • Set the environment variable CREATE_SERVICE_ACCOUNTS to false in kubernetes/env.sh (because we will create ServiceAccounts with eksctl later). When using Helm, set the field create/serviceAccount to false in values.yaml.
    $ vi kubernetes/env.sh
    
    CREATE_SERVICE_ACCOUNTS=false
    
  • Set the environment variable AWS_REGION to a string representing the AWS region in kubernetes/env.sh.
    $ export AWS_REGION=ap-northeast-1  # to be able to execute 'eksctl' without '--region' 
    $ vi kubernetes/env.sh
    
    export AWS_REGION=ap-northeast-1
    

    Append AWS_REGION to the values of the configuration keys mr3.am.launch.env and mr3.container.launch.env in kubernetes/conf/mr3-site.xml.

    $ vi kubernetes/conf/mr3-site.xml
    
    <property>
      <name>mr3.am.launch.env</name>
      <value>LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native/,HADOOP_CREDSTORE_PASSWORD,AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,AWS_REGION,USE_JAVA_17</value>
    </property>
    
    <property>
      <name>mr3.container.launch.env</name>
      <value>LD_LIBRARY_PATH=/opt/mr3-run/hadoop/apache-hadoop/lib/native,HADOOP_CREDSTORE_PASSWORD,AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,AWS_REGION,USE_JAVA_17</value>
    </property>
    

    Without the environment variable AWS_REGION set appropriately, WebIdentityTokenCredentialsProvider fails with the following error:

    WARNING: Unable to retrieve the requested metadata (/latest/dynamic/instance-identity/document). Failed to connect to service endpoint:
    com.amazonaws.SdkClientException: Failed to connect to service endpoint:
      at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:100)
    ...
      at com.amazonaws.util.EC2MetadataUtils.getEC2InstanceRegion(EC2MetadataUtils.java:282)
    ...
      at com.amazonaws.auth.WebIdentityTokenCredentialsProvider.getCredentials(WebIdentityTokenCredentialsProvider.java:76)
      at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:117)
    
  • Enable IAM roles for ServiceAccounts by creating an OIDC identity provider.
    $ eksctl utils associate-iam-oidc-provider --cluster hive-mr3 --approve
    
  • Create an IAM policy for accessing the target S3 bucket.
  • Create ServiceAccounts with eksctl, attaching the IAM policy (e.g., arn:aws:iam::111111111111:policy/s3). eksctl creates an IAM role for each ServiceAccount.
    $ eksctl create iamserviceaccount --name hive-service-account --namespace hivemr3 --cluster hive-mr3 --attach-policy-arn arn:aws:iam::111111111111:policy/s3 --approve --override-existing-serviceaccounts
    [ℹ]  eksctl version 0.27.0
    [ℹ]  using region ap-northeast-1
    [ℹ]  1 iamserviceaccount (hivemr3/hive-service-account) was included (based on the include/exclude rules)
    [!]  metadata of serviceaccounts that exist in Kubernetes will be updated, as --override-existing-serviceaccounts was set
    [ℹ]  1 task: { 2 sequential sub-tasks: { create IAM role for serviceaccount "hivemr3/hive-service-account", create serviceaccount "hivemr3/hive-service-account" } }
    [ℹ]  building iamserviceaccount stack "eksctl-hive-mr3-addon-iamserviceaccount-hivemr3-hive-service-account"
    [ℹ]  deploying stack "eksctl-hive-mr3-addon-iamserviceaccount-hivemr3-hive-service-account"
    [ℹ]  created namespace "hivemr3"
    [ℹ]  created serviceaccount "hivemr3/hive-service-account"
    $ eksctl create iamserviceaccount --name master-service-account --namespace hivemr3 --cluster hive-mr3 --attach-policy-arn arn:aws:iam::111111111111:policy/s3 --approve --override-existing-serviceaccounts
    $ eksctl create iamserviceaccount --name worker-service-account --namespace hivemr3 --cluster hive-mr3 --attach-policy-arn arn:aws:iam::111111111111:policy/s3 --approve --override-existing-serviceaccounts
    
    $ eksctl get iamserviceaccount --namespace hivemr3 --cluster hive-mr3 
    NAMESPACE	NAME			ROLE ARN
    hivemr3		hive-service-account	arn:aws:iam::111111111111:role/eksctl-hive-mr3-addon-iamserviceaccount-hive-Role1-RERICJ8FK7AM
    hivemr3		master-service-account	arn:aws:iam::111111111111:role/eksctl-hive-mr3-addon-iamserviceaccount-hive-Role1-Z3SPAHKYB1UI
    hivemr3		worker-service-account	arn:aws:iam::111111111111:role/eksctl-hive-mr3-addon-iamserviceaccount-hive-Role1-18BQ9YHYM8JTV
    
  • Set the configuration key fs.s3a.aws.credentials.provider to com.amazonaws.auth.WebIdentityTokenCredentialsProvider in kubernetes/conf/core-site.xml.
    $ vi kubernetes/conf/core-site.xml
    
    <property>
      <name>fs.s3a.aws.credentials.provider</name>
      <value>com.amazonaws.auth.WebIdentityTokenCredentialsProvider</value>
    </property>
    
  • If necessary (on Kubernetes 1.18 and earlier), rebuild the Docker image so that all containers run as user root.
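
The three eksctl create iamserviceaccount commands in the steps above differ only in the ServiceAccount name. As a sketch, they can be generated with a loop, which keeps the cluster name, namespace, and policy ARN consistent. The function name print_iamserviceaccount_commands is hypothetical; the function only prints the commands and does not run eksctl.

```shell
# Hypothetical helper: print one 'eksctl create iamserviceaccount' command
# for each ServiceAccount used by Hive on MR3, without executing anything.
print_iamserviceaccount_commands() {
  cluster="$1"
  namespace="$2"
  policy_arn="$3"
  for sa in hive-service-account master-service-account worker-service-account; do
    echo "eksctl create iamserviceaccount --name ${sa}" \
         "--namespace ${namespace} --cluster ${cluster}" \
         "--attach-policy-arn ${policy_arn}" \
         "--approve --override-existing-serviceaccounts"
  done
}

# Values taken from the example above.
print_iamserviceaccount_commands hive-mr3 hivemr3 arn:aws:iam::111111111111:policy/s3
```

After reviewing the printed commands, the user can execute them by piping the output to sh.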