The user can access datasets on Amazon S3 (Simple Storage Service) from outside Amazon AWS by providing AWS credentials. For security reasons, we specify AWS credentials in environment variables in kubernetes/env.sh and do not specify them in a configuration file such as kubernetes/conf/core-site.xml. As kubernetes/env.sh may contain AWS credentials, we mount it as a Secret, not a ConfigMap, inside Metastore and HiveServer2 Pods.
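For reference, mounting a file as a Secret corresponds to a kubectl command along the following lines (a sketch only; the scripts under kubernetes/ create the Secret automatically, and the Secret name and namespace shown here are hypothetical):
$ kubectl create secret generic env-secret --namespace=hivemr3 --from-file=kubernetes/env.sh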
To access S3, the user should compile Tez with an additional option -P aws (see Building Hive on MR3).
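Since -P aws selects a Maven profile, the Tez build command takes a form like the following (a sketch only; see Building Hive on MR3 for the exact command and arguments):
$ mvn clean package -DskipTests -P aws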
Accessing Amazon S3
In order to access S3, the user should take three steps. First, set the configuration key fs.s3a.aws.credentials.provider in kubernetes/conf/core-site.xml.
$ vi kubernetes/conf/core-site.xml
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.amazonaws.auth.EnvironmentVariableCredentialsProvider</value>
</property>
The class EnvironmentVariableCredentialsProvider attempts to read AWS credentials from the two environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. (Currently MR3 does not support the class org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.)
Next, set the two environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in kubernetes/env.sh.
$ vi kubernetes/env.sh
export AWS_ACCESS_KEY_ID=_your_aws_access_key_id_
export AWS_SECRET_ACCESS_KEY=_your_aws_secret_access_key_
Since kubernetes/env.sh is mounted as a Secret inside Metastore and HiveServer2 Pods, it is safe to write the AWS access key ID and secret access key in kubernetes/env.sh.
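Before starting Hive on MR3, the user can quickly check that the credentials are valid, e.g., with the AWS CLI after sourcing kubernetes/env.sh (an illustration only; the AWS CLI is not required by MR3, and the bucket name is hypothetical):
$ source kubernetes/env.sh
$ aws s3 ls s3://your-bucket/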
Optionally, the user may use an S3 bucket as the data warehouse by setting the environment variable HIVE_WAREHOUSE_DIR.
$ vi kubernetes/env.sh
HIVE_WAREHOUSE_DIR=s3a://your-warehouse-dir/warehouse # optional
Note that using an S3 bucket as the data warehouse is reasonable if databases are also stored on S3, but it is not a strict requirement, because MR3 is agnostic to the type of data source. For example, the user may use HDFS for the data warehouse while accessing external tables on S3, or conversely use S3 for the data warehouse while accessing external tables on HDFS.
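For example, the following setting (with a hypothetical HDFS NameNode address) keeps the data warehouse on HDFS while external tables on S3 remain accessible:
$ vi kubernetes/env.sh
HIVE_WAREHOUSE_DIR=hdfs://your-namenode:8020/user/hive/warehouse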
Finally, append AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the values of the configuration keys mr3.am.launch.env and mr3.container.launch.env in kubernetes/conf/mr3-site.xml.
Note that for security reasons, the user should NOT write the actual values of the AWS access key ID and secret access key here. Appending just the two names suffices because MR3 automatically sets the two environment variables by reading them from the system environment.
$ vi kubernetes/conf/mr3-site.xml
<property>
  <name>mr3.am.launch.env</name>
  <value>LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native/,HADOOP_CREDSTORE_PASSWORD,AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,USE_JAVA_17</value>
</property>
<property>
  <name>mr3.container.launch.env</name>
  <value>LD_LIBRARY_PATH=/opt/mr3-run/hadoop/apache-hadoop/lib/native,HADOOP_CREDSTORE_PASSWORD,AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,USE_JAVA_17</value>
</property>
If the user creates a Kubernetes cluster inside Amazon AWS (e.g., by using EC2 instances), using the class InstanceProfileCredentialsProvider for the configuration key fs.s3a.aws.credentials.provider may be enough.
$ vi kubernetes/conf/core-site.xml
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>
In this case, there is no need to specify AWS credentials in kubernetes/env.sh.
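As a quick check (purely illustrative and not required by MR3), the user can query the EC2 instance metadata service on a node to confirm that an IAM role is attached; note that with IMDSv2, a session token must be obtained first:
$ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/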
Accessing S3-compatible storage
If the user wants to access custom S3-compatible storage, additional configuration keys should be set in kubernetes/conf/core-site.xml. In particular, the configuration key fs.s3a.endpoint should be set to point to the storage server. Here is an example of setting configuration keys for accessing custom S3-compatible storage.
$ vi kubernetes/conf/core-site.xml
<property>
  <name>fs.s3a.endpoint</name>
  <value>http://my.s3.server.address:9000</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>2000</value>
</property>
<property>
  <name>fs.s3.maxConnections</name>
  <value>2000</value>
</property>
<property>
  <name>mapreduce.input.fileinputformat.list-status.num-threads</name>
  <value>50</value>
</property>
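To quickly test connectivity to the storage server, the user may run the AWS CLI against the custom endpoint (an illustration only; the AWS CLI is not required by MR3, and the bucket name is hypothetical):
$ aws --endpoint-url http://my.s3.server.address:9000 s3 ls s3://your-bucket/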
The user can also set a few configuration keys in kubernetes/conf/hive-site.xml to achieve better performance. Here is an example.
$ vi kubernetes/conf/hive-site.xml
<property>
  <name>hive.exec.input.listing.max.threads</name>
  <value>50</value>
</property>
<property>
  <name>hive.metastore.fshandler.threads</name>
  <value>50</value>
</property>
<property>
  <name>hive.msck.repair.batch.size</name>
  <value>3000</value>
</property>
<property>
  <name>hive.load.dynamic.partitions.thread</name>
  <value>25</value>
</property>
<property>
  <name>hive.mv.files.thread</name>
  <value>40</value>
</property>
To use LLAP I/O when accessing S3, set the configuration key hive.llap.io.use.fileid.path to false in kubernetes/conf/hive-site.xml.
$ vi kubernetes/conf/hive-site.xml
<property>
  <name>hive.llap.io.use.fileid.path</name>
  <value>false</value>
</property>
Accessing S3 with SSL
In order to access S3 with SSL enabled, the user should set the configuration key fs.s3a.connection.ssl.enabled to true in kubernetes/conf/core-site.xml.
$ vi kubernetes/conf/core-site.xml
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
</property>
For accessing custom S3-compatible storage, the address of the storage server should be revised to use HTTPS.
$ vi kubernetes/conf/core-site.xml
<property>
  <name>fs.s3a.endpoint</name>
  <value>https://my.s3.server.address:9000</value>
</property>
Next, the user should make a copy of the certificate file for connecting to the storage server and set MR3_S3_CERTIFICATE in kubernetes/config-run.sh to point to the copy.
$ vi kubernetes/config-run.sh
ENABLE_SSL=true
...
MR3_S3_CERTIFICATE=/home/hive/mr3-run/kubernetes/s3-public.cert
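If a copy of the certificate is not readily at hand, one way to obtain it from the running storage server is with openssl (an illustration only; the pipeline below extracts only the first certificate in the chain):
$ openssl s_client -connect my.s3.server.address:9000 -showcerts </dev/null 2>/dev/null | openssl x509 -outform PEM > s3-public.cert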
Now executing kubernetes/run-hive.sh (with or without the option --generate-truststore) adds the certificate to the KeyStore file kubernetes/key/hivemr3-ssl-certificate.jks, and every component that uses the KeyStore file can access S3. For example, HiveServer2 can access S3 because the configuration key hive.server2.keystore.path points to the KeyStore file.
$ vi kubernetes/conf/hive-site.xml
<property>
  <name>hive.server2.keystore.path</name>
  <value>/opt/mr3-run/key/hivemr3-ssl-certificate.jks</value>
</property>
For DAGAppMaster and ContainerWorkers, the user should check that the Java properties javax.net.ssl.trustStore and javax.net.ssl.trustStoreType are properly set in their command-line options:
$ vi kubernetes/conf/mr3-site.xml
<property>
  <name>mr3.am.launch.cmd-opts</name>
  <value>-server -XX:+UseG1GC -XX:+AggressiveOpts -XX:+UseNUMA -XX:G1ReservePercent=20 -XX:MetaspaceSize=1024m -Djava.net.preferIPv4Stack=true -Dlog4j.configurationFile=k8s-mr3-container-log4j2.properties -Djavax.net.ssl.trustStore=/opt/mr3-run/key/hivemr3-ssl-certificate.jks -Djavax.net.ssl.trustStoreType=jks</value>
</property>
<property>
  <name>mr3.container.launch.cmd-opts</name>
  <value>-server -XX:+UseG1GC -XX:+AggressiveOpts -XX:+UseNUMA -XX:+AlwaysPreTouch -Xss512k -XX:TLABSize=8m -XX:+ResizeTLAB -XX:InitiatingHeapOccupancyPercent=40 -XX:G1ReservePercent=20 -XX:MaxGCPauseMillis=200 -XX:MetaspaceSize=1024m -Djava.net.preferIPv4Stack=true -Dlog4j.configurationFile=k8s-mr3-container-log4j2.properties -Djavax.net.ssl.trustStore=/opt/mr3-run/key/hivemr3-ssl-certificate.jks -Djavax.net.ssl.trustStoreType=jks</value>
</property>
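To verify that the certificate has indeed been added, the user may list the contents of the KeyStore file with keytool (an illustration only; keytool prompts for the KeyStore password):
$ keytool -list -keystore kubernetes/key/hivemr3-ssl-certificate.jks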