This page shows how to use Helm to operate Hive on MR3 with Minikube. All components (Metastore, HiveServer2, MR3 DAGAppMaster) will be running inside Minikube. For Metastore, we will run a MySQL database as a Pod inside Minikube. By following the instructions, the user will learn:
- how to start Metastore using Helm
- how to use Helm to run Hive on MR3 with Minikube
- how to create Beeline connections and send queries to HiveServer2 running inside Minikube
This scenario has the following prerequisites:
- A running Minikube cluster should be available.
- The user should be able to execute: 1) the command docker so as to build Docker images; 2) the command kubectl so as to start Pods; 3) the command helm to use Helm.
- A MySQL connector should be available.
This scenario should take less than 30 minutes to complete,
not including the time for downloading a Hadoop binary distribution and an MR3 release.
This page has been tested with MR3 release 1.2 on CentOS 7.5 running Minikube v1.2.0, using user gla.
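As a quick check of these prerequisites, the following commands should all succeed (a sketch; version numbers and output will differ on your machine):
$ minikube status
$ kubectl get nodes
$ helm version
$ sudo docker version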
Installation
Download a Hadoop binary distribution and uncompress it. For Hive 3 and earlier, Hadoop 2.7.7 works okay.
$ wget http://apache.tt.co.kr/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
$ gunzip -c hadoop-2.7.7.tar.gz | tar xvf -
Download a pre-built MR3 release and uncompress it.
Below we choose the pre-built MR3 release based on Hive 3.1.2, which corresponds to the --hivesrc3 option to be used later.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.2/hivemr3-1.2-hive3.1.2.tar.gz
$ gunzip -c hivemr3-1.2-hive3.1.2.tar.gz | tar xvf -;
$ cd hivemr3-1.2-hive3.1.2
Set the environment variable JAVA_HOME if necessary.
Update the environment variable HADOOP_HOME in env.sh so that it points to the installation directory of the Hadoop binary distribution.
$ vi env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/data1/gla/hadoop-2.7.7
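To confirm that env.sh points to a working Java and Hadoop installation, the user may source it and run a quick check (a sketch, assuming the paths set above):
$ source env.sh
$ $JAVA_HOME/bin/java -version
$ $HADOOP_HOME/bin/hadoop version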
Building a Docker image
HIVE_MYSQL_DRIVER in env.sh (not kubernetes/env.sh) should point to a MySQL connector jar file compatible with the MySQL database for Metastore.
One can download the official JDBC driver for MySQL at https://dev.mysql.com/downloads/connector/j/.
$ vi env.sh
HIVE_MYSQL_DRIVER=/data1/gla/mysql-connector-java-8.0.12.jar
Collect all necessary files for running Hive on MR3 in the directory kubernetes/hive by executing build-k8s.sh.
$ ./build-k8s.sh --hivesrc3
$ ls kubernetes/hive/hadoop/apache-hadoop/
bin etc lib libexec share
$ ls kubernetes/hive/hive/apache-hive/
bin conf hcatalog lib scripts
Open kubernetes/env.sh and set DOCKER_HIVE_IMG so that Minikube reads the Docker image from the local machine.
$ vi kubernetes/env.sh
DOCKER_HIVE_IMG=hive3
Edit kubernetes/build-hive.sh so that the Docker image is kept on the local machine, i.e., it is built but not pushed to a remote registry.
$ vi kubernetes/build-hive.sh
sudo docker build -t $DOCKER_HIVE_IMG -f $DOCKER_HIVE_FILE .
# sudo docker push $DOCKER_HIVE_IMG
Run kubernetes/build-hive.sh to build a Docker image.
$ kubernetes/build-hive.sh
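After the build completes, the user can check that the image is available on the local machine (hive3 is the image name set in kubernetes/env.sh above):
$ sudo docker images hive3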
Starting a MySQL database
For simplicity, we will run a MySQL database for Metastore as a Pod inside Minikube.
$ helm install --name mysql --namespace hivemr3 stable/mysql
NAME: mysql
LAST DEPLOYED: Tue Oct 27 21:12:00 2020
NAMESPACE: hivemr3
STATUS: DEPLOYED
...
NOTES:
MySQL can be accessed via port 3306 on the following DNS name from within your cluster:
mysql.hivemr3.svc.cluster.local
...
mysql.hivemr3.svc.cluster.local is the address (FQDN) of the MySQL database.
Retrieve the root password as follows:
$ kubectl get secret --namespace hivemr3 mysql -o jsonpath="{.data.mysql-root-password}" | base64 --decode; echo
Cn3GwuCC6N
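To verify that the MySQL database is reachable from inside the cluster, the user may start a temporary MySQL client Pod (a sketch; the image tag is an assumption, and the password is the one retrieved above):
$ kubectl run -it --rm mysql-client --image=mysql:5.7 --restart=Never -n hivemr3 -- \
    mysql -h mysql.hivemr3.svc.cluster.local -uroot -pCn3GwuCC6N -e "show databases;"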
Linking configuration files
We will reuse the configuration files in kubernetes/conf/ (and keys in kubernetes/key if Kerberos is used for authentication).
Create symbolic links.
$ ln -s $(pwd)/kubernetes/conf/ kubernetes/helm/hive/conf
$ ln -s $(pwd)/kubernetes/key/ kubernetes/helm/hive/key
Now any change to the configuration files in kubernetes/conf/ is honored when running Hive on MR3.
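The user can verify that the symbolic links were created and point to the right directories:
$ ls -l kubernetes/helm/hive/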
Creating local directories
We need to create two new local directories:
- for a PersistentVolume to be shared by Pods;
- for a hostPath volume for ContainerWorker Pods.
Create a local directory for the PersistentVolume.
$ mkdir /home/gla/workdir
$ chmod 777 /home/gla/workdir
Hive on MR3 uses local disks for writing intermediate data. When running on Kubernetes, it mounts hostPath volumes that map to directories on the local machine. For our example, we create a local directory for the hostPath volume for ContainerWorker Pods.
$ mkdir -p /data1/gla/k8s
Configuring Pods
Create kubernetes/helm/hive/values-minikube.yaml, which is a collection of values to override those in kubernetes/helm/hive/values.yaml.
$ vi kubernetes/helm/hive/values-minikube.yaml
docker:
image: hive3
imagePullPolicy: Never
create:
metastore: true
metastore:
databaseHost: mysql.hivemr3.svc.cluster.local
warehouseDir: file:///opt/mr3-run/work-dir/warehouse
initSchema: true
mountLib: false
secureMode: false
resources:
requests:
cpu: 1
memory: 4Gi
limits:
cpu: 1
memory: 4Gi
heapSize: 4096
hive:
externalIp: 12.34.56.78 # use your IP address
authentication: NONE
resources:
requests:
cpu: 1
memory: 8Gi
limits:
cpu: 1
memory: 8Gi
heapSize: 8192
workDir:
isNfs: false
volumeStr: "hostPath:\n path: /home/gla/workdir"
- docker.imagePullPolicy is set to Never because we use the Docker image on the local machine.
- create.metastore is set to true because we will create a Metastore Pod.
- metastore.databaseHost is set to the address (FQDN) of the MySQL database.
- metastore.mountLib is set to false because the Docker image already contains a MySQL connector.
- hive.externalIp is set to the public IP address of the local machine.
- workDir.volumeStr is set to the path to the local directory for the PersistentVolume.
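To review every value that can be overridden in this way, the user may print the chart's default values (helm inspect values is the Helm 2 form; Helm 3 uses helm show values):
$ helm inspect values kubernetes/helm/hive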
Update kubernetes/conf/mr3-site.xml.
$ vi kubernetes/conf/mr3-site.xml
<property>
<name>mr3.k8s.pod.image.pull.policy</name>
<value>Never</value>
</property>
<property>
<name>mr3.k8s.pod.worker.hostpaths</name>
<value>/data1/gla/k8s/</value>
</property>
- mr3.k8s.pod.image.pull.policy is set to Never because we use the Docker image on the local machine.
- mr3.k8s.pod.worker.hostpaths is set to the path to the local directory for the hostPath volume.
Update kubernetes/conf/hive-site.xml.
$ vi kubernetes/conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>Cn3GwuCC6N</value>
</property>
<property>
<name>hive.metastore.pre.event.listeners</name>
<value></value>
</property>
<property>
<name>metastore.pre.event.listeners</name>
<value></value>
</property>
<property>
<name>hive.security.authenticator.manager</name>
<value>org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator</value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
- javax.jdo.option.ConnectionPassword is set to the password of the MySQL database.
- hive.metastore.pre.event.listeners and metastore.pre.event.listeners are set to empty because we do not enable security on the Metastore side.
Update kubernetes/conf/core-site.xml.
$ vi kubernetes/conf/core-site.xml
<property>
<name>hadoop.security.authentication</name>
<value>simple</value>
</property>
- hadoop.security.authentication is set to simple in order to disable Kerberos for authentication.
Starting Hive on MR3
Before running HiveServer2, the user should remove the label node-role.kubernetes.io/master from the minikube node.
This is because Hive on MR3 does not count the resources of master nodes when estimating the resources for ContainerWorker Pods.
Since the minikube node, the only node in a Minikube cluster, is a master node, we should demote it to an ordinary node in order to secure resources for ContainerWorker Pods.
Thus, in order to be able to create ContainerWorker Pods on the minikube node, the user should execute the following command:
$ kubectl label node minikube node-role.kubernetes.io/master-
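To confirm that the label has been removed, list the node labels; node-role.kubernetes.io/master should no longer appear for the minikube node:
$ kubectl get nodes --show-labels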
Before running HiveServer2, the user should also make sure that no ConfigMaps and Services exist in the namespace hivemr3.
For example, the user may see ConfigMaps and Services left over from a previous run.
$ kubectl get configmaps -n hivemr3
NAME DATA AGE
mr3conf-configmap-master 1 16m
mr3conf-configmap-worker 1 16m
$ kubectl get svc -n hivemr3
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service-master-1237-0 ClusterIP 10.105.238.21 <none> 80/TCP 11m
service-worker ClusterIP None <none> <none> 11m
In such a case, manually delete these ConfigMaps and Services.
$ kubectl delete configmap -n hivemr3 mr3conf-configmap-master mr3conf-configmap-worker
$ kubectl delete svc -n hivemr3 service-master-1237-0 service-worker
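Optionally, before the actual installation, the user can render the chart without creating any objects to check that the values file and the configuration files are picked up as intended (a Helm 2 sketch):
$ helm install --dry-run --debug --namespace hivemr3 kubernetes/helm/hive -f kubernetes/helm/hive/values-minikube.yaml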
Install the Helm chart for Hive on MR3 with values-minikube.yaml. We use hivemr3 for the namespace.
$ helm install --namespace hivemr3 kubernetes/helm/hive -f kubernetes/helm/hive/values-minikube.yaml
NAME: callous-buffalo
LAST DEPLOYED: Tue Oct 27 21:16:14 2020
NAMESPACE: hivemr3
STATUS: DEPLOYED
...
==> v1/ConfigMap
NAME DATA AGE
client-am-config 4 1s
env-configmap 1 1s
hivemr3-conf-configmap 18 1s
...
Check if all ConfigMaps are non-empty.
If the DATA column for hivemr3-conf-configmap is 0, try to remove unnecessary files in the directory kubernetes/conf or kubernetes/helm/hive/conf.
The user can find four Pods running in the Minikube cluster:
1) MySQL database; 2) Metastore; 3) HiveServer2; 4) MR3 DAGAppMaster.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-drbbv 0/1 Running 0 40s
hivemr3-metastore-0 1/1 Running 0 40s
mr3master-1860-0-9xht2 0/1 Running 0 8s
mysql-8569cdf6fc-qzdkx 1/1 Running 0 4m54s
The HiveServer2 Pod hivemr3-hiveserver2-drbbv soon becomes ready after the readiness probe contacts it.
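The user can watch the Pods until HiveServer2 becomes ready, or follow its log to see the startup progress (the Pod name suffix differs in each run):
$ kubectl get pods -n hivemr3 -w
$ kubectl logs -n hivemr3 hivemr3-hiveserver2-drbbv -f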
Running Beeline
Download a sample dataset and copy it to the directory for the PersistentVolume.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv
$ cp pokemon.csv /home/gla/workdir
$ chmod 777 /home/gla/workdir/pokemon.csv
The user can verify that the sample dataset is accessible inside the HiveServer2 Pod.
$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-drbbv -- /bin/bash
root@hivemr3-hiveserver2-drbbv:/opt/mr3-run/hive# ls /opt/mr3-run/work-dir/
2beef7a1-5f54-401e-a78d-58b97ebcdaad_resources 48fd0bd5-65f4-4e41-aed8-b02d675b9a8f_resources pokemon.csv root
root@hivemr3-hiveserver2-drbbv:/opt/mr3-run/hive# exit
While the user may use any client program to connect to HiveServer2, the MR3 release provides a script kubernetes/hive/hive/run-beeline.sh which slightly simplifies the process of configuring Beeline.
Copy the file kubernetes/env.sh and the directory kubernetes/conf to kubernetes/hive/.
$ cp kubernetes/env.sh kubernetes/hive/
$ cp -r kubernetes/conf kubernetes/hive
Set the host for HiveServer2 in kubernetes/hive/env.sh using the value of the field hive.externalIp in kubernetes/helm/hive/values-minikube.yaml, and set the configuration key HIVE_SERVER2_AUTHENTICATION to NONE in order not to use Kerberos authentication.
Set the environment variable JAVA_HOME if necessary.
$ vi kubernetes/hive/env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/
export PATH=$JAVA_HOME/bin:$PATH
HIVE_SERVER2_HOST=12.34.56.78 # use your IP address
HIVE_SERVER2_AUTHENTICATION=NONE
In order to start a Beeline connection, execute kubernetes/hive/hive/run-beeline.sh.
$ kubernetes/hive/hive/run-beeline.sh
...
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://12.34.56.78:9852/>
Alternatively, the user can run Beeline inside the HiveServer2 Pod.
$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-drbbv -- /bin/bash
root@hivemr3-hiveserver2-drbbv:/opt/mr3-run/hive# export USER=root
root@hivemr3-hiveserver2-drbbv:/opt/mr3-run/hive# /opt/mr3-run/hive/run-beeline.sh
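Any other JDBC client can connect to HiveServer2 as well. A connection URL of the following form should work with a stock Beeline (a sketch; 9852 is the port shown in the session above, and extra URL options may be required depending on the client setup):
$ beeline -u "jdbc:hive2://12.34.56.78:9852/"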
Execute queries.
0: jdbc:hive2://12.34.56.78:9852/> show databases;
...
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (0.119 seconds)
0: jdbc:hive2://12.34.56.78:9852/> use default;
...
No rows affected (0.031 seconds)
0: jdbc:hive2://12.34.56.78:9852/> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");
...
No rows affected (0.494 seconds)
0: jdbc:hive2://12.34.56.78:9852/> load data local inpath '/opt/mr3-run/work-dir/pokemon.csv' INTO table pokemon;
...
No rows affected (0.63 seconds)
0: jdbc:hive2://12.34.56.78:9852/> select avg(HP) from pokemon;
...
+---------------------+
| _c0 |
+---------------------+
| 144.84882280049567 |
+---------------------+
1 row selected (11.241 seconds)
0: jdbc:hive2://12.34.56.78:9852/> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...
0: jdbc:hive2://12.34.56.78:9852/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...
+------+-------------+
| _c0 | power_rate |
+------+-------------+
| 363 | strong |
| 336 | weak |
| 108 | moderate |
+------+-------------+
3 rows selected (2.484 seconds)
The user can see that ContainerWorker Pods have been created.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-drbbv 1/1 Running 0 7m34s
hivemr3-metastore-0 1/1 Running 0 7m34s
mr3master-1860-0-9xht2 1/1 Running 0 7m2s
mr3worker-cade-1 1/1 Running 0 46s
mr3worker-cade-2 1/1 Running 0 14s
mysql-8569cdf6fc-qzdkx 1/1 Running 0 11m
The user can find the warehouse directory /home/gla/workdir/warehouse/.
$ ls /home/gla/workdir/warehouse
pokemon pokemon1
Terminating Hive on MR3
In order to terminate Hive on MR3, the user should first delete the DAGAppMaster Pod and then delete the Helm chart, not the other way around. This is because deleting the Helm chart revokes the ServiceAccount object that DAGAppMaster uses to delete ContainerWorker Pods. Hence, if the user deletes the Helm chart first, all remaining Pods should be deleted manually.
Delete the ReplicationController for DAGAppMaster, which in turn deletes all ContainerWorker Pods automatically.
$ kubectl get replicationcontroller -n hivemr3
NAME DESIRED CURRENT READY AGE
hivemr3-hiveserver2 1 1 1 8m1s
mr3master-1860-0 1 1 1 7m29s
$ kubectl -n hivemr3 delete replicationcontroller mr3master-1860-0
replicationcontroller "mr3master-1860-0" deleted
Delete the Helm chart.
$ helm delete callous-buffalo
release "callous-buffalo" deleted
After a while, the user can see that only the MySQL database Pod remains.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
mysql-8569cdf6fc-qzdkx 1/1 Running 0 13m
Then stop the MySQL database.
$ helm delete --purge mysql
As the last step, the user will find that the following objects belonging to the namespace hivemr3 are still alive:
- two ConfigMaps mr3conf-configmap-master and mr3conf-configmap-worker
- a Service for DAGAppMaster, e.g., service-master-6910-0
- a Service service-worker
$ kubectl get configmaps -n hivemr3
NAME DATA AGE
mr3conf-configmap-master 1 8m54s
mr3conf-configmap-worker 1 8m46s
$ kubectl get svc -n hivemr3
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service-master-1860-0 ClusterIP 10.105.70.80 <none> 80/TCP 9m4s
service-worker ClusterIP None <none> <none> 9m
These ConfigMaps and Services are not deleted by the command helm delete because they are created not by Helm but by HiveServer2 and DAGAppMaster.
Hence the user should delete these ConfigMaps and Services manually.
$ kubectl delete configmap -n hivemr3 mr3conf-configmap-master mr3conf-configmap-worker
configmap "mr3conf-configmap-master" deleted
configmap "mr3conf-configmap-worker" deleted
$ kubectl delete svc -n hivemr3 service-master-1860-0 service-worker
service "service-master-1860-0" deleted
service "service-worker" deleted