This page shows how to use a pre-built Docker image available at DockerHub to operate Hive on MR3 with Minikube. All components (Metastore, HiveServer2, MR3 DAGAppMaster) will run inside Minikube. For Metastore, we will run a MySQL database as a Docker container, although an existing MySQL database can also be used. By following this guide, the user will learn:
- how to start Metastore
- how to run Hive on MR3
- how to create Beeline connections and send queries to HiveServer2 running inside Minikube
This scenario has the following prerequisites:
- A running Minikube cluster should be available.
- The user should be able to execute the command docker (if no MySQL database is available) and the command kubectl.
- A MySQL connector should be available.
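To quickly check the first two prerequisites, the following commands (assuming a standard Minikube installation) should succeed:
$ minikube status
$ kubectl get nodes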
This scenario should take less than 30 minutes to complete, not including the time for downloading a pre-built Docker image.
This page has been tested with MR3 release 1.2 on CentOS 7.5 running Minikube v1.2.0, using user gla.
Installation
Download an MR3 release containing the executable scripts.
$ git clone https://github.com/mr3project/mr3-run-k8s.git
$ cd mr3-run-k8s/kubernetes/
$ git reset --hard 3287a3dcda7cdd4875fcc2b5e345bc9089f000dc
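As a sanity check, the scripts and directories used in the rest of this guide should now be present:
$ ls env.sh run-hive.sh run-metastore.sh conf yaml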
Starting a MySQL database
For simplicity, we will run a MySQL database for Metastore as a Docker container.
$ docker run -d --name mysql-server -p 3306:3306 -e MYSQL_ROOT_PASSWORD=passwd mysql:5.6
$ mysql --user=root --password=passwd --host=127.0.0.1 -e 'show databases;'
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
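Since Metastore running inside Minikube connects to this database at the address given by HIVE_DATABASE_HOST (set in env.sh below), it is worth checking that MySQL is reachable at the IP address of the local machine, not only at 127.0.0.1. A quick check, with 111.111.111.11 standing in for your IP address:
$ mysql --user=root --password=passwd --host=111.111.111.11 -e 'select version();'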
Creating local directories
In our scenario, Hive on MR3 uses four kinds of Pods: Metastore Pod, HiveServer2 Pod, DAGAppMaster Pod, and ContainerWorker Pod.
A Metastore Pod runs a Metastore container, and the user creates a Metastore Pod by executing the script run-metastore.sh.
A HiveServer2 Pod runs a HiveServer2 container, and the user creates a HiveServer2 Pod by executing the script run-hive.sh.
A DAGAppMaster Pod is created by HiveServer2, and a ContainerWorker Pod runs a ContainerWorker container and is created by DAGAppMaster at runtime.
We need to create two new local directories:
- one for a PersistentVolume to be shared by all Pods;
- one for a hostPath volume where ContainerWorker Pods write intermediate data.
Create a local directory for the PersistentVolume.
$ mkdir /data1/gla/workdir
$ chmod 777 /data1/gla/workdir
Create a local directory for the hostPath volume for ContainerWorker Pods.
$ mkdir -p /data1/gla/k8s
$ chmod 777 /data1/gla/k8s
Preparing a MySQL connector
The user should have a MySQL connector compatible with the MySQL database for Metastore. The official JDBC driver for MySQL can be downloaded at https://dev.mysql.com/downloads/connector/j/.
Copy the MySQL connector to the directory lib under the local directory for the PersistentVolume.
$ mkdir -p /data1/gla/workdir/lib
$ cp mysql-connector-java-8.0.12.jar /data1/gla/workdir/lib/
$ chmod 777 /data1/gla/workdir/lib/mysql-connector-java-8.0.12.jar
Configuring Pods
Open yaml/workdir-pv.yaml and create a hostPath PersistentVolume using the directory created in the previous step.
$ vi yaml/workdir-pv.yaml
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Delete
hostPath:
path: "/data1/gla/workdir"
Open yaml/hiveserver2-service.yaml and use the IP address of the local machine.
$ vi yaml/hiveserver2-service.yaml
externalIPs:
- 111.111.111.11 # use your IP address
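If you are unsure of the IP address of the local machine, one way to find it on a typical Linux host is shown below; pick the address reachable from your clients.
$ hostname -I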
Open env.sh and set the following environment variables.
$ vi env.sh
DOCKER_HIVE_IMG=mr3project/hive3:1.2
DOCKER_HIVE_WORKER_IMG=mr3project/hive3:1.2
CREATE_KEYTAB_SECRET=false # do not create a Secret from key/*
HIVE_DATABASE_HOST=111.111.111.11 # use your IP address (where the MySQL database is running)
HIVE_WAREHOUSE_DIR=/opt/mr3-run/work-dir/warehouse
METASTORE_SECURE_MODE=false # disable Kerberos authentication
HIVE_SERVER2_HEAPSIZE=8192
HIVE_SERVER2_AUTHENTICATION=NONE
TOKEN_RENEWAL_HDFS_ENABLED=false
HIVE_METASTORE_HEAPSIZE=8192
HIVE_CLIENT_HEAPSIZE=2048 # heap size for Beeline
Open yaml/metastore.yaml and update the following fields.
- image is set to the pre-built Docker image mr3project/hive3:1.2 available at DockerHub.
- imagePullPolicy is set to IfNotPresent because we download the Docker image from DockerHub.
- args includes "--init-schema" because this is the first time Metastore is run.
- We mount work-dir-volume so that the MySQL connector in the lib subdirectory becomes visible inside the Metastore Pod.
- Change the resources if necessary.
$ vi yaml/metastore.yaml
containers:
- image: mr3project/hive3:1.2
command: ["/opt/mr3-run/hive/metastore-service.sh"]
args: ["start", "--kubernetes", "--init-schema"]
imagePullPolicy: IfNotPresent
resources:
requests:
cpu: 1
memory: 8Gi
limits:
cpu: 1
memory: 8Gi
volumeMounts:
- name: work-dir-volume
mountPath: /opt/mr3-run/work-dir/
- name: work-dir-volume
mountPath: /opt/mr3-run/lib
subPath: lib
Open yaml/hive.yaml and update image and imagePullPolicy so that Minikube reads the pre-built Docker image from DockerHub.
Change the resources if necessary.
$ vi yaml/hive.yaml
containers:
- image: mr3project/hive3:1.2
imagePullPolicy: IfNotPresent
resources:
requests:
cpu: 1
memory: 8Gi
limits:
cpu: 1
memory: 8Gi
Open conf/mr3-site.xml and set the configuration key mr3.k8s.pod.image.pull.policy to IfNotPresent.
Set the configuration key mr3.k8s.pod.worker.hostpaths to the local directory for the hostPath volume where ContainerWorker Pods write intermediate data (not the directory for the PersistentVolume).
$ vi conf/mr3-site.xml
<property>
<name>mr3.k8s.pod.image.pull.policy</name>
<value>IfNotPresent</value>
</property>
<property>
<name>mr3.k8s.pod.worker.hostpaths</name>
<value>/data1/gla/k8s</value>
</property>
Update conf/hive-site.xml.
$ vi conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>passwd</value>
</property>
<property>
<name>hive.metastore.pre.event.listeners</name>
<value></value>
</property>
<property>
<name>metastore.pre.event.listeners</name>
<value></value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
- javax.jdo.option.ConnectionPassword is set to the password of the MySQL database.
- hive.metastore.pre.event.listeners and metastore.pre.event.listeners are set to empty because we do not enable security on the Metastore side.
Update conf/core-site.xml.
$ vi conf/core-site.xml
<property>
<name>hadoop.security.authentication</name>
<value>simple</value>
</property>
hadoop.security.authentication is set to simple in order to disable Kerberos authentication.
Starting Hive on MR3
Before starting Hive on MR3, the user should remove the label node-role.kubernetes.io/master from the minikube node.
This is because Hive on MR3 does not count the resources of master nodes when estimating the resources available for ContainerWorker Pods.
Since the minikube node, the only node in a Minikube cluster, is a master node, we should demote it to an ordinary node in order to secure resources for ContainerWorker Pods.
Thus, in order to be able to create ContainerWorker Pods on the minikube node, the user should execute the following command:
$ kubectl label node minikube node-role.kubernetes.io/master-
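After removing the label, the minikube node should no longer report the master role (the exact output depends on the Kubernetes version):
$ kubectl get node minikube
NAME       STATUS   ROLES    AGE   VERSION
minikube   Ready    <none>   ...   ...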
Before starting Hive on MR3, the user should also make sure that no ConfigMaps exist in the namespace hivemr3. For example, the user may see ConfigMaps left over from a previous run.
$ kubectl get configmaps -n hivemr3
NAME DATA AGE
mr3conf-configmap-master 1 14m
mr3conf-configmap-worker 1 14m
In such a case, manually delete these ConfigMaps.
$ kubectl delete configmap -n hivemr3 mr3conf-configmap-master mr3conf-configmap-worker
In order to run Metastore, the user can execute the script run-metastore.sh.
$ ./run-metastore.sh
...
CLIENT_TO_AM_TOKEN_KEY=0ea834ee-2e5b-4528-a051-7d4b02c9973f
MR3_APPLICATION_ID_TIMESTAMP=15910
MR3_SHARED_SESSION_ID=f6f2c854-b11b-4aed-bf17-fe5e51806610
ATS_SECRET_KEY=2026c87f-94bb-4484-b905-3a5aa81f4f5d
configmap/client-am-config created
statefulset.apps/hivemr3-metastore created
service/metastore created
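Before proceeding, the user can check that the Metastore Pod has started and that the MySQL connector is visible inside the Pod at /opt/mr3-run/lib (mounted via the subPath lib of work-dir-volume):
$ kubectl get pods -n hivemr3
$ kubectl logs -n hivemr3 hivemr3-metastore-0 | tail    # watch schema initialization
$ kubectl exec -n hivemr3 hivemr3-metastore-0 -- ls /opt/mr3-run/lib
mysql-connector-java-8.0.12.jar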
In order to run HiveServer2, the user can execute the script run-hive.sh. The AlreadyExists error on the ConfigMap client-am-config in the output below is harmless: the ConfigMap was already created by run-metastore.sh.
$ ./run-hive.sh
...
CLIENT_TO_AM_TOKEN_KEY=96c4de14-4db9-4e95-9fc7-a8545f165dbb
MR3_APPLICATION_ID_TIMESTAMP=28302
MR3_SHARED_SESSION_ID=287286f0-bf63-4ad7-9a27-672d01f4d230
ATS_SECRET_KEY=c54e3952-1546-4236-8f55-26b14ecaf0ff
Error from server (AlreadyExists): configmaps "client-am-config" already exists
replicationcontroller/hivemr3-hiveserver2 created
service/hiveserver2 created
These scripts mount the following files inside the Metastore and HiveServer2 Pods:
- env.sh
- conf/*
In this way, the user can completely specify the behavior of Metastore and HiveServer2 as well as DAGAppMaster and ContainerWorkers.
Executing the script run-hive.sh starts a HiveServer2 Pod and, shortly afterwards, a DAGAppMaster Pod.
HiveServer2 and DAGAppMaster become ready after a while.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-c8gwq 1/1 Running 0 90s
hivemr3-metastore-0 1/1 Running 0 119s
mr3master-5910-0-tbxfm 1/1 Running 0 74s
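If HiveServer2 does not become ready, its logs are the first place to look:
$ kubectl logs -n hivemr3 hivemr3-hiveserver2-c8gwq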
Running Beeline
Download a sample dataset and copy it to the directory for the PersistentVolume.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv
$ cp pokemon.csv /data1/gla/workdir
$ chmod 777 /data1/gla/workdir/pokemon.csv
The user can verify that the sample dataset is accessible inside the HiveServer2 Pod.
$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-c8gwq -- /bin/bash
root@hivemr3-hiveserver2-c8gwq:/opt/mr3-run/hive# ls /opt/mr3-run/work-dir/pokemon.csv
/opt/mr3-run/work-dir/pokemon.csv
root@hivemr3-hiveserver2-c8gwq:/opt/mr3-run/hive# exit
The user may use any client program to connect to HiveServer2. In our example, we run Beeline inside the HiveServer2 Pod.
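As an aside, since yaml/hiveserver2-service.yaml exposes HiveServer2 at the external IP address configured earlier, a client outside Minikube can also connect. A minimal sketch, assuming that Beeline is installed on the client machine and that the service exposes the same Thrift port 9852 that appears in the Beeline output below:
# Replace 111.111.111.11 with the external IP address set in yaml/hiveserver2-service.yaml.
$ beeline -u "jdbc:hive2://111.111.111.11:9852/" -n root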
$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-c8gwq -- /bin/bash
root@hivemr3-hiveserver2-c8gwq:/opt/mr3-run/hive# export USER=root
root@hivemr3-hiveserver2-c8gwq:/opt/mr3-run/hive# /opt/mr3-run/hive/run-beeline.sh
...
Connecting to jdbc:hive2://hivemr3-hiveserver2-c8gwq:9852/;;;
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985>
Use the default database.
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> show databases;
...
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (2.131 seconds)
Create a table called pokemon.
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");
Import the sample dataset.
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> load data local inpath '/opt/mr3-run/work-dir/pokemon.csv' INTO table pokemon;
Execute queries.
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> select avg(HP) from pokemon;
...
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> select COUNT(name), power_rate from pokemon1 group by power_rate;
...
+------+-------------+
| _c0 | power_rate |
+------+-------------+
| 363 | strong |
| 336 | weak |
| 108 | moderate |
+------+-------------+
3 rows selected (1.363 seconds)
Now we see that new ContainerWorker Pods have been created.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-c8gwq 1/1 Running 0 5m47s
hivemr3-metastore-0 1/1 Running 0 6m16s
mr3master-5910-0-tbxfm 1/1 Running 0 5m31s
mr3worker-235c-1 1/1 Running 0 64s
mr3worker-235c-2 1/1 Running 0 23s
The user can find the warehouse directory /data1/gla/workdir/warehouse/.
$ ls /data1/gla/workdir/warehouse
pokemon pokemon1
Stopping Hive on MR3
Delete the ReplicationController for HiveServer2.
$ kubectl -n hivemr3 delete replicationcontroller hivemr3-hiveserver2
replicationcontroller "hivemr3-hiveserver2" deleted
Deleting the ReplicationController for HiveServer2 does not automatically terminate the DAGAppMaster Pod.
This is a feature, not a bug, owing to the support for high availability in Hive on MR3.
(After setting the environment variable MR3_APPLICATION_ID_TIMESTAMP properly, running run-hive.sh attaches the existing DAGAppMaster Pod to the new HiveServer2 Pod.)
Delete the ReplicationController for DAGAppMaster.
$ kubectl delete replicationcontroller -n hivemr3 mr3master-5910-0
replicationcontroller "mr3master-5910-0" deleted
Deleting the DAGAppMaster Pod automatically deletes all ContainerWorker Pods as well.
Delete the StatefulSet for Metastore.
$ kubectl -n hivemr3 delete statefulset hivemr3-metastore
statefulset.apps "hivemr3-metastore" deleted
After a while, no Pods should be running in the namespace hivemr3.
To delete all remaining resources, execute the following commands:
$ kubectl -n hivemr3 delete configmap --all
$ kubectl -n hivemr3 delete svc --all
$ kubectl -n hivemr3 delete secret --all
$ kubectl -n hivemr3 delete serviceaccount --all
$ kubectl -n hivemr3 delete role --all
$ kubectl -n hivemr3 delete rolebinding --all
$ kubectl delete clusterrole node-reader
$ kubectl delete clusterrolebinding hive-clusterrole-binding
$ kubectl -n hivemr3 delete persistentvolumeclaims workdir-pvc
$ kubectl delete persistentvolumes workdir-pv
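Finally, if you started the MySQL container at the beginning of this guide, stop and remove it as well:
$ docker stop mysql-server
$ docker rm mysql-server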