This page shows how to operate Hive on MR3 with Minikube on a single machine. HiveServer2 and MR3 DAGAppMaster will be running inside Minikube, whereas Metastore will be running outside Minikube (i.e., as a process on the local machine) so as to simulate an environment in which Hive accesses a remote data source. By following these instructions, the user will learn:
- how to install Hive on MR3 on a single machine
- how to start Metastore with a Derby database
- how to run Hive on MR3 using Minikube
- how to create Beeline connections and send queries to HiveServer2 running inside Minikube
This scenario has the following prerequisites:
- A running Minikube cluster should be available.
- The user should be able to execute: 1) the command docker so as to build Docker images; 2) the command kubectl so as to start Pods.
This scenario should take less than 30 minutes to complete,
not including the time for downloading a Hadoop binary distribution and an MR3 release.
This page has been tested with MR3 release 1.2 on CentOS 7.5 running Minikube v1.2.0, using user gla.
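Before starting, it may help to confirm that the Minikube cluster is up and reachable from kubectl. A minimal check (the exact output depends on the Minikube version):
$ minikube status
$ kubectl get nodes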
Installation
Download a Hadoop binary distribution and uncompress it. For Hive 3 and earlier, Hadoop 2.7.7 works okay.
$ wget http://apache.tt.co.kr/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
$ gunzip -c hadoop-2.7.7.tar.gz | tar xvf -
Download a pre-built MR3 release and uncompress it.
Below we choose the pre-built MR3 release based on Hive 3.1.2, which corresponds to the --hivesrc3 option to be used later.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.2/hivemr3-1.2-hive3.1.2.tar.gz
$ gunzip -c hivemr3-1.2-hive3.1.2.tar.gz | tar xvf -
$ cd hivemr3-1.2-hive3.1.2
Set the environment variable JAVA_HOME if necessary. Update the environment variable HADOOP_HOME in env.sh so that it points to the installation directory of the Hadoop binary distribution.
$ vi env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/data1/gla/hadoop-2.7.7
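As a quick sanity check (not part of the official steps), one can source env.sh and ask the Hadoop binary for its version; if JAVA_HOME or HADOOP_HOME is set incorrectly, this command fails immediately.
$ source env.sh
$ $HADOOP_HOME/bin/hadoop version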
Building a Docker image
Collect all necessary files for running Hive on MR3 in the directory kubernetes/hive by executing build-k8s.sh.
$ ./build-k8s.sh --hivesrc3
$ ls kubernetes/hive/hadoop/apache-hadoop/
bin etc lib libexec share
$ ls kubernetes/hive/hive/apache-hive/
bin conf hcatalog lib
Open kubernetes/env.sh and set DOCKER_HIVE_IMG so that Minikube reads the Docker image from the local machine.
$ vi kubernetes/env.sh
DOCKER_HIVE_IMG=hive3
Edit kubernetes/build-hive.sh so that the Docker image is kept on the local machine instead of being pushed to a remote registry (i.e., comment out the docker push command).
$ vi kubernetes/build-hive.sh
sudo docker build -t $DOCKER_HIVE_IMG -f $DOCKER_HIVE_FILE .
# sudo docker push $DOCKER_HIVE_IMG
Run kubernetes/build-hive.sh to build a Docker image.
$ kubernetes/build-hive.sh
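Since the push command is commented out, the image now resides only in the local Docker daemon. Minikube can read it directly when it shares the host's Docker daemon (e.g., when started with --vm-driver=none); otherwise, one would point the Docker client at Minikube's daemon before building, for example with eval $(minikube docker-env). To confirm that the image was built:
$ sudo docker images hive3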
Running Metastore
We will run Metastore outside Minikube with a Derby database, using the --local option.
Open conf/local/hive3/hive-site.xml and set the configuration key hive.metastore.warehouse.dir as follows:
$ vi conf/local/hive3/hive-site.xml
<property>
<name>hive.metastore.warehouse.dir</name>
<value>file:///opt/mr3-run/work-dir/warehouse</value>
</property>
Here /opt/mr3-run/work-dir is the directory where a PersistentVolume will be mounted inside all Pods. The user should also have write permission on the same directory /opt/mr3-run/work-dir outside Minikube.
$ ls -alt /opt/mr3-run/work-dir
total 8
drwxrwxrwx 2 root root 4096 Oct 27 15:45 .
drwxr-xr-x 3 root root 4096 Jul 23 2019 ..
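If the directory does not exist yet, create it and grant write permission (a sketch assuming sudo access on the local machine):
$ sudo mkdir -p /opt/mr3-run/work-dir
$ sudo chmod 777 /opt/mr3-run/work-dir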
Run hive/metastore-service.sh to start Metastore.
$ hive/metastore-service.sh start --local --hivesrc3 --init-schema
# Running Metastore using Hive-MR3 (3.1.2) #
Output Directory:
/data1/gla/hivemr3-1.2-hive3.1.2/hive/metastore-service-result/hive-mr3-5ba3d48-2020-10-27-15-46-43-3959b523
Starting Metastore...
Output Directory:
/data1/gla/hivemr3-1.2-hive3.1.2/hive/metastore-service-result/hive-mr3-5ba3d48-2020-10-27-15-46-43-3959b523
After a while, check if Metastore has successfully started.
$ cat /data1/gla/hivemr3-1.2-hive3.1.2/hive/metastore-service-result/hive-mr3-5ba3d48-2020-10-27-15-46-43-3959b523/out-metastore.txt
Initialization script completed
schemaTool completed
2020-10-27 15:46:49: Starting Hive Metastore Server
...
Check the log file for Metastore.
$ ls -alt /tmp/gla/hive.log
-rw-rw-r-- 1 gla gla 79340 Oct 27 15:49 /tmp/gla/hive.log
Check the database directory for Metastore.
$ ls hive/hive-local-data/metastore5/hive3mr3/
dbex.lck db.lck log README_DO_NOT_TOUCH_FILES.txt seg0 service.properties tmp
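One can also confirm that Metastore is listening on its local port, i.e., the port given by HIVE3_METASTORE_LOCAL_PORT in env.sh (9831 in this walkthrough); here we assume the ss utility is available:
$ ss -tlnp | grep 9831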
Configuring Pods
By default, Hive on MR3 creates three kinds of Pods: HiveServer2 Pod, DAGAppMaster Pod, and ContainerWorker Pod. A HiveServer2 Pod runs a HiveServer2 container, and the user creates it by executing the script kubernetes/run-hive.sh. A DAGAppMaster Pod is created by HiveServer2, and a ContainerWorker Pod runs a ContainerWorker container and is created by DAGAppMaster at runtime.
Create a directory for the PersistentVolume to be shared by all Pods.
$ mkdir -p /home/gla/workdir
$ chmod 777 /home/gla/workdir
Open kubernetes/yaml/workdir-pv.yaml and create a hostPath PersistentVolume using the directory created in the previous step.
$ vi kubernetes/yaml/workdir-pv.yaml
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Delete
hostPath:
path: "/home/gla/workdir"
Open kubernetes/yaml/hiveserver2-service.yaml and use the IP address of the local machine.
$ vi kubernetes/yaml/hiveserver2-service.yaml
externalIPs:
- 111.111.111.11 # use your IP address
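If unsure of the IP address of the local machine, hostname -I (on Linux) prints the addresses assigned to the host; pick the address that clients of HiveServer2 can reach:
$ hostname -I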
Open kubernetes/yaml/hive.yaml and update image and imagePullPolicy so that Minikube reads the Docker image from the local machine.
$ vi kubernetes/yaml/hive.yaml
# - image: 10.1.91.17:5000/hive3:latest
- image: hive3:latest
# imagePullPolicy: Always
imagePullPolicy: Never
Change the resources for HiveServer2 if necessary.
$ vi kubernetes/yaml/hive.yaml
resources:
requests:
cpu: 1
memory: 16Gi
limits:
cpu: 1
memory: 16Gi
Open kubernetes/env.sh (not env.sh in the installation directory) and set the following environment variables.
$ vi kubernetes/env.sh
CREATE_KEYTAB_SECRET=false # do not create a Secret from key/*
HIVE_DATABASE_HOST=111.111.111.11 # use your IP address
HIVE_METASTORE_HOST=111.111.111.11 # use your IP address
HIVE_METASTORE_PORT=9831 # 9831 is from HIVE3_METASTORE_LOCAL_PORT in env.sh
HIVE_WAREHOUSE_DIR=/opt/mr3-run/work-dir/warehouse
METASTORE_SECURE_MODE=false # disable Kerberos authentication
HIVE_SERVER2_HEAPSIZE=16384 # no larger than resources.limits.memory in kubernetes/yaml/hive.yaml
HIVE_SERVER2_AUTHENTICATION=NONE
TOKEN_RENEWAL_HDFS_ENABLED=false
HIVE_CLIENT_HEAPSIZE=2048 # heap size for Beeline
Open kubernetes/conf/core-site.xml and set the configuration key hadoop.security.authentication to simple to disable Kerberos authentication.
$ vi kubernetes/conf/core-site.xml
<property>
<name>hadoop.security.authentication</name>
<value>simple</value>
</property>
Open kubernetes/conf/mr3-site.xml and set the configuration key mr3.k8s.pod.image.pull.policy to Never so that Minikube reads the Docker image from the local machine when creating DAGAppMaster and ContainerWorker Pods.
$ vi kubernetes/conf/mr3-site.xml
<property>
<name>mr3.k8s.pod.image.pull.policy</name>
<value>Never</value>
</property>
Hive on MR3 uses local disks for writing intermediate data. When running on Kubernetes, it uses hostPath volumes to mount directories of the local machine inside ContainerWorker Pods. In our example, we create two directories, each of which resides on its own local disk.
$ mkdir -p /data1/gla/k8s/
$ mkdir -p /data2/gla/k8s/
Then we set the configuration key mr3.k8s.pod.worker.hostpaths in kubernetes/conf/mr3-site.xml to these directories.
$ vi kubernetes/conf/mr3-site.xml
<property>
<name>mr3.k8s.pod.worker.hostpaths</name>
<value>/data1/gla/k8s/,/data2/gla/k8s/</value>
</property>
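Since ContainerWorkers write intermediate data to these directories, they should be writable by the user running inside the Pods. Granting open permissions is a simple, if permissive, choice for a test setup (an assumption of this walkthrough):
$ chmod 777 /data1/gla/k8s/ /data2/gla/k8s/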
Set the configuration keys hive.security.authenticator.manager and hive.security.authorization.manager in kubernetes/conf/hive-site.xml.
$ vi kubernetes/conf/hive-site.xml
<property>
<name>hive.security.authenticator.manager</name>
<value>org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator</value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
Running HiveServer2
Before running HiveServer2, the user should remove the label node-role.kubernetes.io/master from the minikube node. This is because Hive on MR3 does not count the resources of master nodes when estimating the resources available for ContainerWorker Pods. Since the minikube node, the only node in a Minikube cluster, is a master node, we should demote it to an ordinary node in order to secure resources for ContainerWorker Pods. To this end, the user should execute the following command:
$ kubectl label node minikube node-role.kubernetes.io/master-
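To verify that the label has been removed, list the labels of the minikube node and check that node-role.kubernetes.io/master no longer appears:
$ kubectl get node minikube --show-labels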
In order to run HiveServer2, the user can execute the script kubernetes/run-hive.sh.
$ kubernetes/run-hive.sh
...
CLIENT_TO_AM_TOKEN_KEY=d9b08003-c240-4c36-bd64-464bee69cc4d
MR3_APPLICATION_ID_TIMESTAMP=29536
MR3_SHARED_SESSION_ID=4f6bb998-6efc-4d0f-9482-85370b982f1f
ATS_SECRET_KEY=94eb710f-8bc9-41c5-bb92-ffa76aa82031
configmap/client-am-config created
replicationcontroller/hivemr3-hiveserver2 created
service/hiveserver2 created
The script mounts the following files inside the HiveServer2 Pod:
kubernetes/env.sh
kubernetes/conf/*
In this way, the user can completely specify the behavior of HiveServer2 as well as DAGAppMaster and ContainerWorkers.
Executing the script kubernetes/run-hive.sh starts a HiveServer2 Pod and a DAGAppMaster Pod in a moment.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-tl5h2 0/1 Running 0 29s
mr3master-9536-0-pxs2v 0/1 Running 0 13s
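Both Pods may take a short while to reach READY 1/1. One can watch their progress, and inspect the HiveServer2 log if the Pod stays unready (the Pod name below is the one from the example output above):
$ kubectl get pods -n hivemr3 -w
$ kubectl logs -n hivemr3 hivemr3-hiveserver2-tl5h2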
Running Beeline
Download a sample dataset and copy it to the directory for the PersistentVolume.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv
$ cp pokemon.csv /home/gla/workdir
$ chmod 777 /home/gla/workdir/pokemon.csv
The user can verify that the sample dataset is accessible inside the HiveServer2 Pod.
$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-tl5h2 -- /bin/bash
root@hivemr3-hiveserver2-tl5h2:/opt/mr3-run/hive# ls /opt/mr3-run/work-dir/pokemon.csv
/opt/mr3-run/work-dir/pokemon.csv
root@hivemr3-hiveserver2-tl5h2:/opt/mr3-run/hive# exit
While the user may use any client program to connect to HiveServer2, the MR3 release provides a script kubernetes/hive/hive/run-beeline.sh which slightly simplifies the process of configuring Beeline.
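For reference, a stock Beeline can connect directly with a JDBC URL; a sketch, assuming Beeline is on the PATH and using the HiveServer2 port 9852 that appears later in this walkthrough:
$ beeline -u "jdbc:hive2://123.456.789.12:9852/"   # use your IP address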
Copy the file kubernetes/env.sh and the directory kubernetes/conf to kubernetes/hive/.
$ cp kubernetes/env.sh kubernetes/hive/
$ cp -r kubernetes/conf kubernetes/hive
Set the host for HiveServer2 in kubernetes/hive/env.sh. Set the environment variable JAVA_HOME if necessary.
$ vi kubernetes/hive/env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/
export PATH=$JAVA_HOME/bin:$PATH
HIVE_SERVER2_HOST=123.456.789.12 # use your IP address
In order to start a Beeline connection, execute kubernetes/hive/hive/run-beeline.sh.
$ kubernetes/hive/hive/run-beeline.sh
...
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://123.456.789.12:9852/>
Use the default database.
0: jdbc:hive2://123.456.789.12:9852/> show databases;
...
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (2.177 seconds)
0: jdbc:hive2://123.456.789.12:9852/> use default;
...
No rows affected (0.056 seconds)
Create a table called pokemon.
0: jdbc:hive2://123.456.789.12:9852/> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");
...
No rows affected (0.653 seconds)
Import the sample dataset.
0: jdbc:hive2://123.456.789.12:9852/> load data local inpath '/opt/mr3-run/work-dir/pokemon.csv' INTO table pokemon;
...
No rows affected (0.332 seconds)
Execute queries.
0: jdbc:hive2://123.456.789.12:9852/> select avg(HP) from pokemon;
...
0: jdbc:hive2://123.456.789.12:9852/> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...
0: jdbc:hive2://123.456.789.12:9852/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...
+------+-------------+
| _c0 | power_rate |
+------+-------------+
| 363 | strong |
| 336 | weak |
| 108 | moderate |
+------+-------------+
3 rows selected (2.009 seconds)
Now we see that new ContainerWorker Pods have been created.
$ kubectl get -n hivemr3 pods
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-tl5h2 1/1 Running 0 3m49s
mr3master-9536-0-pxs2v 1/1 Running 0 3m33s
mr3worker-b094-1 1/1 Running 0 30s
mr3worker-b094-2 1/1 Running 0 10s
The user can find the warehouse directory /home/gla/workdir/warehouse/.
$ du -hs /home/gla/workdir/warehouse/
108K /home/gla/workdir/warehouse/
Stopping HiveServer2 and Metastore
Delete the ReplicationController for HiveServer2.
$ kubectl -n hivemr3 delete replicationcontroller hivemr3-hiveserver2
replicationcontroller "hivemr3-hiveserver2" deleted
Deleting the ReplicationController for HiveServer2 does not automatically terminate the DAGAppMaster Pod. This is a feature, not a bug, due to the support for high availability in Hive on MR3. (After setting the environment variable MR3_APPLICATION_ID_TIMESTAMP properly, running kubernetes/run-hive.sh attaches the existing DAGAppMaster Pod to the new HiveServer2 Pod.)
Delete the ReplicationController for DAGAppMaster, which in turn deletes all ContainerWorker Pods automatically.
$ kubectl get replicationcontroller -n hivemr3
NAME DESIRED CURRENT READY AGE
mr3master-9536-0 1 1 1 4m25s
$ kubectl -n hivemr3 delete replicationcontroller mr3master-9536-0
replicationcontroller "mr3master-9536-0" deleted
Now no Pods should be running in the namespace hivemr3.
To delete all remaining resources, execute the following command:
$ kubectl -n hivemr3 delete configmap --all; kubectl -n hivemr3 delete svc --all; kubectl -n hivemr3 delete secret --all; kubectl -n hivemr3 delete serviceaccount --all; kubectl -n hivemr3 delete role --all; kubectl -n hivemr3 delete rolebinding --all; kubectl delete clusterrole node-reader; kubectl delete clusterrolebinding hive-clusterrole-binding; kubectl -n hivemr3 delete persistentvolumeclaims workdir-pvc; kubectl delete persistentvolumes workdir-pv
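To confirm that the cleanup succeeded, list the remaining resources in the namespace; both commands should report that no resources are found:
$ kubectl get pods -n hivemr3
$ kubectl get all -n hivemr3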
Stop Metastore.
$ hive/metastore-service.sh stop --local --hivesrc3
The user can check if Metastore has successfully stopped by reading its log file.
$ tail -n3 /tmp/gla/hive.log
/************************************************************
SHUTDOWN_MSG: Shutting down HiveMetaStore at .................../123.456.789.12
************************************************************/