This page shows how to use Helm and a pre-built Docker image available at DockerHub in order to operate Hive on MR3 on Minikube. All components (Metastore, HiveServer2, MR3 DAGAppMaster) will be running inside Minikube. For Metastore, we will run a MySQL database as a Docker container on the local machine. By following the instructions, the user will learn:
- how to start Metastore using Helm
- how to use Helm to run Hive on MR3 on Minikube
- how to create Beeline connections and send queries to HiveServer2 running inside Minikube
This scenario has the following prerequisites:
- A running Minikube cluster is available.
- The user should be able to execute: 1) the command kubectl; 2) the command helm to use Helm; 3) the command docker if no MySQL database is available.
This scenario should take less than 30 minutes to complete,
not including the time for downloading a pre-built Docker image.
All commands are executed by user gla.
We use Helm 2.17.0.
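Before starting, the following commands can serve as a quick sanity check of the prerequisites (optional; the exact output depends on the installation):
$ minikube status
$ kubectl get nodes
$ helm version --client
$ docker --version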
For questions, please visit the MR3 Google Group or join MR3 Slack.
Installation
Clone the repository containing the executable scripts and check out the release branch.
$ git clone https://github.com/mr3project/mr3-run-k8s.git
$ cd mr3-run-k8s/kubernetes/
$ git checkout release-1.11-hive3
Starting a MySQL database
For simplicity, we will run a MySQL database for Metastore as a Docker container.
$ docker run -d --name mysql-server -p 3306:3306 -e MYSQL_ROOT_PASSWORD=passwd mysql:5.6
$ mysql --user=root --password=passwd --host=127.0.0.1 -e 'show databases;'
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
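Since Metastore connects to this database from inside Minikube, it may be worth verifying that the database is reachable from a Pod. A minimal sketch, assuming 192.168.10.1 is the IP address of the local machine (use your own IP address, the same one configured later in values-minikube.yaml):
$ kubectl run mysql-client --rm -it --restart=Never --image=mysql:5.6 -- \
    mysql --host=192.168.10.1 --user=root --password=passwd -e 'show databases;'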
Linking configuration files
We will reuse the configuration files in conf/ (and keys in key/ if Kerberos is used for authentication).
Create symbolic links.
$ mkdir -p key
$ ln -s $(pwd)/conf/ helm/hive/conf
$ ln -s $(pwd)/key/ helm/hive/key
Now any change to the configuration files in conf/ is honored when running Hive on MR3.
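To confirm that the links resolve correctly, list them (optional):
$ ls -l helm/hive/conf helm/hive/key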
Creating local directories
We need to create two new local directories:
- one for a PersistentVolume to be shared by Pods;
- one for a hostPath volume for ContainerWorker Pods.
Create a local directory for the PersistentVolume. In our example, we use /home/gla/workdir.
$ mkdir -p /home/gla/workdir
$ chmod 777 /home/gla/workdir
Hive on MR3 uses local disks for writing intermediate data.
When running on Kubernetes, it mounts hostPath volumes that point to directories on the local machine.
For our example, we create a local directory /data1/gla/k8s for the hostPath volume for ContainerWorker Pods.
$ mkdir -p /data1/gla/k8s
$ chmod 777 /data1/gla/k8s
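Note that both directories must be visible to the Minikube node. This walkthrough assumes a setup where the node can access the local file system directly (e.g., the none driver); with a VM- or container-based driver, the directories may first need to be mounted into the node, for example with minikube mount (a sketch; the exact procedure depends on the driver):
$ minikube mount /home/gla/workdir:/home/gla/workdir &
$ minikube mount /data1/gla/k8s:/data1/gla/k8s &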
Configuring Pods
Create a new file helm/hive/values-minikube.yaml, which is a collection of values to override those in helm/hive/values.yaml.
$ vi helm/hive/values-minikube.yaml
docker:
image: mr3project/hive3:1.11
imagePullPolicy: IfNotPresent
create:
metastore: true
metastore:
databaseHost: 192.168.10.1 # use your IP address
warehouseDir: file:///opt/mr3-run/work-dir/warehouse
initSchema: true
resources:
requests:
cpu: 1
memory: 4Gi
limits:
cpu: 1
memory: 4Gi
heapSize: 4096
hive:
externalIp: 192.168.10.1 # use your IP address
resources:
requests:
cpu: 1
memory: 8Gi
limits:
cpu: 1
memory: 8Gi
heapSize: 8192
workDir:
isNfs: false
volumeStr: "hostPath:\n path: /home/gla/workdir"
- docker.image is set to a pre-built Docker image available at DockerHub (e.g., mr3project/hive3:1.11 for Hive 3 on MR3 and mr3project/hive4:4.0.1 for Hive 4 on MR3).
- docker.imagePullPolicy is set to IfNotPresent because we download the Docker image from DockerHub.
- create.metastore is set to true because we will create a Metastore Pod.
- metastore.databaseHost is set to the address of the MySQL database.
- metastore.initSchema is set to true because this is the first run of Metastore. For subsequent runs, the user may set it to false.
- hive.externalIp is set to the public IP address of the local machine.
- workDir.volumeStr is set to a hostPath volume pointing to the local directory for the PersistentVolume.
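The two fields marked "use your IP address" should hold an address of the local machine that is reachable from inside Minikube. On a typical Linux machine, something like the following may help to find it (a quick sketch; pick the address of the appropriate network interface):
$ hostname -I | awk '{print $1}'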
Update helm/hive/templates/metastore.yaml to remove (or comment out) the nodeAffinity rules, as we do not use node affinity.
$ vi helm/hive/templates/metastore.yaml
affinity:
# nodeAffinity:
# requiredDuringSchedulingIgnoredDuringExecution:
# nodeSelectorTerms:
# - matchExpressions:
# - key: roles
# operator: In
# values:
# - "masters"
Update conf/mr3-site.xml.
$ vi conf/mr3-site.xml
<property>
<name>mr3.k8s.pod.image.pull.policy</name>
<value>IfNotPresent</value>
</property>
<property>
<name>mr3.k8s.pod.worker.hostpaths</name>
<value>/data1/gla/k8s/</value>
</property>
- mr3.k8s.pod.image.pull.policy is set to IfNotPresent because we download the Docker image from DockerHub.
- mr3.k8s.pod.worker.hostpaths is set to the path of the local directory for the hostPath volume.
Update conf/hive-site.xml.
$ vi conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>passwd</value>
</property>
<property>
<name>hive.metastore.pre.event.listeners</name>
<value></value>
</property>
<property>
<name>metastore.pre.event.listeners</name>
<value></value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
- javax.jdo.option.ConnectionPassword is set to the password of the MySQL database.
- hive.metastore.pre.event.listeners and metastore.pre.event.listeners are set to empty because we do not enable security on the Metastore side.
- hive.security.authorization.manager is set to use SQLStdHiveAuthorizerFactory.
Starting Hive on MR3
Before running HiveServer2, the user should remove the label node-role.kubernetes.io/master from the minikube node.
This is because Hive on MR3 does not count the resources of master nodes when estimating the resources for ContainerWorker Pods.
Since the minikube node, the only node in a Minikube cluster, is a master node, we should demote it to an ordinary node in order to secure resources for ContainerWorker Pods.
Thus, in order to be able to create ContainerWorker Pods on the minikube node, the user should execute the following command:
$ kubectl label node minikube node-role.kubernetes.io/master-
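On recent Kubernetes versions, the master node may instead carry the label node-role.kubernetes.io/control-plane, which may also need to be removed. Check the labels first and remove whichever is present (a sketch; ignore the error if a label is not found):
$ kubectl get node minikube --show-labels
$ kubectl label node minikube node-role.kubernetes.io/control-plane-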
Before running HiveServer2, the user should also make sure that no ConfigMaps or Services exist in the namespace hivemr3.
For example, the user may see ConfigMaps and Services left over from a previous run.
$ kubectl get configmaps -n hivemr3
NAME DATA AGE
mr3conf-configmap-master 1 16m
mr3conf-configmap-worker 1 16m
$ kubectl get svc -n hivemr3
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service-master-1237-0 ClusterIP 10.105.238.21 <none> 80/TCP 11m
service-worker ClusterIP None <none> <none> 11m
In such a case, manually delete these ConfigMaps and Services.
$ kubectl delete configmap -n hivemr3 mr3conf-configmap-master mr3conf-configmap-worker
$ kubectl delete svc -n hivemr3 service-master-1237-0 service-worker
Install the Helm chart for Hive on MR3 with values-minikube.yaml.
We use hivemr3 for the namespace.
Metastore automatically downloads a MySQL connector from https://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-8.0.28.tar.gz.
$ helm install --namespace hivemr3 helm/hive -f helm/hive/values-minikube.yaml
2022/07/30 23:25:20 found symbolic link in path: /data1/gla/mr3-run-k8s/kubernetes/helm/hive/conf resolves to /data1/gla/mr3-run-k8s/kubernetes/conf
2022/07/30 23:25:20 found symbolic link in path: /data1/gla/mr3-run-k8s/kubernetes/helm/hive/key resolves to /data1/gla/mr3-run-k8s/kubernetes/key
NAME: jaundiced-lightningbug
LAST DEPLOYED: Sat Jul 30 23:25:20 2022
NAMESPACE: hivemr3
STATUS: DEPLOYED
...
==> v1/ConfigMap
NAME DATA AGE
client-am-config 4 0s
env-configmap 1 0s
hivemr3-conf-configmap 15 0s
...
Check if all ConfigMaps are non-empty.
If the DATA column for hivemr3-conf-configmap is 0, try to remove unnecessary files in the directory conf or helm/hive/conf.
This usually happens when a temporary file (e.g., .hive-site.xml.swp) is present at the time of installing the Helm chart.
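This walkthrough uses Helm 2, where helm install generates a release name automatically. If Helm 3 is used instead, the release needs an explicit name and the namespace must already exist; a rough equivalent of the install command (assuming the chart itself works unchanged under Helm 3) would be:
$ kubectl create namespace hivemr3
$ helm install hivemr3 helm/hive --namespace hivemr3 -f helm/hive/values-minikube.yaml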
The user can find three Pods running in the Minikube cluster: 1) Metastore; 2) HiveServer2; 3) MR3 DAGAppMaster.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-b489d4d7f-s77b8 1/1 Running 0 111s
hivemr3-metastore-0 1/1 Running 0 111s
mr3master-4609-0-dfff6fc7f-g8hbv 1/1 Running 0 79s
The HiveServer2 Pod becomes ready after a readiness probe contacts it.
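If a Pod stays in a non-ready state for a while, inspecting its events and logs usually reveals the cause (use the Pod names printed by kubectl get pods -n hivemr3), for example:
$ kubectl describe pod -n hivemr3 hivemr3-metastore-0
$ kubectl logs -n hivemr3 hivemr3-metastore-0
$ kubectl logs -n hivemr3 hivemr3-hiveserver2-b489d4d7f-s77b8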
Running Beeline
Download a sample dataset and copy it to the directory for the PersistentVolume.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv
$ cp pokemon.csv /home/gla/workdir
$ chmod 777 /home/gla/workdir/pokemon.csv
The user can verify that the sample dataset is accessible inside the HiveServer2 Pod.
$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-b489d4d7f-s77b8 -- /bin/bash
hive@hivemr3-hiveserver2-b489d4d7f-s77b8:/opt/mr3-run/hive$ ls /opt/mr3-run/work-dir/
91c13320-b051-41b1-a0d9-d6c4c7d218a3_resources hive
_resultscache_ lib
db7da8c1-d360-43d4-a72a-cbad18e64832_resources pokemon.csv
hive@hivemr3-hiveserver2-b489d4d7f-s77b8:/opt/mr3-run/hive$ exit
The user may use any client program (such as Beeline) to connect to HiveServer2, which is running at port 9852.
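For example, with a local installation of Beeline, a connection might look like the following (a sketch, assuming 192.168.10.1 is the address set for hive.externalIp and no authentication is enabled):
$ beeline -u "jdbc:hive2://192.168.10.1:9852/" -n root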
Alternatively, the user can run Beeline inside the HiveServer2 Pod.
$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-b489d4d7f-s77b8 -- /bin/bash
hive@hivemr3-hiveserver2-b489d4d7f-s77b8:/opt/mr3-run/hive$ export USER=root
hive@hivemr3-hiveserver2-b489d4d7f-s77b8:/opt/mr3-run/hive$ /opt/mr3-run/hive/run-beeline.sh
...
Connecting to jdbc:hive2://hivemr3-hiveserver2-b489d4d7f-s77b8:9852/;;;
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://hivemr3-hiveserver2-b489d4d7f>
Execute queries.
0: jdbc:hive2://hivemr3-hiveserver2-b489d4d7f> show databases;
...
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (2.129 seconds)
0: jdbc:hive2://hivemr3-hiveserver2-b489d4d7f> use default;
...
No rows affected (0.054 seconds)
0: jdbc:hive2://hivemr3-hiveserver2-b489d4d7f> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");
...
No rows affected (1.416 seconds)
0: jdbc:hive2://hivemr3-hiveserver2-b489d4d7f> load data local inpath '/opt/mr3-run/work-dir/pokemon.csv' INTO table pokemon;
...
No rows affected (0.609 seconds)
0: jdbc:hive2://hivemr3-hiveserver2-b489d4d7f> select avg(HP) from pokemon;
...
+---------------------+
| _c0 |
+---------------------+
| 144.84882280049567 |
+---------------------+
1 row selected (12.004 seconds)
0: jdbc:hive2://hivemr3-hiveserver2-b489d4d7f> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...
No rows affected (1.246 seconds)
0: jdbc:hive2://hivemr3-hiveserver2-b489d4d7f> select COUNT(name), power_rate from pokemon1 group by power_rate;
...
+------+-------------+
| _c0 | power_rate |
+------+-------------+
| 108 | moderate |
| 363 | strong |
| 336 | weak |
+------+-------------+
3 rows selected (1.328 seconds)
The user can see that ContainerWorker Pods have been created.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-b489d4d7f-s77b8 1/1 Running 0 6m42s
hivemr3-metastore-0 1/1 Running 0 6m42s
mr3master-4609-0-dfff6fc7f-g8hbv 1/1 Running 0 6m10s
mr3worker-2590-1 1/1 Running 0 71s
mr3worker-2590-2 1/1 Running 0 19s
The user can find the warehouse directory /home/gla/workdir/warehouse.
$ ls /home/gla/workdir/warehouse
pokemon pokemon1
Stopping Hive on MR3
In order to terminate Hive on MR3, the user should first delete the DAGAppMaster Pod and then delete the Helm chart, not the other way around. This is because deleting the Helm chart revokes the ServiceAccount object that DAGAppMaster uses to delete ContainerWorker Pods. Hence, if the user deletes the Helm chart first, all remaining Pods must be deleted manually.
Delete the Deployment for DAGAppMaster, which in turn deletes all ContainerWorker Pods automatically.
$ kubectl get deployment -n hivemr3
NAME READY UP-TO-DATE AVAILABLE AGE
hivemr3-hiveserver2 1/1 1 1 7m8s
mr3master-4609-0 1/1 1 1 6m36s
$ kubectl -n hivemr3 delete deployment mr3master-4609-0
deployment.extensions "mr3master-4609-0" deleted
Delete the Helm chart.
$ helm delete jaundiced-lightningbug
release "jaundiced-lightningbug" deleted
As the last step, the user will find that the following objects belonging to the namespace hivemr3 are still alive:
- two ConfigMaps mr3conf-configmap-master and mr3conf-configmap-worker
- the Service for DAGAppMaster, e.g., service-master-4609-0
- the Service service-worker
$ kubectl get configmaps -n hivemr3
NAME DATA AGE
mr3conf-configmap-master 1 7m36s
mr3conf-configmap-worker 1 7m32s
$ kubectl get svc -n hivemr3
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service-master-4609-0 ClusterIP 10.103.4.82 <none> 80/TCP,9890/TCP 7m50s
service-worker ClusterIP None <none> <none> 7m47s
These ConfigMaps and Services are not deleted by the command helm delete because they are created not by Helm but by HiveServer2 and DAGAppMaster.
Hence the user should delete these ConfigMaps and Services manually.
$ kubectl delete configmap -n hivemr3 mr3conf-configmap-master mr3conf-configmap-worker
configmap "mr3conf-configmap-master" deleted
configmap "mr3conf-configmap-worker" deleted
$ kubectl delete svc -n hivemr3 service-master-4609-0 service-worker
service "service-master-4609-0" deleted
service "service-worker" deleted