This page shows how to use a pre-built Docker image available at DockerHub to operate Hive on MR3 with Minikube. All components (Metastore, HiveServer2, MR3 DAGAppMaster) will run inside Minikube. For Metastore, we will run a MySQL database as a Docker container, although an existing MySQL database can also be used. By following this guide, the user will learn:
- how to start Metastore
- how to run Hive on MR3
- how to create Beeline connections and send queries to HiveServer2 running inside Minikube
This scenario has the following prerequisites:
- A running Minikube cluster should be available.
- The user should be able to execute the command docker (if no MySQL database is available) and the command kubectl.
- A MySQL connector should be available.
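To quickly check the first two prerequisites, the following commands (assuming a standard Minikube installation) should succeed:
$ minikube status
$ kubectl get nodes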
This scenario should take less than 30 minutes to complete, not including the time for downloading a pre-built Docker image.
This page has been tested with MR3 release 1.2 on CentOS 7.5 running Minikube v1.2.0, using user gla.
Installation
Download an MR3 release containing the executable scripts.
$ git clone https://github.com/mr3project/mr3-run-k8s.git
$ cd mr3-run-k8s/kubernetes/
$ git reset --hard 3287a3dcda7cdd4875fcc2b5e345bc9089f000dc
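As a sanity check, the scripts and directories used in the rest of this guide should now be present:
$ ls env.sh run-hive.sh run-metastore.sh conf yaml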
Starting a MySQL database
For simplicity, we will run a MySQL database for Metastore as a Docker container.
$ docker run -d --name mysql-server -p 3306:3306 -e MYSQL_ROOT_PASSWORD=passwd mysql:5.6
$ mysql --user=root --password=passwd --host=127.0.0.1 -e 'show databases;'
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
+--------------------+
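Since Metastore running inside Minikube connects to this database at the address given by HIVE_DATABASE_HOST (set in env.sh below), it is worth checking that MySQL is reachable at the IP address of the local machine, not only at 127.0.0.1. A quick check, with 111.111.111.11 standing in for your IP address:
$ mysql --user=root --password=passwd --host=111.111.111.11 -e 'select version();'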
Creating local directories
In our scenario, Hive on MR3 uses four kinds of Pods: Metastore Pod, HiveServer2 Pod, DAGAppMaster Pod, and ContainerWorker Pod.
A Metastore Pod runs a Metastore container, and the user creates a Metastore Pod by executing the script run-metastore.sh.
A HiveServer2 Pod runs a HiveServer2 container, and the user creates a HiveServer2 Pod by executing the script run-hive.sh.
A DAGAppMaster Pod is created by HiveServer2, and a ContainerWorker Pod runs a ContainerWorker container and is created by DAGAppMaster at runtime.
We need to create two new local directories:
- one for a PersistentVolume to be shared by all Pods;
- one for a hostPath volume where ContainerWorker Pods write intermediate data.
Create a local directory for the PersistentVolume.
$ mkdir /data1/gla/workdir
$ chmod 777 /data1/gla/workdir
Create a local directory for the hostPath volume for ContainerWorker Pods.
$ mkdir -p /data1/gla/k8s
$ chmod 777 /data1/gla/k8s
Preparing a MySQL connector
The user should have a MySQL connector compatible with the MySQL database for Metastore. The official JDBC driver for MySQL can be downloaded at https://dev.mysql.com/downloads/connector/j/.
Copy the MySQL connector to the directory lib under the local directory for the PersistentVolume.
$ mkdir -p /data1/gla/workdir/lib
$ cp mysql-connector-java-8.0.12.jar /data1/gla/workdir/lib/
$ chmod 777 /data1/gla/workdir/lib/mysql-connector-java-8.0.12.jar
Configuring Pods
Open yaml/workdir-pv.yaml and create a hostPath PersistentVolume using the directory created in the previous step.
$ vi yaml/workdir-pv.yaml
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Delete
hostPath:
path: "/data1/gla/workdir"
Open yaml/hiveserver2-service.yaml and use the IP address of the local machine.
$ vi yaml/hiveserver2-service.yaml
externalIPs:
- 111.111.111.11 # use your IP address
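If you are unsure of the IP address of the local machine, one way to find it on a typical Linux host is shown below; pick the address reachable from your clients.
$ hostname -I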
Open env.sh and set the following environment variables.
$ vi env.sh
DOCKER_HIVE_IMG=mr3project/hive3:1.2
DOCKER_HIVE_WORKER_IMG=mr3project/hive3:1.2
CREATE_KEYTAB_SECRET=false # do not create a Secret from key/*
HIVE_DATABASE_HOST=111.111.111.11 # use your IP address (where the MySQL database is running)
HIVE_WAREHOUSE_DIR=/opt/mr3-run/work-dir/warehouse
METASTORE_SECURE_MODE=false # disable Kerberos authentication
HIVE_SERVER2_HEAPSIZE=8192
HIVE_SERVER2_AUTHENTICATION=NONE
TOKEN_RENEWAL_HDFS_ENABLED=false
HIVE_METASTORE_HEAPSIZE=8192
HIVE_CLIENT_HEAPSIZE=2048 # heap size for Beeline
Open yaml/metastore.yaml and update the following fields.
- image is set to the pre-built Docker image mr3project/hive3:1.2 available at DockerHub.
- imagePullPolicy is set to IfNotPresent because we download the Docker image from DockerHub.
- args includes "--init-schema" because this is the first time Metastore is run.
- We mount work-dir-volume so that the MySQL connector in the lib subdirectory becomes visible inside the Metastore Pod.
- Change the resources if necessary.
$ vi yaml/metastore.yaml
containers:
- image: mr3project/hive3:1.2
command: ["/opt/mr3-run/hive/metastore-service.sh"]
args: ["start", "--kubernetes", "--init-schema"]
imagePullPolicy: IfNotPresent
resources:
requests:
cpu: 1
memory: 8Gi
limits:
cpu: 1
memory: 8Gi
volumeMounts:
- name: work-dir-volume
mountPath: /opt/mr3-run/work-dir/
- name: work-dir-volume
mountPath: /opt/mr3-run/lib
subPath: lib
Open yaml/hive.yaml and update image and imagePullPolicy so that Minikube reads the pre-built Docker image from DockerHub.
Change the resources if necessary.
$ vi yaml/hive.yaml
containers:
- image: mr3project/hive3:1.2
imagePullPolicy: IfNotPresent
resources:
requests:
cpu: 1
memory: 8Gi
limits:
cpu: 1
memory: 8Gi
Open conf/mr3-site.xml and set the configuration key mr3.k8s.pod.image.pull.policy to IfNotPresent.
Set the configuration key mr3.k8s.pod.worker.hostpaths to the local directory for the hostPath volume where ContainerWorker Pods write intermediate data (not the directory for the PersistentVolume).
$ vi conf/mr3-site.xml
<property>
<name>mr3.k8s.pod.image.pull.policy</name>
<value>IfNotPresent</value>
</property>
<property>
<name>mr3.k8s.pod.worker.hostpaths</name>
<value>/data1/gla/k8s</value>
</property>
Update conf/hive-site.xml.
$ vi conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>passwd</value>
</property>
<property>
<name>hive.metastore.pre.event.listeners</name>
<value></value>
</property>
<property>
<name>metastore.pre.event.listeners</name>
<value></value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
- javax.jdo.option.ConnectionPassword is set to the password of the MySQL database.
- hive.metastore.pre.event.listeners and metastore.pre.event.listeners are set to empty because we do not enable security on the Metastore side.
Update conf/core-site.xml.
$ vi conf/core-site.xml
<property>
<name>hadoop.security.authentication</name>
<value>simple</value>
</property>
hadoop.security.authentication is set to simple in order to disable Kerberos authentication.
Starting Hive on MR3
Before starting Hive on MR3, the user should remove the label node-role.kubernetes.io/master from the minikube node.
This is because Hive on MR3 does not count the resources of master nodes when estimating the resources available for ContainerWorker Pods.
Since the minikube node, the only node in a Minikube cluster, is a master node, we should demote it to an ordinary node in order to secure resources for ContainerWorker Pods.
Thus, in order to be able to create ContainerWorker Pods on the minikube node, the user should execute the following command:
$ kubectl label node minikube node-role.kubernetes.io/master-
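After removing the label, the minikube node should no longer report the master role (the exact output depends on the Kubernetes version):
$ kubectl get node minikube
NAME       STATUS   ROLES    AGE   VERSION
minikube   Ready    <none>   ...   ...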
Before starting Hive on MR3, the user should also make sure that no ConfigMaps exist in the namespace hivemr3. For example, the user may see ConfigMaps left over from a previous run.
$ kubectl get configmaps -n hivemr3
NAME DATA AGE
mr3conf-configmap-master 1 14m
mr3conf-configmap-worker 1 14m
In such a case, manually delete these ConfigMaps.
$ kubectl delete configmap -n hivemr3 mr3conf-configmap-master mr3conf-configmap-worker
In order to run Metastore, the user can execute the script run-metastore.sh.
$ ./run-metastore.sh
...
CLIENT_TO_AM_TOKEN_KEY=0ea834ee-2e5b-4528-a051-7d4b02c9973f
MR3_APPLICATION_ID_TIMESTAMP=15910
MR3_SHARED_SESSION_ID=f6f2c854-b11b-4aed-bf17-fe5e51806610
ATS_SECRET_KEY=2026c87f-94bb-4484-b905-3a5aa81f4f5d
configmap/client-am-config created
statefulset.apps/hivemr3-metastore created
service/metastore created
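Before proceeding, the user can check that the Metastore Pod has started and that the MySQL connector is visible inside the Pod at /opt/mr3-run/lib (mounted via the subPath lib of work-dir-volume):
$ kubectl get pods -n hivemr3
$ kubectl logs -n hivemr3 hivemr3-metastore-0 | tail    # watch schema initialization
$ kubectl exec -n hivemr3 hivemr3-metastore-0 -- ls /opt/mr3-run/lib
mysql-connector-java-8.0.12.jar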
In order to run HiveServer2, the user can execute the script run-hive.sh. The AlreadyExists error on the ConfigMap client-am-config in the output below is harmless: the ConfigMap was already created by run-metastore.sh.
$ ./run-hive.sh
...
CLIENT_TO_AM_TOKEN_KEY=96c4de14-4db9-4e95-9fc7-a8545f165dbb
MR3_APPLICATION_ID_TIMESTAMP=28302
MR3_SHARED_SESSION_ID=287286f0-bf63-4ad7-9a27-672d01f4d230
ATS_SECRET_KEY=c54e3952-1546-4236-8f55-26b14ecaf0ff
Error from server (AlreadyExists): configmaps "client-am-config" already exists
replicationcontroller/hivemr3-hiveserver2 created
service/hiveserver2 created
These scripts mount the following files inside the Metastore and HiveServer2 Pods:
- env.sh
- conf/*
In this way, the user can completely specify the behavior of Metastore and HiveServer2 as well as DAGAppMaster and ContainerWorkers.
Executing the script run-hive.sh starts a HiveServer2 Pod and, shortly afterwards, a DAGAppMaster Pod.
HiveServer2 and DAGAppMaster become ready after a while.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-c8gwq 1/1 Running 0 90s
hivemr3-metastore-0 1/1 Running 0 119s
mr3master-5910-0-tbxfm 1/1 Running 0 74s
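If HiveServer2 does not become ready, its logs are the first place to look:
$ kubectl logs -n hivemr3 hivemr3-hiveserver2-c8gwq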
Running Beeline
Download a sample dataset and copy it to the directory for the PersistentVolume.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv
$ cp pokemon.csv /data1/gla/workdir
$ chmod 777 /data1/gla/workdir/pokemon.csv
The user can verify that the sample dataset is accessible inside the HiveServer2 Pod.
$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-c8gwq -- /bin/bash
root@hivemr3-hiveserver2-c8gwq:/opt/mr3-run/hive# ls /opt/mr3-run/work-dir/pokemon.csv
/opt/mr3-run/work-dir/pokemon.csv
root@hivemr3-hiveserver2-c8gwq:/opt/mr3-run/hive# exit
The user may use any client program to connect to HiveServer2. In our example, we run Beeline inside the HiveServer2 Pod.
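As an aside, since yaml/hiveserver2-service.yaml exposes HiveServer2 at the external IP address configured earlier, a client outside Minikube can also connect. A minimal sketch, assuming that Beeline is installed on the client machine and that the service exposes the same Thrift port 9852 that appears in the Beeline output below:
# Replace 111.111.111.11 with the external IP address set in yaml/hiveserver2-service.yaml.
$ beeline -u "jdbc:hive2://111.111.111.11:9852/" -n root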
$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-c8gwq -- /bin/bash
root@hivemr3-hiveserver2-c8gwq:/opt/mr3-run/hive# export USER=root
root@hivemr3-hiveserver2-c8gwq:/opt/mr3-run/hive# /opt/mr3-run/hive/run-beeline.sh
...
Connecting to jdbc:hive2://hivemr3-hiveserver2-c8gwq:9852/;;;
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985>
Use the default database.
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> show databases;
...
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (2.131 seconds)
Create a table called pokemon.
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");
Import the sample dataset.
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> load data local inpath '/opt/mr3-run/work-dir/pokemon.csv' INTO table pokemon;
Execute queries.
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> select avg(HP) from pokemon;
...
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...
0: jdbc:hive2://hivemr3-hiveserver2-c8gwq:985> select COUNT(name), power_rate from pokemon1 group by power_rate;
...
+------+-------------+
| _c0 | power_rate |
+------+-------------+
| 363 | strong |
| 336 | weak |
| 108 | moderate |
+------+-------------+
3 rows selected (1.363 seconds)
Now we see that new ContainerWorker Pods have been created.
$ kubectl get pods -n hivemr3
NAME READY STATUS RESTARTS AGE
hivemr3-hiveserver2-c8gwq 1/1 Running 0 5m47s
hivemr3-metastore-0 1/1 Running 0 6m16s
mr3master-5910-0-tbxfm 1/1 Running 0 5m31s
mr3worker-235c-1 1/1 Running 0 64s
mr3worker-235c-2 1/1 Running 0 23s
The user can find the warehouse directory /data1/gla/workdir/warehouse/.
$ ls /data1/gla/workdir/warehouse
pokemon pokemon1
Stopping Hive on MR3
Delete the ReplicationController for HiveServer2.
$ kubectl -n hivemr3 delete replicationcontroller hivemr3-hiveserver2
replicationcontroller "hivemr3-hiveserver2" deleted
Deleting the ReplicationController for HiveServer2 does not automatically terminate the DAGAppMaster Pod.
This is a feature, not a bug, owing to the support for high availability in Hive on MR3.
(After setting the environment variable MR3_APPLICATION_ID_TIMESTAMP properly, running run-hive.sh attaches the existing DAGAppMaster Pod to the new HiveServer2 Pod.)
Delete the ReplicationController for DAGAppMaster.
$ kubectl delete replicationcontroller -n hivemr3 mr3master-5910-0
replicationcontroller "mr3master-5910-0" deleted
Deleting the DAGAppMaster Pod automatically deletes all ContainerWorker Pods as well.
Delete the StatefulSet for Metastore.
$ kubectl -n hivemr3 delete statefulset hivemr3-metastore
statefulset.apps "hivemr3-metastore" deleted
After a while, no Pods should be running in the namespace hivemr3.
To delete all remaining resources, execute the following commands:
$ kubectl -n hivemr3 delete configmap --all
$ kubectl -n hivemr3 delete svc --all
$ kubectl -n hivemr3 delete secret --all
$ kubectl -n hivemr3 delete serviceaccount --all
$ kubectl -n hivemr3 delete role --all
$ kubectl -n hivemr3 delete rolebinding --all
$ kubectl delete clusterrole node-reader
$ kubectl delete clusterrolebinding hive-clusterrole-binding
$ kubectl -n hivemr3 delete persistentvolumeclaims workdir-pvc
$ kubectl delete persistentvolumes workdir-pv
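Finally, if you started the MySQL container at the beginning of this guide, stop and remove it as well:
$ docker stop mysql-server
$ docker rm mysql-server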