This page shows how to operate Hive on MR3 with Minikube on a single machine. HiveServer2 and MR3 DAGAppMaster will be running inside Minikube, whereas Metastore will be running outside Minikube (i.e., as a process on the local machine) so as to simulate an environment in which Hive accesses a remote data source. By following these instructions, the user will learn:

  1. how to install Hive on MR3 on a single machine
  2. how to start Metastore with a Derby database
  3. how to run Hive on MR3 using Minikube
  4. how to create Beeline connections and send queries to HiveServer2 running inside Minikube

This scenario has the following prerequisites:

  • A running Minikube cluster should be available.
  • The user should be able to execute: 1) the command docker, to build Docker images; 2) the command kubectl, to start Pods.

This scenario should take less than 30 minutes to complete, not including the time for downloading a Hadoop binary distribution and an MR3 release. This page has been tested with MR3 release 1.2 on CentOS 7.5 running Minikube v1.2.0 using user gla.

Installation

Download a Hadoop binary distribution and uncompress it. For Hive 3 and earlier, Hadoop 2.7.7 works okay.

$ wget http://apache.tt.co.kr/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz 
$ gunzip -c hadoop-2.7.7.tar.gz | tar xvf -

Download a pre-built MR3 release and uncompress it. Below we choose the pre-built MR3 release based on Hive 3.1.2, which corresponds to the --hivesrc3 option to be used later.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.2/hivemr3-1.2-hive3.1.2.tar.gz
$ gunzip -c hivemr3-1.2-hive3.1.2.tar.gz | tar xvf -
$ cd hivemr3-1.2-hive3.1.2

Set the environment variable JAVA_HOME if necessary. Update the environment variable HADOOP_HOME in env.sh so that it points to the installation directory of the Hadoop binary distribution.

$ vi env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/data1/gla/hadoop-2.7.7

Building a Docker image

Collect all necessary files for running Hive on MR3 in the directory kubernetes/hive by executing build-k8s.sh.

$ ./build-k8s.sh --hivesrc3
$ ls kubernetes/hive/hadoop/apache-hadoop/
bin  etc  lib  libexec  share
$ ls kubernetes/hive/hive/apache-hive/
bin  conf  hcatalog  lib

Open kubernetes/env.sh and set DOCKER_HIVE_IMG so that Minikube reads the Docker image from the local machine.

$ vi kubernetes/env.sh

DOCKER_HIVE_IMG=hive3

Edit kubernetes/build-hive.sh so that the Docker image is kept on the local machine, i.e., comment out the command that pushes it to a remote registry.

$ vi kubernetes/build-hive.sh 

sudo docker build -t $DOCKER_HIVE_IMG -f $DOCKER_HIVE_FILE .
# sudo docker push $DOCKER_HIVE_IMG

Run kubernetes/build-hive.sh to build a Docker image.

$ kubernetes/build-hive.sh

Running Metastore

We will run Metastore outside Minikube with a Derby database, using the --local option. Open conf/local/hive3/hive-site.xml and set the configuration key hive.metastore.warehouse.dir as follows:

$ vi conf/local/hive3/hive-site.xml

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>file:///opt/mr3-run/work-dir/warehouse</value>
</property>

Here /opt/mr3-run/work-dir is the directory where a PersistentVolume will be mounted inside all Pods. The user should also have write permission on the same directory /opt/mr3-run/work-dir outside Minikube.

$ ls -alt /opt/mr3-run/work-dir
total 8
drwxrwxrwx 2 root root 4096 Oct 27 15:45 .
drwxr-xr-x 3 root root 4096 Jul 23  2019 ..
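
If the directory does not yet exist on the local machine, it can be created with permissions matching the listing above (a sketch; adjust ownership to your environment):

$ sudo mkdir -p /opt/mr3-run/work-dir
$ sudo chmod 777 /opt/mr3-run/work-dir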

Run hive/metastore-service.sh to start Metastore.

$ hive/metastore-service.sh start --local --hivesrc3 --init-schema

# Running Metastore using Hive-MR3 (3.1.2) #

Output Directory: 
/data1/gla/hivemr3-1.2-hive3.1.2/hive/metastore-service-result/hive-mr3-5ba3d48-2020-10-27-15-46-43-3959b523

Starting Metastore...
Output Directory: 
/data1/gla/hivemr3-1.2-hive3.1.2/hive/metastore-service-result/hive-mr3-5ba3d48-2020-10-27-15-46-43-3959b523

After a while, check if Metastore has successfully started.

$ cat /data1/gla/hivemr3-1.2-hive3.1.2/hive/metastore-service-result/hive-mr3-5ba3d48-2020-10-27-15-46-43-3959b523/out-metastore.txt 

Initialization script completed
schemaTool completed
2020-10-27 15:46:49: Starting Hive Metastore Server
...

Check the log file for Metastore.

$ ls -alt /tmp/gla/hive.log
-rw-rw-r-- 1 gla gla 79340 Oct 27 15:49 /tmp/gla/hive.log

Check the database directory for Metastore.

$ ls hive/hive-local-data/metastore5/hive3mr3/
dbex.lck  db.lck  log  README_DO_NOT_TOUCH_FILES.txt  seg0  service.properties  tmp

Configuring Pods

By default, Hive on MR3 creates three kinds of Pods: HiveServer2 Pod, DAGAppMaster Pod, and ContainerWorker Pod. A HiveServer2 Pod runs a HiveServer2 container, and the user creates a HiveServer2 Pod by executing the script kubernetes/run-hive.sh. A DAGAppMaster Pod is created by HiveServer2, and a ContainerWorker Pod runs a ContainerWorker container and is created by DAGAppMaster at runtime.

Create a directory for the PersistentVolume to be shared by all Pods.

$ mkdir -p /home/gla/workdir
$ chmod 777 /home/gla/workdir

Open kubernetes/yaml/workdir-pv.yaml and create a hostPath PersistentVolume using the directory created in the previous step.

$ vi kubernetes/yaml/workdir-pv.yaml

spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Delete
  hostPath:
    path: "/home/gla/workdir"

Open kubernetes/yaml/hiveserver2-service.yaml and set externalIPs to the IP address of the local machine.

$ vi kubernetes/yaml/hiveserver2-service.yaml

  externalIPs:
  - 111.111.111.11        # use your IP address
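
If the IP address of the local machine is not known, a command such as the following prints the addresses assigned to its network interfaces:

$ hostname -I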

Open kubernetes/yaml/hive.yaml, and update image and imagePullPolicy so that Minikube reads the Docker image from the local machine.

$ vi kubernetes/yaml/hive.yaml

#       - image: 10.1.91.17:5000/hive3:latest
        - image: hive3:latest

#       imagePullPolicy: Always
        imagePullPolicy: Never
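
Since imagePullPolicy is set to Never, the Docker image must already exist on the local machine. The user can check for the image built earlier (hive3, the value of DOCKER_HIVE_IMG in our example):

$ sudo docker images hive3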

Change the resources for HiveServer2 if necessary.

$ vi kubernetes/yaml/hive.yaml

        resources:
          requests:
            cpu: 1
            memory: 16Gi
          limits:
            cpu: 1
            memory: 16Gi

Open kubernetes/env.sh (not env.sh in the installation directory) and set the following environment variables.

$ vi kubernetes/env.sh

CREATE_KEYTAB_SECRET=false            # do not create a Secret from key/*

HIVE_DATABASE_HOST=111.111.111.11     # use your IP address
HIVE_METASTORE_HOST=111.111.111.11    # use your IP address
HIVE_METASTORE_PORT=9831              # 9831 is from HIVE3_METASTORE_LOCAL_PORT in env.sh

HIVE_WAREHOUSE_DIR=/opt/mr3-run/work-dir/warehouse

METASTORE_SECURE_MODE=false           # disable Kerberos authentication

HIVE_SERVER2_HEAPSIZE=16384           # no larger than resources.limits.memory in kubernetes/yaml/hive.yaml
HIVE_SERVER2_AUTHENTICATION=NONE

TOKEN_RENEWAL_HDFS_ENABLED=false

HIVE_CLIENT_HEAPSIZE=2048             # heap size for Beeline
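
Since HiveServer2 running inside Minikube connects to Metastore at HIVE_METASTORE_HOST:HIVE_METASTORE_PORT, it is a good idea to confirm that Metastore is listening on port 9831 on the local machine (a quick check, assuming ss is available):

$ ss -nlt | grep 9831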

Open kubernetes/conf/core-site.xml and set the configuration key hadoop.security.authentication to simple to disable Kerberos authentication.

$ vi kubernetes/conf/core-site.xml 

<property>
  <name>hadoop.security.authentication</name>
  <value>simple</value>
</property>

Open kubernetes/conf/mr3-site.xml and set the configuration key mr3.k8s.pod.image.pull.policy to Never so that Minikube reads the Docker image from the local machine when creating DAGAppMaster and ContainerWorker Pods.

$ vi kubernetes/conf/mr3-site.xml 

<property>
  <name>mr3.k8s.pod.image.pull.policy</name>
  <value>Never</value>
</property>

Hive on MR3 uses local disks for writing intermediate data. When running on Kubernetes, we use hostPath volumes to mount directories of the local machine inside ContainerWorker Pods. In our example, we create two directories, each of which resides on its own local disk.

$ mkdir -p /data1/gla/k8s/
$ mkdir -p /data2/gla/k8s/
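
ContainerWorker Pods write intermediate data to these directories through hostPath volumes, so they should be writable by the user inside the ContainerWorker containers. If in doubt, granting write permission is a safe choice (an assumption about the container user; adjust to your security policy):

$ chmod 777 /data1/gla/k8s/ /data2/gla/k8s/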

Then we set the configuration key mr3.k8s.pod.worker.hostpaths in kubernetes/conf/mr3-site.xml to these directories.

$ vi kubernetes/conf/mr3-site.xml 

<property>
  <name>mr3.k8s.pod.worker.hostpaths</name>
  <value>/data1/gla/k8s/,/data2/gla/k8s/</value>
</property>

Set the configuration keys hive.security.authenticator.manager and hive.security.authorization.manager in kubernetes/conf/hive-site.xml.

$ vi kubernetes/conf/hive-site.xml

<property>
  <name>hive.security.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator</value>
</property>

<property>
  <name>hive.security.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value> 
</property>

Running HiveServer2

Before running HiveServer2, the user should remove the label node-role.kubernetes.io/master from the minikube node. Hive on MR3 does not count the resources of master nodes when estimating the resources available for ContainerWorker Pods, and the minikube node, the only node in a Minikube cluster, is a master node. Hence we demote it to an ordinary node so that ContainerWorker Pods can be created on it. The user should execute the following command:

$ kubectl label node minikube node-role.kubernetes.io/master-
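
To confirm that the label has been removed, list the labels on the minikube node; node-role.kubernetes.io/master should no longer appear:

$ kubectl get node minikube --show-labels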

In order to run HiveServer2, the user can execute the script kubernetes/run-hive.sh.

$ kubernetes/run-hive.sh
...
CLIENT_TO_AM_TOKEN_KEY=d9b08003-c240-4c36-bd64-464bee69cc4d
MR3_APPLICATION_ID_TIMESTAMP=29536
MR3_SHARED_SESSION_ID=4f6bb998-6efc-4d0f-9482-85370b982f1f
ATS_SECRET_KEY=94eb710f-8bc9-41c5-bb92-ffa76aa82031
configmap/client-am-config created
replicationcontroller/hivemr3-hiveserver2 created
service/hiveserver2 created

The script mounts the following files inside the HiveServer2 Pod:

  • kubernetes/env.sh
  • kubernetes/conf/*

In this way, the user can completely specify the behavior of HiveServer2 as well as DAGAppMaster and ContainerWorkers.
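
The script also creates ConfigMaps in the namespace hivemr3 (e.g., client-am-config shown in the output above); they can be listed with:

$ kubectl get configmaps -n hivemr3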

Shortly after executing the script kubernetes/run-hive.sh, a HiveServer2 Pod and a DAGAppMaster Pod are created.

$ kubectl get pods -n hivemr3
NAME                        READY   STATUS    RESTARTS   AGE
hivemr3-hiveserver2-tl5h2   0/1     Running   0          29s
mr3master-9536-0-pxs2v      0/1     Running   0          13s
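
Both Pods may take a while to become ready (READY 1/1). To watch their progress:

$ kubectl get pods -n hivemr3 -w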

Running Beeline

Download a sample dataset and copy it to the directory for the PersistentVolume.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv
$ cp pokemon.csv /home/gla/workdir
$ chmod 777 /home/gla/workdir/pokemon.csv 

The user can verify that the sample dataset is accessible inside the HiveServer2 Pod.

$ kubectl exec -n hivemr3 -it hivemr3-hiveserver2-tl5h2 -- /bin/bash
root@hivemr3-hiveserver2-tl5h2:/opt/mr3-run/hive# ls /opt/mr3-run/work-dir/pokemon.csv
/opt/mr3-run/work-dir/pokemon.csv
root@hivemr3-hiveserver2-tl5h2:/opt/mr3-run/hive# exit

While the user may use any client program to connect to HiveServer2, the MR3 release provides a script kubernetes/hive/hive/run-beeline.sh which slightly simplifies the process of configuring Beeline.
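
For reference, any standalone Beeline installation can also connect directly with a JDBC URL; in this scenario HiveServer2 listens on port 9852, so a connection command would look like the following (a sketch; adjust the IP address):

$ beeline -u "jdbc:hive2://123.456.789.12:9852/"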

Copy the file kubernetes/env.sh and the directory kubernetes/conf to kubernetes/hive/.

$ cp kubernetes/env.sh kubernetes/hive/
$ cp -r kubernetes/conf kubernetes/hive

Set the host for HiveServer2 in kubernetes/hive/env.sh. Set the environment variable JAVA_HOME if necessary.

$ vi kubernetes/hive/env.sh 

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/
export PATH=$JAVA_HOME/bin:$PATH

HIVE_SERVER2_HOST=123.456.789.12      # use your IP address

In order to start a Beeline connection, execute kubernetes/hive/hive/run-beeline.sh.

$ kubernetes/hive/hive/run-beeline.sh
...
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://123.456.789.12:9852/> 

Use the default database.

0: jdbc:hive2://123.456.789.12:9852/> show databases;
...
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (2.177 seconds)
0: jdbc:hive2://123.456.789.12:9852/> use default;
...
No rows affected (0.056 seconds)

Create a table called pokemon.

0: jdbc:hive2://123.456.789.12:9852/> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");
...
No rows affected (0.653 seconds)

Import the sample dataset.

0: jdbc:hive2://123.456.789.12:9852/> load data local inpath '/opt/mr3-run/work-dir/pokemon.csv' INTO table pokemon;
...
No rows affected (0.332 seconds)

Execute queries.

0: jdbc:hive2://123.456.789.12:9852/> select avg(HP) from pokemon;
...
0: jdbc:hive2://123.456.789.12:9852/> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...
0: jdbc:hive2://123.456.789.12:9852/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...
+------+-------------+
| _c0  | power_rate  |
+------+-------------+
| 363  | strong      |
| 336  | weak        |
| 108  | moderate    |
+------+-------------+
3 rows selected (2.009 seconds)

Now we see that new ContainerWorker Pods have been created.

$ kubectl get -n hivemr3 pods
NAME                        READY   STATUS    RESTARTS   AGE
hivemr3-hiveserver2-tl5h2   1/1     Running   0          3m49s
mr3master-9536-0-pxs2v      1/1     Running   0          3m33s
mr3worker-b094-1            1/1     Running   0          30s
mr3worker-b094-2            1/1     Running   0          10s

The user can find the warehouse directory at /home/gla/workdir/warehouse/.

$ du -hs /home/gla/workdir/warehouse/ 
108K  /home/gla/workdir/warehouse/
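
Listing the warehouse directory should show subdirectories for the tables created above, pokemon and pokemon1:

$ ls /home/gla/workdir/warehouse/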

Stopping HiveServer2 and Metastore

Delete the ReplicationController for HiveServer2.

$ kubectl -n hivemr3 delete replicationcontroller hivemr3-hiveserver2
replicationcontroller "hivemr3-hiveserver2" deleted

Deleting the ReplicationController for HiveServer2 does not automatically terminate the DAGAppMaster Pod. This is a feature, not a bug, and is due to the support for high availability in Hive on MR3. (After setting the environment variable MR3_APPLICATION_ID_TIMESTAMP properly, running kubernetes/run-hive.sh attaches the existing DAGAppMaster Pod to the new HiveServer2 Pod.)
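
For example, a new HiveServer2 Pod could reattach to the existing DAGAppMaster Pod roughly as follows (a sketch based on the remark above, reusing the value printed by the earlier run of the script; not a step required for this scenario):

$ export MR3_APPLICATION_ID_TIMESTAMP=29536   # value printed by the previous run of kubernetes/run-hive.sh
$ kubernetes/run-hive.sh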

Delete the ReplicationController for DAGAppMaster, which in turn deletes all ContainerWorker Pods automatically.

$ kubectl get replicationcontroller -n hivemr3
NAME               DESIRED   CURRENT   READY   AGE
mr3master-9536-0   1         1         1       4m25s
$ kubectl -n hivemr3 delete replicationcontroller mr3master-9536-0
replicationcontroller "mr3master-9536-0" deleted

Now no Pods should be running in the namespace hivemr3. To delete all remaining resources, execute the following commands:

$ kubectl -n hivemr3 delete configmap --all
$ kubectl -n hivemr3 delete svc --all
$ kubectl -n hivemr3 delete secret --all
$ kubectl -n hivemr3 delete serviceaccount --all
$ kubectl -n hivemr3 delete role --all
$ kubectl -n hivemr3 delete rolebinding --all
$ kubectl delete clusterrole node-reader
$ kubectl delete clusterrolebinding hive-clusterrole-binding
$ kubectl -n hivemr3 delete persistentvolumeclaims workdir-pvc
$ kubectl delete persistentvolumes workdir-pv

Stop Metastore.

$ hive/metastore-service.sh stop --local --hivesrc3

The user can check if Metastore has successfully stopped by reading its log file.

$ tail -n3 /tmp/gla/hive.log
/************************************************************
SHUTDOWN_MSG: Shutting down HiveMetaStore at .................../123.456.789.12
************************************************************/