This page shows how to operate Hive on MR3 in a Kerberos-enabled secure Hadoop cluster using a MySQL database for Metastore. The same user, not necessarily an administrator of the Hadoop cluster, will run both Metastore and HiveServer2. By following the instruction, the user will learn:

  1. how to install and configure Hive on MR3 in a Kerberos-enabled secure Hadoop cluster
  2. how to start and stop Metastore with a MySQL compatible database
  3. how to start and stop HiveServer2
  4. how to create Beeline connections and send queries to HiveServer2

This scenario has the following prerequisites:

  • Java 8 or Java 17 is available. Java 17 should be installed in the same directory on every node.
  • A Kerberos-enabled secure Hadoop cluster 3 or higher is available.
  • The user has access to the home directory and /tmp directory on HDFS.
  • The user has access to the MySQL server with all privileges.
  • The user has a service keytab file for securing Metastore and HiveServer2.
  • The user has a keytab file for renewing HDFS tokens.

This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release.

For asking any questions, please visit MR3 Google Group or join MR3 Slack.

Accessing MySQL

In our example, we assume that a MySQL user hivemr3 (on node navy0) has access to a MySQL database with all privileges. Later we will configure Metastore so that it connects to the MySQL server using the user name hivemr3. The MySQL server may run on any node, not necessarily on the node where Metastore or HiveServer2 will be running. In our example, it runs on the same node.

$ mysql -h navy0 -u hivemr3 -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 154956
Server version: 5.7.21 MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SHOW GRANTS FOR 'hivemr3'@'navy0';
+--------------------------------------------------+
| Grants for hivemr3@navy0                         |
+--------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'hivemr3'@'navy0' |
+--------------------------------------------------+
1 row in set (0.00 sec)

Checking the keytab files

In our example, we will run both Metastore and HiveServer2 as user hive. We assume that a service keytab file hive.service.keytab has already been created by the administrator and placed under the directory /etc/security/keytabs.

Login as user hive. Check the service principal associated with the service keytab file. In our example, we use service principal hive/navy0@NAVY where NAVY is the Kerberos realm. Note that the service keytab file has permission 600 and is accessible only to user hive.

$ ls -alt /etc/security/keytabs/hive.service.keytab
-rw------- 1 hive hadoop 672 Jan 26  2020 /etc/security/keytabs/hive.service.keytab
$ klist -kt /etc/security/keytabs/hive.service.keytab
Keytab name: FILE:/etc/security/keytabs/hive.service.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   2 01/26/2020 14:19:51 hive@NAVY
   2 01/26/2020 14:19:51 hive@NAVY
   2 01/26/2020 14:19:51 hive@NAVY
   2 01/26/2020 14:19:51 hive@NAVY
   2 01/26/2020 14:19:51 hive@NAVY
   2 01/26/2020 14:19:51 hive@NAVY
   2 01/26/2020 14:19:51 hive/navy0@NAVY
   2 01/26/2020 14:19:51 hive/navy0@NAVY
   2 01/26/2020 14:19:51 hive/navy0@NAVY
   2 01/26/2020 14:19:51 hive/navy0@NAVY
   2 01/26/2020 14:19:51 hive/navy0@NAVY
   2 01/26/2020 14:19:51 hive/navy0@NAVY

Acquire a Kerberos ticket from the service keytab file.

$ kinit -k -t /etc/security/keytabs/hive.service.keytab hive/navy0@NAVY
$ klist
Ticket cache: FILE:/tmp/krb5cc_1005
Default principal: hive/navy0@NAVY

Valid starting       Expires              Service principal
02/08/2020 00:14:13  02/09/2020 00:14:13  krbtgt/NAVY@NAVY

We also assume that an ordinary keytab file hive.keytab has been created for user hive and placed under the home directory. If not, login as user root and create a new file as follows. First run kadmin.local and create a new principal hive@NAVY.

$ kadmin.local
Authenticating as principal root/admin@NAVY with password.
kadmin.local:  addprinc hive@NAVY
WARNING: no policy specified for hive@NAVY; defaulting to no policy
Enter password for principal "hive@NAVY": 
Re-enter password for principal "hive@NAVY": 
add_principal: Principal or policy already exists while creating "hive@NAVY".

Create a keytab file hive.keytab for user hive.

kadmin.local:  xst -k hive.keytab hive
Entry for principal hive with kvno 4, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:hive.keytab.
Entry for principal hive with kvno 4, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:hive.keytab.
Entry for principal hive with kvno 4, encryption type des3-cbc-sha1 added to keytab WRFILE:hive.keytab.
Entry for principal hive with kvno 4, encryption type arcfour-hmac added to keytab WRFILE:hive.keytab.
Entry for principal hive with kvno 4, encryption type des-hmac-sha1 added to keytab WRFILE:hive.keytab.
Entry for principal hive with kvno 4, encryption type des-cbc-md5 added to keytab WRFILE:hive.keytab.

Check the principal associated with the keytab file.

$ klist -k -t hive.keytab 
Keytab name: FILE:hive.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   4 02/08/2020 03:32:48 hive@NAVY
   4 02/08/2020 03:32:48 hive@NAVY
   4 02/08/2020 03:32:48 hive@NAVY
   4 02/08/2020 03:32:48 hive@NAVY
   4 02/08/2020 03:32:48 hive@NAVY
   4 02/08/2020 03:32:48 hive@NAVY

Copy the keytab file to the home directory.

$ cp hive.keytab /home/hive/
$ chown hive /home/hive/hive.keytab 
$ chmod 600 /home/hive/hive.keytab

Installation

Login as user hive. Download a pre-built MR3 release and uncompress it. We choose the pre-built MR3 release based on Hive 3.1.3 which corresponds to --hivesrc3 option to be used later.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.10/hivemr3-1.10-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.10-hive3.1.3-k8s.tar.gz| tar xvf -;
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.10/hivemr3-1.10-java17-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.10-java17-hive3.1.3-k8s.tar.gz| tar xvf -;

Rename the new directory to mr3-run and change the working directory. Renaming the new directory is not strictly necessary, but it is recommended because the sample configuration file hive-site.xml included in the MR3 release uses the same directory name.

$ mv hivemr3-1.10-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/
$ mv hivemr3-1.10-java17-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

Configuring Java and Hadoop

Open env.sh and set JAVA_HOME and PATH if necessary. Set HADOOP_HOME to the Hadoop installation directory.

$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk1.8  # Java 8
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/hdp/3.1.4.0-315/hadoop

USE_JAVA_17=false
$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk17   # Java 17
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/hdp/3.1.4.0-315/hadoop

USE_JAVA_17=true

For Java 8 only

Update the configuration keys mr3.am.launch.cmd-opts and mr3.container.launch.cmd-opts in conf/tpcds/mr3/mr3-site.xml.

  • add -XX:+AggressiveOpts for performance.
  • remove --add-opens java.base/java.net=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.time=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED ... (which are Java 17 options).

Update the configuration keys mr3.am.launch.env and mr3.container.launch.env in conf/tpcds/mr3/mr3-site.xml.

  • remove JAVA_HOME=/home/hive/jdk17/.

For Java 17 only

Update the configuration keys mr3.am.launch.env and mr3.container.launch.env in conf/tpcds/mr3/mr3-site.xml.

  • set JAVA_HOME=/home/hive/jdk17/ to point to the installation directory of Java 17 on every worker node.

In order to execute Metastore and HiveServer2 with Java 17, JAVA_HOME in hadoop-env.sh in the Hadoop configuration directory should also be set to point to the installation directory of Java 17.

$ vi /etc/hadoop/conf/hadoop-env.sh

JAVA_HOME=/home/hive/jdk17/

Configuring Hive on MR3

Open env.sh and set HIVE_MYSQL_DRIVER to point to the MySQL connector jar file.

$ vi env.sh

HIVE_MYSQL_DRIVER=/usr/share/java/mysql-connector-java-8.0.12.jar

Set the following environment variables to adjust the memory size (in MB) to be allocated to each component:

  • HIVE_METASTORE_HEAPSIZE specifies the memory size for Metastore.
  • HIVE_SERVER2_HEAPSIZE specifies the memory size for HiveServer2.
  • HIVE_CLIENT_HEAPSIZE specifies the memory size of HiveCLI (hive command) and Beeline (beeline command).
  • MR3_AM_HEAPSIZE specifies the memory size of MR3 DAGAppMaster.

In our example, we use the default values.

$ vi env.sh

HIVE_METASTORE_HEAPSIZE=4096
HIVE_SERVER2_HEAPSIZE=16384
HIVE_CLIENT_HEAPSIZE=1024
MR3_AM_HEAPSIZE=10240

HIVE3_DATABASE_HOST specifies the node where the MySQL server runs, and HIVE3_METASTORE_HOST specifies the node where Metastore is to run. In our example, both run on the same node.

$ vi env.sh

HIVE3_DATABASE_HOST=$HOSTNAME
HIVE3_METASTORE_HOST=$HOSTNAME

HIVE3_DATABASE_NAME specifies the database name to be created for Metastore inside MySQL. In our example, we use hive3mr3test.

$ vi env.sh

HIVE3_DATABASE_NAME=hive3mr3test

HIVE3_HDFS_WAREHOUSE specifies the warehouse directory on HDFS.

$ vi env.sh

HIVE3_HDFS_WAREHOUSE=/user/hive/warehouse

Enable security by setting SECURE_MODE to true and HIVE_SERVER2_AUTHENTICATION to KERBEROS. Set the following three Kerberos principals and their corresponding keytab files.

  • USER_PRINCIPAL specifies the principal to use when renewing HDFS and Hive tokens.
  • HIVE_METASTORE_KERBEROS_PRINCIPAL specifies the service principal for Metastore.
  • HIVE_SERVER2_KERBEROS_PRINCIPAL specifies the service principal for HiveServer2.

In our example, we use hive.service.keytab for Metastore and HiveServer2, and hive.keytab for renewing HDFS and Hive tokens.

$ vi env.sh

SECURE_MODE=true

USER_PRINCIPAL=hive@NAVY
USER_KEYTAB=/home/hive/hive.keytab

HIVE_METASTORE_KERBEROS_PRINCIPAL=hive/_HOST@NAVY
HIVE_METASTORE_KERBEROS_KEYTAB=/etc/security/keytabs/hive.service.keytab

HIVE_SERVER2_AUTHENTICATION=KERBEROS
HIVE_SERVER2_KERBEROS_PRINCIPAL=hive/_HOST@NAVY
HIVE_SERVER2_KERBEROS_KEYTAB=/etc/security/keytabs/hive.service.keytab

Open conf/tpcds/hive3/hive-site.xml and set the following four configuration keys according to the current user name and the working directory.

$ vi conf/tpcds/hive3/hive-site.xml 

<property>
  <name>hive.users.in.admin.role</name>
  <value>hive</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>/home/hive/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-common-3.1.3.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-server-3.1.3.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-tez-3.1.3.jar</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>
</property>

<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/hive/operation_logs</value>
</property>

To use a later release of MR3, adjust the path in hive.aux.jars.path correctly, e.g.:

  <name>hive.aux.jars.path</name>
  <value>/home/hive/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-common-3.1.3.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-server-3.1.3.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-tez-3.1.3.jar</value>

The following configuration keys specify the resource to be allocated to a Map Task, a Reduce Task, or a ContainerWorker. By default, we allocate 4GB and a single core to a Map Task, a Reduce Task, and a ContainerWorker.

$ vi conf/tpcds/hive3/hive-site.xml 

<property>
  <name>hive.mr3.map.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.map.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.reduce.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.reduce.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.vcores</name>
  <value>1</value>
</property>

When updating these configuration keys, we should meet the following requirements:

  • hive.mr3.map.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
  • hive.mr3.map.task.vcores <= hive.mr3.all-in-one.containergroup.vcores
  • hive.mr3.reduce.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
  • hive.mr3.reduce.task.vcores <= hive.mr3.all-in-one.containergroup.vcores

We disable impersonation by setting hive.server2.enable.doAs to false. With impersonation disabled, all Hive operations are executed by user hive.

$ vi conf/tpcds/hive3/hive-site.xml 

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>

Set javax.jdo.option.ConnectionUserName to the MySQL user name and javax.jdo.option.ConnectionPassword to the password.

$ vi conf/tpcds/hive3/hive-site.xml 

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hivemr3</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>

Creating directories on HDFS

Create the warehouse directory specified in env.sh.

$ grep HIVE3_HDFS_WAREHOUSE env.sh
HIVE3_HDFS_WAREHOUSE=/user/hive/warehouse
$ kinit
Password for hive@NAVY: 
$ hdfs dfs -mkdir -p /user/hive/warehouse

Create a directory for storing MR3 and Tez jar files.

$ hdfs dfs -mkdir -p /user/hive/lib

Load MR3 jar files.

$ mr3/upload-hdfslib-mr3.sh

Load Tez jar files.

$ tez/upload-hdfslib-tez.sh

Check if /tmp/hive exists on HDFS. If the directory already exists, make sure that its permission is set to 733. HiveServer2 automatically creates a new directory with permission 733 if it does not exist.

$ hdfs dfs -ls /tmp/hive
ls: `/tmp/hive': No such file or directory

Make sure that /tmp/hive/hive` does not exist on HDFS.

$ hdfs dfs -ls /tmp/hive/hive
ls: `/tmp/hive/hive': No such file or directory

Creating temporary directories

Create a new directory specified by hive.server2.logging.operation.log.location.

$ ls -alt /tmp/hive/operation_logs
ls: cannot access /tmp/hive/operation_logs: No such file or directory
$ mkdir -p /tmp/hive/operation_logs

Running Metastore

Metastore uses the port specified by the environment variable HIVE3_METASTORE_PORT in env.sh. Make sure that the port is not in use.

$ grep HIVE3_METASTORE_PORT env.sh
HIVE3_METASTORE_PORT=9830
$ ss -anpt | grep -E "LISTEN.*:9830"
$ 

Run Metastore with a MySQL database using --tpcds option, and initialize it using --init-schema option.

$ hive/metastore-service.sh start --tpcds --hivesrc3 --init-schema

After a while, check if Metastore has successfully started.

$ cat /home/hive/mr3-run/hive/metastore-service-result/hive-mr3--2020-02-08--23-40-28-5cdd6a5c/out-metastore.txt | grep -e Metastore -e Init
Metastore connection URL:  jdbc:mysql://navy0/hive3mr3test?createDatabaseIfNotExist=true
Metastore Connection Driver :  com.mysql.jdbc.Driver
Metastore connection User:   hivemr3
Initialization script hive-schema-3.1.0.mysql.sql
Initialization script completed
2020-02-08 23:40:37: Starting Hive Metastore Server

When the user restarts Metastore, do not use --init-schema option in order to reuse existing Hive databases.

The log file for Metastore is found under /tmp/hive.

$ ls /tmp/hive/hive.log
/tmp/hive/hive.log

Running HiveServer2

Run HiveServer2 using --tpcds option. Do not use LocalProcess mode for MR3 DAGAppMaster (with --amprocess option) which is not supported on a secure Hadoop cluster.

$ hive/hiveserver2-service.sh start --tpcds --hivesrc3 

After a while, check if HiveServer2 has successfully started by inspecting its log file.

$ grep -e "New MR3Session created" /home/hive/mr3-run/hive/hiveserver2-service-result/hive-mr3--2020-02-09--00-34-35-740609e2/hive-logs/hive.log 
2020-02-09T00:35:10,303  INFO [main] session.MR3SessionManagerImpl: New MR3Session created: 1bf11521-b9a4-476e-8129-70ddfdac55dd, hive

The user can find a new Yarn application of type mr3 submitted by user hive.

$ yarn application -list
20/02/09 00:35:31 INFO client.RMProxy: Connecting to ResourceManager at navy0/10.0.0.200:8032
Total number of applications (application-types: [], states: [SUBMITTED, ACCEPTED, RUNNING] and tags: []):1
                Application-Id      Application-Name      Application-Type        User       Queue               State         Final-State         Progress                        Tracking-URL
application_1550457902779_0048  1bf11521-b9a4-476e-8129-70ddfdac55dd                   mr3        hive     default             RUNNING           UNDEFINED               0%                                 N/A

Create a new directory hive/run-beeline-result and set its permission to 777 so that any user can execute Beeline using the same MR3 installation.

$ mkdir -p hive/run-beeline-result
$ chmod 777 hive/run-beeline-result

Download a sample dataset.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv

Running queries

In our example, we use another user gla to send queries to HiveServer2. Login as user gla and change the working directory.

$ cd /home/hive/mr3-run/

Make sure that user gla cannot read the keytab files.

$ cat /etc/security/keytabs/hive.service.keytab
cat: /etc/security/keytabs/hive.service.keytab: Permission denied
$ cat /home/hive/hive.keytab
cat: /home/hive/hive.keytab: Permission denied

Acquire a Kerberos ticket for user gla (either by executing kinit or reading a Kerberos keytab file).

$ klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: gla@NAVY

Valid starting       Expires              Service principal
02/09/2020 02:43:51  02/10/2020 02:43:51  krbtgt/NAVY@NAVY

Run Beeline using --tpcds option.

$ hive/run-beeline.sh --tpcds --hivesrc3

Use the default database.

0: jdbc:hive2://indigo1:9832/> use default;

Create a table called pokemon.

0: jdbc:hive2://indigo1:9832/> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");

Import the sample dataset.

0: jdbc:hive2://indigo1:9832/> load data local inpath './pokemon.csv' INTO table pokemon;

Execute queries.

0: jdbc:hive2://indigo1:9832/> select avg(HP) from pokemon;
...

0: jdbc:hive2://indigo1:9832/> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...

0: jdbc:hive2://indigo1:9832/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...

Exit Beeline. The warehouse directory on HDFS has now two sub-directories corresponding to the two Hive tables created above. Note that the sub-directories are owned by user hive, not user gla, because impersonation is disabled.

$ hdfs dfs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x   - hive hadoop          0 2020-02-09 03:29 /user/hive/warehouse/pokemon
drwxr-xr-x   - hive hadoop          0 2020-02-09 03:30 /user/hive/warehouse/pokemon1

Stopping HiveServer2 and Metastore

Stop HiveServer2 as user hive.

$ hive/hiveserver2-service.sh stop --tpcds --hivesrc3

Stop Metastore as user hive.

$ hive/metastore-service.sh stop --tpcds --hivesrc3

The user can check if Metastore has successfully stopped by reading its log file.

$ tail -n3 /tmp/hive/hive.log
/************************************************************
SHUTDOWN_MSG: Shutting down HiveMetaStore at navy0/10.0.0.200
************************************************************/