This page shows how to operate Hive on MR3 in a Kerberos-enabled secure Cloudera CDH cluster using a MySQL database for Metastore. The same user, not necessarily an administrator of the Hadoop cluster, will run both Metastore and HiveServer2. By following these instructions, the user will learn:
- how to install and configure Hive on MR3 in a Kerberos-enabled secure CDH cluster
- how to start and stop Metastore with a MySQL-compatible database
- how to start and stop HiveServer2
- how to create Beeline connections and send queries to HiveServer2
This scenario has the following prerequisites:
- A secure CDH cluster is available.
- The user has access to the home directory and /tmp directory on HDFS.
- The user has access to the MySQL server with all privileges.
- The user has a service keytab file for securing Metastore and HiveServer2.
- The user has a keytab file for renewing HDFS tokens.
This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release. This page has been tested with MR3 release 1.0 on CDH 5.15.2.
Accessing MySQL
In our example, we assume that a MySQL user hive (on node blue0) has access to a MySQL database with all privileges.
Later we will configure Metastore so that it connects to the MySQL server using the user name hive.
The MySQL server may run on any node, not necessarily on the node where Metastore or HiveServer2 will be running.
In our example, it runs on the same node.
$ mysql -h blue0 -u hive -p
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 719640
Server version: 5.5.60-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> SHOW GRANTS FOR 'hive';
+-----------------------------------------------------------------------------------------------------+
| Grants for hive@% |
+-----------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'hive'@'%' IDENTIFIED BY PASSWORD '*2470C0C06DEE42FD1618BB99005ADCA2EC9Dxxxx' |
| GRANT ALL PRIVILEGES ON `hive5mr3`.* TO 'hive'@'%' |
| GRANT ALL PRIVILEGES ON `metastore`.* TO 'hive'@'%' |
| GRANT ALL PRIVILEGES ON `hive4mr3`.* TO 'hive'@'%' |
+-----------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
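If no such account exists yet, an administrator can create the user and the Metastore database with statements like the following. This is a hedged sketch: the database name hive5mr3 matches what we use later in env.sh, and 'password' is a placeholder to be replaced.
MariaDB [(none)]> CREATE DATABASE hive5mr3;
MariaDB [(none)]> CREATE USER 'hive'@'%' IDENTIFIED BY 'password';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON hive5mr3.* TO 'hive'@'%';
MariaDB [(none)]> FLUSH PRIVILEGES;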
Checking the keytab files
In our example, we will run both Metastore and HiveServer2 as user hive.
We assume that a service keytab file hive.service.keytab has already been created by the administrator and placed under the directory /etc/security/keytabs.
On CDH, the user (as a root user) can find such a service keytab file under the directory /var/run/cloudera-scm-agent/process/.
$ find /var/run/cloudera-scm-agent/process/ -not -empty | grep keytab
/var/run/cloudera-scm-agent/process/2053-impala-CATALOGSERVER/impala.keytab
/var/run/cloudera-scm-agent/process/2051-impala-STATESTORE/impala.keytab
/var/run/cloudera-scm-agent/process/2043-hive-HIVESERVER2/hive.keytab
/var/run/cloudera-scm-agent/process/2045-hive-WEBHCAT/HTTP.keytab
/var/run/cloudera-scm-agent/process/2044-hive-HIVEMETASTORE/hive.keytab
/var/run/cloudera-scm-agent/process/2034-yarn-RESOURCEMANAGER/yarn.keytab
/var/run/cloudera-scm-agent/process/2040-yarn-JOBHISTORY/mapred.keytab
/var/run/cloudera-scm-agent/process/2028-hdfs-NAMENODE-nnRpcWait/hdfs.keytab
/var/run/cloudera-scm-agent/process/2015-hdfs-NFSGATEWAY/hdfs.keytab
/var/run/cloudera-scm-agent/process/2013-hdfs-HTTPFS/httpfs.keytab
/var/run/cloudera-scm-agent/process/2012-hdfs-SECONDARYNAMENODE/hdfs.keytab
/var/run/cloudera-scm-agent/process/2017-hdfs-NAMENODE/hdfs.keytab
/var/run/cloudera-scm-agent/process/2008-cloudera-mgmt-ACTIVITYMONITOR/cmon.keytab
/var/run/cloudera-scm-agent/process/2010-cloudera-mgmt-SERVICEMONITOR/cmon.keytab
In our example, we use 2044-hive-HIVEMETASTORE/hive.keytab.
$ mkdir -p /etc/security/keytabs/
$ cp /var/run/cloudera-scm-agent/process/2044-hive-HIVEMETASTORE/hive.keytab /etc/security/keytabs/hive.service.keytab
$ chmod 600 /etc/security/keytabs/hive.service.keytab
$ chown hive /etc/security/keytabs/hive.service.keytab
Log in as user hive.
Check the service principal associated with the service keytab file.
In our example, we use the service principal hive/blue0@PL where PL is the Kerberos realm.
Note that the service keytab file has permission 600 and is accessible only to user hive.
$ ls -alt /etc/security/keytabs/hive.service.keytab
-rw------- 1 hive hadoop 130 Feb 7 15:15 /etc/security/keytabs/hive.service.keytab
$ klist -kt /etc/security/keytabs/hive.service.keytab
Keytab name: FILE:/etc/security/keytabs/hive.service.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
2 02/06/2020 18:51:43 hive/blue0@PL
2 02/06/2020 18:51:43 hive/blue0@PL
Acquire a Kerberos ticket from the service keytab file.
$ kinit -k -t /etc/security/keytabs/hive.service.keytab hive/blue0@PL
$ klist
Ticket cache: FILE:/tmp/krb5cc_375
Default principal: hive/blue0@PL
Valid starting Expires Service principal
02/07/2020 15:19:40 02/08/2020 15:19:40 krbtgt/PL@PL
We also assume that an ordinary keytab file hive.keytab has been created for user hive and placed under the home directory.
Check the principal associated with the keytab file.
$ pwd
/var/lib/hive
$ ls -alt hive.keytab
-rw------- 1 hive hive 132 Feb 6 23:39 hive.keytab
$ klist -k -t hive.keytab
Keytab name: FILE:hive.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
1 02/06/2020 22:32:53 hive@PL
1 02/06/2020 23:31:46 hive@PL
Installation
Log in as user hive.
Download the pre-built MR3 release based on Hive 3.1.2 (corresponding to the --hivesrc3 option to be used later) and built specifically for Cloudera CDH (ending with -cdh.tar.gz), and uncompress it.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/hivemr3-1.0-hive3.1.2-cdh.tar.gz
$ gunzip -c hivemr3-1.0-hive3.1.2-cdh.tar.gz | tar xvf -
Rename the new directory to mr3-run and change the working directory.
Renaming the new directory is not strictly necessary, but it is recommended because the sample configuration file hive-site.xml included in the MR3 release uses the same directory name.
$ mv hivemr3-1.0-hive3.1.2-cdh/ mr3-run
$ cd mr3-run/
Configuring Hive on MR3
Open env.sh and set JAVA_HOME and PATH if necessary.
Set HADOOP_HOME to the Hadoop installation directory.
$ vi env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_202-amd64
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/lib/hadoop
Set TEZ_USE_MINIMAL to true so that Hive on MR3 uses the jar files provided by Cloudera CDH.
TEZ_USE_MINIMAL=true
Set HIVE_MYSQL_DRIVER to point to the MySQL connector jar file.
HIVE_MYSQL_DRIVER=/usr/share/java/mysql-connector-java.jar
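If the connector jar is missing at this path, it can usually be installed with the OS package manager. The following is a hedged example for a RHEL/CentOS node; the package name and resulting path may differ on other distributions.
$ sudo yum install -y mysql-connector-java
$ ls /usr/share/java/mysql-connector-java.jar   # verify the path set above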
Set the following environment variables to adjust the memory size (in MB) to be allocated to each component:
- HIVE_METASTORE_HEAPSIZE specifies the memory size for Metastore.
- HIVE_SERVER2_HEAPSIZE specifies the memory size for HiveServer2.
- HIVE_CLIENT_HEAPSIZE specifies the memory size for HiveCLI (hive command) and Beeline (beeline command).
- MR3_AM_HEAPSIZE specifies the memory size of MR3 DAGAppMaster.
In our example, we use the default values.
HIVE_METASTORE_HEAPSIZE=4096
HIVE_SERVER2_HEAPSIZE=16384
HIVE_CLIENT_HEAPSIZE=1024
MR3_AM_HEAPSIZE=10240
HIVE3_DATABASE_HOST specifies the node where the MySQL server runs, and HIVE3_METASTORE_HOST specifies the node where Metastore is to run.
In our example, both run on the same node.
HIVE3_DATABASE_HOST=$HOSTNAME
HIVE3_METASTORE_HOST=$HOSTNAME
HIVE3_DATABASE_NAME specifies the database name to be used for Metastore inside MySQL.
In our example, we use hive5mr3.
HIVE3_DATABASE_NAME=hive5mr3
HIVE3_HDFS_WAREHOUSE specifies the warehouse directory on HDFS.
HIVE3_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
Enable security by setting SECURE_MODE to true and HIVE_SERVER2_AUTHENTICATION to KERBEROS.
Set the following three Kerberos principals and their corresponding keytab files.
- USER_PRINCIPAL specifies the principal to use when renewing HDFS and Hive tokens.
- HIVE_METASTORE_KERBEROS_PRINCIPAL specifies the service principal for Metastore.
- HIVE_SERVER2_KERBEROS_PRINCIPAL specifies the service principal for HiveServer2.
In our example, we use hive.service.keytab for Metastore and HiveServer2, and hive.keytab for renewing HDFS and Hive tokens.
SECURE_MODE=true
USER_PRINCIPAL=hive@PL
USER_KEYTAB=/var/lib/hive/hive.keytab
HIVE_METASTORE_KERBEROS_PRINCIPAL=hive/blue0@PL
HIVE_METASTORE_KERBEROS_KEYTAB=/etc/security/keytabs/hive.service.keytab
HIVE_SERVER2_AUTHENTICATION=KERBEROS
HIVE_SERVER2_KERBEROS_PRINCIPAL=hive/blue0@PL
HIVE_SERVER2_KERBEROS_KEYTAB=/etc/security/keytabs/hive.service.keytab
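Before proceeding, it is a good idea to verify that each configured principal actually exists in its keytab file, reusing the klist command shown earlier:
$ klist -kt /etc/security/keytabs/hive.service.keytab | grep 'hive/blue0'
$ klist -kt /var/lib/hive/hive.keytab | grep 'hive@PL'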
Open conf/tpcds/hive3/hive-site.xml and set the following four configuration keys according to the current user name and the working directory.
In our case, the default home directory of user hive on Cloudera CDH is /var/lib/hive, not /home/hive.
Since Cloudera CDH does not include hadoop-mapreduce-client-core.jar in its default jar files, we manually append it to the configuration key hive.aux.jars.path.
Otherwise MR3 fails to create ContainerWorkers because classes such as org.apache.hadoop.mapred.JobConf are unavailable.
$ vi conf/tpcds/hive3/hive-site.xml
<property>
<name>hive.users.in.admin.role</name>
<value>hive</value>
</property>
<property>
<name>hive.aux.jars.path</name>
<value>/var/lib/hive/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-common-3.1.2.jar,/var/lib/hive/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-server-3.1.2.jar,/var/lib/hive/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-tez-3.1.2.jar,/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop-mapreduce/hadoop-mapreduce-client-core.jar</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/hive/operation_logs</value>
</property>
The following configuration keys specify the resources to be allocated to a Map Task, a Reduce Task, or a ContainerWorker. By default, we allocate 4GB of memory and a single core to each.
<property>
<name>hive.mr3.map.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.map.task.vcores</name>
<value>1</value>
</property>
<property>
<name>hive.mr3.reduce.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.reduce.task.vcores</name>
<value>1</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.vcores</name>
<value>1</value>
</property>
When updating these configuration keys, we should meet the following requirements:
- hive.mr3.map.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
- hive.mr3.map.task.vcores <= hive.mr3.all-in-one.containergroup.vcores
- hive.mr3.reduce.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
- hive.mr3.reduce.task.vcores <= hive.mr3.all-in-one.containergroup.vcores
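For example, to let a single larger ContainerWorker host several 4GB Tasks concurrently, one could enlarge only the ContainerGroup resources while keeping the per-Task settings above. This is a sketch; the exact sizes should be chosen to fit the Yarn containers available in your cluster.
<property>
  <!-- a 16GB/4-core ContainerWorker can accommodate up to four 4GB/1-core Tasks -->
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>16384</value>
</property>
<property>
  <name>hive.mr3.all-in-one.containergroup.vcores</name>
  <value>4</value>
</property>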
We disable impersonation by setting hive.server2.enable.doAs to false.
With impersonation disabled, all Hive operations are executed by user hive.
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
Set javax.jdo.option.ConnectionUserName to the MySQL user name and javax.jdo.option.ConnectionPassword to the password.
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
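If hive-site.xml in your copy of the release does not already derive the JDBC connection URL from HIVE3_DATABASE_HOST and HIVE3_DATABASE_NAME in env.sh, the standard key javax.jdo.option.ConnectionURL can be set explicitly. Below is a hypothetical example matching our node and database name:
<property>
  <!-- hypothetical example; adjust the host, port, and database name -->
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://blue0:3306/hive5mr3?createDatabaseIfNotExist=true</value>
</property>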
Checking and creating directories on HDFS
Check the warehouse directory specified in env.sh.
$ grep HIVE3_HDFS_WAREHOUSE env.sh
HIVE3_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
$ kinit -k -t ~/hive.keytab hive@PL
$ export JAVA_HOME=/usr/java/jdk1.8.0_202-amd64
$ hdfs dfs -ls /tmp/hivemr3/warehouse
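If the warehouse directory does not exist yet, create it (user hive already holds a valid Kerberos ticket at this point):
$ hdfs dfs -mkdir -p /tmp/hivemr3/warehouse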
Note that the user should use Java 1.8 update 161 or later to enable the unlimited cryptography policy. A previous Java version may produce the following error even if a valid Kerberos ticket has been acquired:
$ hdfs dfs -ls /tmp/hivemr3/warehouse
20/02/07 06:46:37 WARN security.UserGroupInformation: PriviledgedActionException as:hive (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
20/02/07 06:46:37 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
20/02/07 06:46:37 WARN security.UserGroupInformation: PriviledgedActionException as:hive (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ls: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "blue0/192.168.10.101"; destination host is: "blue0":8020;
Create a directory for storing MR3 and Tez jar files.
$ hdfs dfs -mkdir -p /user/hive/lib
Upload the MR3 jar files.
$ mr3/upload-hdfslib-mr3.sh
Upload the Tez jar files.
$ tez/upload-hdfslib-tez.sh
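To confirm that the uploads succeeded, list the directory on HDFS (the exact file names depend on the release):
$ hdfs dfs -ls /user/hive/lib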
Check if /tmp/hive exists on HDFS.
If the directory already exists, make sure that its permission is set to 733.
HiveServer2 automatically creates a new directory with permission 733 if it does not exist.
$ hdfs dfs -ls /tmp/hive
ls: `/tmp/hive': No such file or directory
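If /tmp/hive already exists with a different permission, it can be adjusted as follows:
$ hdfs dfs -chmod 733 /tmp/hive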
Make sure that /tmp/hive/hive does not exist on HDFS.
$ hdfs dfs -ls /tmp/hive/hive
ls: `/tmp/hive/hive': No such file or directory
Creating temporary directories
Create the directory specified by hive.server2.logging.operation.log.location.
$ ls -alt /tmp/hive/operation_logs
ls: cannot access /tmp/hive/operation_logs: No such file or directory
$ mkdir -p /tmp/hive/operation_logs
Running Metastore
Metastore uses the port specified by the environment variable HIVE3_METASTORE_PORT in env.sh.
Make sure that the port is not in use.
$ grep HIVE3_METASTORE_PORT env.sh
HIVE3_METASTORE_PORT=9830
$ ss -anpt | grep -E "LISTEN.*:9830"
$
Run Metastore with a MySQL database using the --tpcds option.
If necessary, initialize the schema using the --init-schema option.
$ hive/metastore-service.sh start --tpcds --hivesrc3 --init-schema
After a while, check if Metastore has successfully started.
$ cat /var/lib/hive/mr3-run/hive/metastore-service-result/hive-mr3--2020-02-07-15-55-20-41eac7ad/out-metastore.txt | grep -e Metastore
2020-02-07 15:55:24: Starting Hive Metastore Server
The log file for Metastore is found under /tmp/hive.
$ ls /tmp/hive/hive.log
/tmp/hive/hive.log
Running HiveServer2
Run HiveServer2 using the --tpcds option.
Do not use LocalProcess mode for MR3 DAGAppMaster (with the --amprocess option), which is not supported on a secure Hadoop cluster.
$ hive/hiveserver2-service.sh start --tpcds --hivesrc3
After a while, check if HiveServer2 has successfully started by inspecting its log file.
$ grep -e "New MR3Session created" /var/lib/hive/mr3-run/hive/hiveserver2-service-result/hive-mr3--2020-02-07-15-56-46-c4d3421d/hive-logs/hive.log
2020-02-07T06:57:06,743 INFO [main] session.MR3SessionManagerImpl: New MR3Session created: cf78f8c0-0970-482c-a51c-65073fd2a32e, hive
The user can find a new Yarn application of type mr3 submitted by user hive.
$ yarn application -list
20/02/07 06:58:29 INFO client.RMProxy: Connecting to ResourceManager at blue0/192.168.10.101:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1578304376790_0014 cf78f8c0-0970-482c-a51c-65073fd2a32e mr3 hive root.users.hive RUNNING UNDEFINED 0% N/A
Create a new directory hive/run-beeline-result and set its permission to 777 so that any user can execute Beeline using the same MR3 installation.
$ mkdir -p hive/run-beeline-result
$ chmod 777 hive/run-beeline-result
Download a sample dataset.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv
Running queries
In our example, we use another user gitlab-runner to send queries to HiveServer2.
Log in as user gitlab-runner and change the working directory.
$ cd /var/lib/hive/mr3-run
Make sure that user gitlab-runner cannot read the keytab files.
$ cat /etc/security/keytabs/hive.service.keytab
cat: /etc/security/keytabs/hive.service.keytab: Permission denied
$ cat /var/lib/hive/hive.keytab
cat: /var/lib/hive/hive.keytab: Permission denied
Acquire a Kerberos ticket for user gitlab-runner (either by executing kinit or by reading a Kerberos keytab file).
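For example, assuming password-based authentication, user gitlab-runner can simply run kinit and enter the password:
$ kinit gitlab-runner@PL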
$ klist
Ticket cache: FILE:/tmp/krb5cc_1100
Default principal: gitlab-runner@PL
Valid starting Expires Service principal
02/07/2020 16:01:34 02/08/2020 16:01:34 krbtgt/PL@PL
Run Beeline using the --tpcds option.
$ hive/run-beeline.sh --tpcds --hivesrc3
Create a table called pokemon.
0: jdbc:hive2://blue0:9832/> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");
Import the sample dataset.
0: jdbc:hive2://blue0:9832/> load data local inpath './pokemon.csv' INTO table pokemon;
Execute queries.
0: jdbc:hive2://blue0:9832/> select avg(HP) from pokemon;
...
0: jdbc:hive2://blue0:9832/> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...
0: jdbc:hive2://blue0:9832/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...
Exit Beeline.
The warehouse directory on HDFS now has two sub-directories corresponding to the two Hive tables created above.
Note that the sub-directories are owned by user hive, not user gitlab-runner, because impersonation is disabled.
$ export JAVA_HOME=/usr/java/jdk1.8.0_202-amd64
$ hdfs dfs -ls /tmp/hivemr3/warehouse | grep pokemon
drwxr-xr-x - hive hive 0 2020-02-07 07:03 /tmp/hivemr3/warehouse/pokemon
drwxr-xr-x - hive hive 0 2020-02-07 07:04 /tmp/hivemr3/warehouse/pokemon1
Stopping HiveServer2 and Metastore
Stop HiveServer2 as user hive.
$ hive/hiveserver2-service.sh stop --tpcds --hivesrc3
Stop Metastore as user hive.
$ hive/metastore-service.sh stop --tpcds --hivesrc3
The user can check if Metastore has successfully stopped by reading its log file.
$ tail -n3 /tmp/hive/hive.log
/************************************************************
SHUTDOWN_MSG: Shutting down HiveMetaStore at blue0/192.168.10.101
************************************************************/
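As a final check, one can verify that the Yarn application for the MR3 DAGAppMaster is no longer listed:
$ yarn application -list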