On Non-secure Hadoop Using MySQL |

This page shows how to operate Hive on MR3 in a non-secure Hadoop cluster using a MySQL database for Metastore. The same user, not necessarily an administrator of the Hadoop cluster, will run both Metastore and HiveServer2. By following the instruction, the user will learn:

how to install and configure Hive on MR3 in a non-secure Hadoop cluster without Kerberos
how to start and stop Metastore with a MySQL compatible database
how to start and stop HiveServer2 with impersonation
how to create Beeline connections and send queries to HiveServer2

This scenario has the following prerequisites:

Java 8 or Java 17 is available. Java 17 should be installed in the same directory on every node.
A non-secure Hadoop cluster 3 or higher is available.
The user has access to the home directory and /tmp directory on HDFS.
The administrator user can update core-site.xml and restart Yarn.
The user has access to the MySQL server with all privileges.

This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release.

For asking any questions, please visit MR3 Google Group or join MR3 Slack.

Configuring Yarn

In order to support impersonation in HiveServer2, Yarn should be configured to allow the user starting Metastore and HiveServer2 to impersonate. In our example, user hive runs both Metastore and HiveServer, and the administrator user should add two configuration settings to core-site.xml and restart Yarn.

$ vi /etc/hadoop/conf/core-site.xml

<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>

hive in hadoop.proxyuser.hive.groups and hadoop.proxyuser.hive.hosts denotes the user starting Metastore and HiveServer2. Thus hadoop.proxyuser.hive.groups specifies the list of groups whose members can be impersonated by user hive, and hadoop.proxyuser.hive.hosts specifies the list of nodes where user hive can impersonate. For simplicity, we allow user hive to impersonate any user on any node.

Accessing MySQL

In our example, we assume that a MySQL user root has access to a MySQL database with all privileges. Later we will configure Metastore so that it connects to the MySQL server using the user name root. The MySQL server may run on any node, not necessarily on the node where Metastore or HiveServer2 will be running.

$ mysql -u root -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 560632
Server version: 5.5.60-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW GRANTS FOR 'root'@'%';
+--------------------------------------------------------------------------------------------------------------------------------+
| Grants for root@%                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY PASSWORD '*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19' WITH GRANT OPTION |
+--------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Installation

Login as user hive. Download a pre-built MR3 release and uncompress it. We rename the new directory to mr3-run and change the working directory. Renaming the new directory is not strictly necessary, but it is recommended because the sample configuration file hive-site.xml included in the MR3 release uses the same directory name.

Hive 3 on MR3, Java 8

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.11-hive3.1.3-k8s.tar.gz| tar xvf -;
$ mv hivemr3-1.11-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

Hive 3 on MR3, Java 17

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-java17-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.11-java17-hive3.1.3-k8s.tar.gz| tar xvf -;
$ mv hivemr3-1.11-java17-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

Hive 4 on MR3, Java 17

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-java17-hive4.0.0-k8s.tar.gz
$ gunzip -c hivemr3-1.11-java17-hive4.0.0-k8s.tar.gz| tar xvf -;
$ mv hivemr3-1.11-java17-hive4.0.0-k8s/ mr3-run
$ cd mr3-run/

A pre-built MR3 release is intended for evaluating Hive on MR3 with a limit on the aggregate capacity of MR3 workers. For running Hive on MR3 in production, contact us.

Configuring Java and Hadoop

Open env.sh and set JAVA_HOME and PATH if necessary. Set HADOOP_HOME to the Hadoop installation directory.

Java 8

$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk1.8  # Java 8
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/hdp/3.1.4.0-315/hadoop

USE_JAVA_17=false

Java 17

$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk17   # Java 17
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/hdp/3.1.4.0-315/hadoop

USE_JAVA_17=true

For Java 8 only

Update the configuration keys mr3.am.launch.cmd-opts and mr3.container.launch.cmd-opts in conf/tpcds/mr3/mr3-site.xml.

add -XX:+AggressiveOpts for performance.
remove --add-opens java.base/java.net=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.time=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED ... (which are Java 17 options).

Update the configuration keys mr3.am.launch.env and mr3.container.launch.env in conf/tpcds/mr3/mr3-site.xml.

remove JAVA_HOME=/home/hive/jdk17/.

For Java 17 only

Update the configuration keys mr3.am.launch.env and mr3.container.launch.env in conf/tpcds/mr3/mr3-site.xml.

set JAVA_HOME=/home/hive/jdk17/ to point to the installation directory of Java 17 on every worker node.

In order to execute Metastore and HiveServer2 with Java 17, JAVA_HOME in hadoop-env.sh in the Hadoop configuration directory should also be set to point to the installation directory of Java 17.

$ vi /etc/hadoop/conf/hadoop-env.sh

JAVA_HOME=/home/hive/jdk17/

Configuring Hive on MR3

Open env.sh and set HIVE_MYSQL_DRIVER to point to the MySQL connector jar file.

$ vi env.sh

HIVE_MYSQL_DRIVER=/usr/share/java/mysql-connector-java.jar

Set the following environment variables to adjust the memory size (in MB) to be allocated to each component:

HIVE_METASTORE_HEAPSIZE specifies the memory size for Metastore.
HIVE_SERVER2_HEAPSIZE specifies the memory size for HiveServer2.
HIVE_CLIENT_HEAPSIZE specifies the memory size of HiveCLI (hive command) and Beeline (beeline command).
MR3_AM_HEAPSIZE specifies the memory size of MR3 DAGAppMaster.

In our example, we use the default values.

$ vi env.sh

HIVE_METASTORE_HEAPSIZE=4096
HIVE_SERVER2_HEAPSIZE=16384
HIVE_CLIENT_HEAPSIZE=1024
MR3_AM_HEAPSIZE=10240

HIVE?_DATABASE_HOST specifies the node where the MySQL server runs, and HIVE?_METASTORE_HOST specifies the node where Metastore is to run.
HIVE?_DATABASE_NAME specifies the database name to be created for Metastore inside MySQL. In our example, we use hive3mr3test/hive4mr3test.
HIVE?_HDFS_WAREHOUSE specifies the warehouse directory on HDFS. Update it to use the current user name.

Hive 3 on MR3

$ vi env.sh

HIVE3_DATABASE_HOST=blue0
HIVE3_METASTORE_HOST=blue0

HIVE3_DATABASE_NAME=hive3mr3test

HIVE3_HDFS_WAREHOUSE=/user/hive/warehouse

Hive 4 on MR3

$ vi env.sh

HIVE4_DATABASE_HOST=blue0
HIVE4_METASTORE_HOST=blue0

HIVE4_DATABASE_NAME=hive4mr3test

HIVE4_HDFS_WAREHOUSE=/user/hive/warehouse

Open conf/tpcds/hive3/hive-site.xml (for Hive 3 on MR3) or conf/tpcds/hive4/hive-site.xml (for Hive 4 on MR3) and set the following four configuration keys according to the current user name and the working directory.

Hive 3 on MR3

$ vi conf/tpcds/hive3/hive-site.xml 

<property>
  <name>hive.users.in.admin.role</name>
  <value>hive</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>/home/hive/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-common-3.1.3.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-server-3.1.3.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-tez-3.1.3.jar</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>
</property>

<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/hive/operation_logs</value>
</property>

Hive 4 on MR3

$ vi conf/tpcds/hive4/hive-site.xml 

<property>
  <name>hive.users.in.admin.role</name>
  <value>hive</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>/home/hive/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/hive-llap-common-4.0.0.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/hive-llap-server-4.0.0.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/hive-llap-tez-4.0.0.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/hive-iceberg-handler-4.0.0.jar,/home/hive/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/log4j-1.2-api-2.18.0.jar</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>
</property>

<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/hive/operation_logs</value>
</property>

In hive-site.xml, the following configuration keys specify the resource to be allocated to a Map Task, a Reduce Task, or a ContainerWorker. By default, we allocate 4GB and a single core to a Map Task and a Reduce Task. A single ContainerWorker uses 40GB of memory and 10 cores, so it can accommodate 10 concurrent Tasks.

<property>
  <name>hive.mr3.map.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.map.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.reduce.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.reduce.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>40960</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.vcores</name>
  <value>10</value>
</property>

When updating these configuration keys, we should meet the following requirements:

hive.mr3.map.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
hive.mr3.map.task.vcores <= hive.mr3.all-in-one.containergroup.vcores
hive.mr3.reduce.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
hive.mr3.reduce.task.vcores <= hive.mr3.all-in-one.containergroup.vcores

We enable impersonation by setting hive.server2.enable.doAs to true.

<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>

In hive-site.xml, set javax.jdo.option.ConnectionUserName to the MySQL user name and javax.jdo.option.ConnectionPassword to the password.

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>

Creating directories on HDFS

Create the warehouse directory specified in env.sh, e.g.:

$ grep HIVE3_HDFS_WAREHOUSE env.sh
HIVE3_HDFS_WAREHOUSE=/user/hive/warehouse
$ hdfs dfs -mkdir -p /user/hive/warehouse

Create a directory for storing MR3 and Tez jar files.

$ hdfs dfs -mkdir -p /user/hive/lib

Load MR3 jar files.

$ mr3/upload-hdfslib-mr3.sh

Load Tez jar files.

$ tez/upload-hdfslib-tez.sh

Check if /tmp/hive exists on HDFS. If the directory already exists, make sure that its permission is set to 733. HiveServer2 automatically creates a new directory with permission 733 if it does not exist.

$ hdfs dfs -ls /tmp/ | grep hive
drwx-wx-wx   - hive          hdfs          0 2021-04-16 06:55 /tmp/hive

Make sure that /tmp/hive/hive does not exist on HDFS.

$ hdfs dfs -ls /tmp/hive/hive
ls: `/tmp/hive/hive': No such file or directory

Creating temporary directories

Create a new directory specified by hive.server2.logging.operation.log.location.

$ ls -alt /tmp/hive/operation_logs
ls: cannot access /tmp/hive/operation_logs: No such file or directory
$ mkdir -p /tmp/hive/operation_logs

Running Metastore

Metastore uses the port specified by the environment variable HIVE?_METASTORE_PORT in env.sh. Make sure that the port is not in use.

Hive 3 on MR3

$ grep HIVE3_METASTORE_PORT env.sh
HIVE3_METASTORE_PORT=9830
$ ss -anpt | grep -E "LISTEN.*:9830"
$

Hive 4 on MR3

$ grep HIVE4_METASTORE_PORT env.sh
HIVE4_METASTORE_PORT=9840
$ ss -anpt | grep -E "LISTEN.*:9840"
$

Run Metastore with a MySQL database using --tpcds option, and initialize it using --init-schema option.

$ hive/metastore-service.sh start --tpcds --init-schema

After a while, check if Metastore has successfully started.

$ cat /data2/hive/mr3-run/hive/metastore-service-result/hive-mr3--2022-12-14-21-03-41-57f00af0/out-metastore.txt | grep -e Metastore -e Init
Metastore connection URL:	 jdbc:mysql://blue0/hive3mr3test?createDatabaseIfNotExist=true
Metastore Connection Driver :	 com.mysql.jdbc.Driver
Metastore connection User:	 root
Initialization script hive-schema-3.1.0.mysql.sql
Initialization script completed
2022-12-14 21:03:49: Starting Hive Metastore Server

When the user restarts Metastore, do not use --init-schema option in order to reuse existing Hive databases.

The log file for Metastore is found under /tmp/hive.

$ ls /tmp/hive/hive.log
/tmp/hive/hive.log

Running HiveServer2

Run HiveServer2 using --tpcds option. In order to use LocalProcess mode for MR3 DAGAppMaster, use --amprocess option. In our example, we use --amprocess option.

$ hive/hiveserver2-service.sh start --tpcds --amprocess

After a while, check if HiveServer2 has successfully started by inspecting its log file.

bash-4.2$ grep -e "New MR3Session created" /home/hive/mr3-run/hive/hiveserver2-service-result/hive-mr3--2022-12-14-21-04-53-529c7cc6/hive-logs/hive.log 
2022-12-14T12:05:12,025  INFO [main] session.MR3SessionManagerImpl: New MR3Session created: f9af6e8e-a87e-4732-865c-4cc43efc9a3e, hive

The user can find a new Yarn application of type mr3 submitted by user hive.

$ yarn application -list
...
application_1660836356025_0545	f9af6e8e-a87e-4732-865c-4cc43efc9a3e                 mr3	      hive	   default	           RUNNING	         UNDEFINED	             0%	                                N/A

The user can also find the process for MR3 DAGAppMaster.

$ ps -ef | grep DAGAppMaster | grep mr3
hive     154414 153969 25 21:05 pts/5    00:00:10 /usr/jdk64/jdk1.8.0_112/jre/bin/java -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -Xmx8192m -Xms4096m -Dlog4j.configurationFile=mr3-container-log4j2.properties -Dmr3.root.logger=INFO -Dyarn.app.container.log.dir=/data2/hive/mr3-run/hive/hiveserver2-service-result/hive-mr3--2022-12-14-21-04-53-529c7cc6/application_1660836356025_0545 -Dsun.nio.ch.bugLevel='' com.datamonad.mr3.master.DAGAppMaster --session

The user can find the log file for MR3 DAGAppMaster.

bash-4.2$ ls /home/hive/mr3-run/hive/hiveserver2-service-result/hive-mr3--2022-12-14-21-04-53-529c7cc6/application_1660836356025_0545/run.log 
/home/hive/mr3-run/hive/hiveserver2-service-result/hive-mr3--2022-12-14-21-04-53-529c7cc6/application_1660836356025_0545/run.log

Running queries

Download a sample dataset.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv

Run Beeline using --tpcds option.

$ hive/run-beeline.sh --tpcds

Use the default database.

0: jdbc:hive2://blue0:9832/> use default;

Create a table called pokemon.

0: jdbc:hive2://blue0:9832/> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");

Import the sample dataset.

0: jdbc:hive2://blue0:9832/> load data local inpath './pokemon.csv' INTO table pokemon;

Execute queries.

0: jdbc:hive2://blue0:9832/> select avg(HP) from pokemon;
...

0: jdbc:hive2://blue0:9832/> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...

0: jdbc:hive2://blue0:9832/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...

Exit Beeline. The warehouse directory on HDFS has now two sub-directories.

$ hdfs dfs -ls /user/hive/warehouse/
Found 2 items
drwxr-xr-x   - hive hdfs          0 2022-12-14 21:08 /user/hive/warehouse/pokemon
drwxr-xr-x   - hive hdfs          0 2022-12-14 21:08 /user/hive/warehouse/pokemon1

Change the permission on hive/run-beeline-result to 777 so that any user can run Beeline on /home/hive/mr3-run.

$ chmod 755 /home/hive/
$ chmod 777 hive/run-beeline-result

Running queries by a different user

Change to /home/hive/mr3-run and run Beeline.

$ cd /home/hive/mr3-run
$ hive/run-beeline.sh --tpcds

Use the default database.

0: jdbc:hive2://blue0:9832/> use default;

Execute queries.

0: jdbc:hive2://blue0:9832/> select avg(HP) from pokemon;
...

0: jdbc:hive2://blue0:9832/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...

HiveServer2 creates a scratch directory /tmp/hive/gla on HDFS for user gla

bash-4.2$ hdfs dfs -ls /tmp/hive/ | grep gla
drwx-wx-wx   - gla           hdfs          0 2022-12-14 21:12 /tmp/hive/gla

User gla, however, cannot create a new table. This is because we enable impersonation in HiveServer2 while user gla does not have write permission on the warehouse directory on HDFS.

0: jdbc:hive2://blue0:9832/> create table pokemon_new as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
Error: Error while compiling statement: FAILED: SemanticException 0:0 Error creating temporary folder on: hdfs://blue0:8020/user/hive/warehouse. Error encountered near token 'TOK_TMP_FILE' (state=42000,code=40000)

Grant write permission on the warehouse directory to everyone.

$ hdfs dfs -chmod 777 /user/hive/warehouse

Now user gla can create a new table.

0: jdbc:hive2://blue0:9832/> create table pokemon_new as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...

0: jdbc:hive2://blue0:9832/> select COUNT(name), power_rate from pokemon_new group by power_rate; 
...

The warehouse directory contains a new sub-directory creatd by user gla.

bash-4.2$ hdfs dfs -ls /user/hive/warehouse
Found 3 items
drwxr-xr-x   - hive hdfs          0 2022-12-14 21:08 /user/hive/warehouse/pokemon
drwxr-xr-x   - hive hdfs          0 2022-12-14 21:08 /user/hive/warehouse/pokemon1
drwxr-xr-x   - gla  hdfs          0 2022-12-14 21:14 /user/hive/warehouse/pokemon_new

Finally we revert the permission on the warehouse directory.

$ hdfs dfs -chmod 755 /user/hive/warehouse

Stopping HiveServer2 and Metastore

Stop HiveServer2 as user hive.

$ hive/hiveserver2-service.sh stop --tpcds

Stop Metastore as user hive.

$ hive/metastore-service.sh stop --tpcds

The user can check if Metastore has successfully stopped by reading its log file.

$ tail -n3 /tmp/hive/hive.log
/************************************************************
SHUTDOWN_MSG: Shutting down HiveMetaStore at blue0/192.168.10.101
************************************************************/