This page shows how to operate Hive on MR3 in a non-secure Hadoop cluster without Kerberos. The same user, not necessarily an administrator of the Hadoop cluster, will run all the components of Hive on MR3, such as Metastore, HiveServer2, and Beeline. For running Metastore, we will use a Derby database included in Hive, so the user does not need a separate database server. By following the instructions, the user will learn:

  1. how to install and configure Hive on MR3 in a non-secure Hadoop cluster without Kerberos
  2. how to start and stop Metastore with a Derby database
  3. how to start and stop HiveServer2
  4. how to create Beeline connections and send queries to HiveServer2

This scenario has the following prerequisites:

  • Java 1.8 is available.
  • A non-secure Hadoop cluster running Hadoop 3 or higher is available.
  • The user has access to the home directory and /tmp directory on HDFS.

This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release.

This page has been tested with MR3 release 1.5 on Hadoop 3.1 (HDP 3.1.4).
To ask questions, please visit the MR3 Google Group.

Installation

Download a pre-built MR3 release and uncompress it. We choose the pre-built MR3 release based on Hive 3.1.3, which corresponds to the --hivesrc3 option used later.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.5/hivemr3-1.5-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.5-hive3.1.3-k8s.tar.gz | tar xvf -

Rename the new directory to mr3-run and change the working directory. Renaming the new directory is not strictly necessary, but it is recommended because the sample configuration file hive-site.xml included in the MR3 release uses the same directory name.

$ mv hivemr3-1.5-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

Configuring Hive on MR3

Open env.sh and set JAVA_HOME and PATH if necessary. Set HADOOP_HOME to the Hadoop installation directory.

$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/hdp/3.1.4.0-315/hadoop

Set the following environment variables to adjust the memory size (in MB) to be allocated to each component:

  • HIVE_METASTORE_HEAPSIZE specifies the memory size for Metastore.
  • HIVE_SERVER2_HEAPSIZE specifies the memory size for HiveServer2.
  • HIVE_CLIENT_HEAPSIZE specifies the memory size for HiveCLI (hive command) and Beeline (beeline command).
  • MR3_AM_HEAPSIZE specifies the memory size for MR3 DAGAppMaster.

In our example, we use the following values.

$ vi env.sh

HIVE_METASTORE_HEAPSIZE=4096
HIVE_SERVER2_HEAPSIZE=16384
HIVE_CLIENT_HEAPSIZE=1024
MR3_AM_HEAPSIZE=10240

HIVE3_HDFS_WAREHOUSE specifies the warehouse directory on HDFS. Update it to use the current user name.

$ echo $USER
gla

$ vi env.sh

HIVE3_HDFS_WAREHOUSE=/user/gla/warehouse
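
Instead of editing env.sh by hand, the same update can be scripted. The following is a minimal sketch, assuming that env.sh contains a single line starting with HIVE3_HDFS_WAREHOUSE= and that the user name contains no character special to sed. The function name set_warehouse is hypothetical and not part of the MR3 release.

```shell
# set_warehouse FILE USER  (hypothetical helper, not part of MR3)
# Rewrites the HIVE3_HDFS_WAREHOUSE line of FILE to /user/USER/warehouse
# and leaves every other line unchanged.
set_warehouse() (
  tmp=$(mktemp) || exit 1
  # Write the edited copy to a temporary file, then overwrite FILE with it
  # so that FILE keeps its original permissions.
  sed "s|^HIVE3_HDFS_WAREHOUSE=.*|HIVE3_HDFS_WAREHOUSE=/user/$2/warehouse|" "$1" > "$tmp" &&
    cat "$tmp" > "$1"
  status=$?
  rm -f "$tmp"
  exit "$status"
)

# Usage from the mr3-run directory:
#   set_warehouse env.sh "$USER"
```

Running grep HIVE3_HDFS_WAREHOUSE env.sh afterwards is a quick way to confirm the change.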

Open conf/cluster/hive3/hive-site.xml and set the following four configuration keys according to the current user name (instead of the default user hive) and the working directory (instead of the default directory /home/hive).

$ vi conf/cluster/hive3/hive-site.xml 

<property>
  <name>hive.users.in.admin.role</name>
  <value>gla</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>/home/gla/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-common-3.1.3.jar,/home/gla/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-server-3.1.3.jar,/home/gla/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-tez-3.1.3.jar</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/gla</value>
</property>

<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/gla/operation_logs</value>
</property>
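
The value of hive.aux.jars.path is long and easy to mistype. As a sketch, the following helper prints the value for a given working directory, assuming the same three jar files and the same directory layout as in the example above. The function name aux_jars_path is hypothetical.

```shell
# aux_jars_path BASE_DIR  (hypothetical helper, not part of MR3)
# Prints the comma-separated value for hive.aux.jars.path, assuming the
# Hive 3.1.3 jar layout shown in the hive-site.xml example.
aux_jars_path() (
  lib="$1/hive/hivejar/apache-hive-3.1.3-bin/lib"
  echo "$lib/hive-llap-common-3.1.3.jar,$lib/hive-llap-server-3.1.3.jar,$lib/hive-llap-tez-3.1.3.jar"
)

# Usage from the mr3-run directory:
#   aux_jars_path "$PWD"
```

The printed string can be pasted into the <value> element of hive.aux.jars.path.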

The following configuration keys specify the resources to be allocated to a Map Task, a Reduce Task, or a ContainerWorker. By default, we allocate 4GB of memory and a single core to each Map Task and each Reduce Task. A single ContainerWorker uses 40GB of memory and 10 cores, so it can accommodate 10 concurrent Tasks.

$ vi conf/cluster/hive3/hive-site.xml 

<property>
  <name>hive.mr3.map.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.map.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.reduce.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.reduce.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>40960</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.vcores</name>
  <value>10</value>
</property>

When updating these configuration keys, we should meet the following requirements:

  • hive.mr3.map.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
  • hive.mr3.map.task.vcores <= hive.mr3.all-in-one.containergroup.vcores
  • hive.mr3.reduce.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
  • hive.mr3.reduce.task.vcores <= hive.mr3.all-in-one.containergroup.vcores
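
The requirements above, and the figure of 10 concurrent Tasks, can be checked with simple arithmetic. The following is a minimal sketch in which the values are passed as arguments rather than read from hive-site.xml. The function names are hypothetical, and the sketch assumes that the number of concurrent Tasks is bounded by whichever of memory and cores runs out first.

```shell
# Hypothetical helpers; the names are illustrative and not part of MR3.

# fits TASK_MB TASK_VCORES GROUP_MB GROUP_VCORES
# Succeeds only if a single Task fits inside one ContainerWorker.
fits() {
  [ "$1" -le "$3" ] && [ "$2" -le "$4" ]
}

# max_concurrent_tasks TASK_MB TASK_VCORES GROUP_MB GROUP_VCORES
# Prints how many Tasks fit at once, assuming the tighter of the
# memory limit and the core limit decides.
max_concurrent_tasks() (
  by_mem=$(( $3 / $1 ))
  by_cores=$(( $4 / $2 ))
  if [ "$by_mem" -lt "$by_cores" ]; then
    echo "$by_mem"
  else
    echo "$by_cores"
  fi
)

# Values from the example hive-site.xml: 4096MB and 1 core per Task,
# 40960MB and 10 cores per ContainerWorker.
if fits 4096 1 40960 10; then
  echo "Task fits; concurrent Tasks: $(max_concurrent_tasks 4096 1 40960 10)"
fi
```

With the example values, both limits give 40960 / 4096 = 10 and 10 / 1 = 10, which matches the 10 concurrent Tasks mentioned earlier.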

For simplicity, disable impersonation by setting hive.server2.enable.doAs to false.

$ vi conf/cluster/hive3/hive-site.xml 

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>

Creating directories on HDFS

Create the warehouse directory specified in env.sh.

$ grep HIVE3_HDFS_WAREHOUSE env.sh
HIVE3_HDFS_WAREHOUSE=/user/gla/warehouse
$ hdfs dfs -mkdir -p /user/gla/warehouse

Create a directory for storing MR3 and Tez jar files.

$ hdfs dfs -mkdir -p /user/gla/lib

Load MR3 jar files.

$ mr3/upload-hdfslib-mr3.sh

Load Tez jar files.

$ tez/upload-hdfslib-tez.sh

Make sure that /tmp/gla does NOT exist on HDFS.

$ hdfs dfs -ls /tmp/gla
ls: `/tmp/gla': No such file or directory

If the directory already exists (e.g., when running Hive on MR3 for the second time), make sure that its permission is set to 733. If the directory does not exist, HiveServer2 automatically creates it with permission 733.

$ hdfs dfs -ls /tmp/ | grep gla
drwx-wx-wx   - gla           hdfs          0 2020-02-03 00:05 /tmp/gla
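
The listing shows the permission in symbolic form (drwx-wx-wx), whereas the requirement is stated in octal (733). As a sketch, the following helper converts one form to the other. The function name perm_to_octal is hypothetical, and the sketch ignores special bits such as the sticky bit.

```shell
# perm_to_octal MODE  (hypothetical helper, not part of MR3)
# Converts a mode string such as drwx-wx-wx to its three octal digits.
# Special bits (setuid, setgid, sticky) are not handled.
perm_to_octal() (
  mode=${1#?}                 # drop the leading file-type character
  out=""
  for _ in 1 2 3; do          # owner, group, others
    triad=$(printf '%s' "$mode" | cut -c1-3)
    mode=$(printf '%s' "$mode" | cut -c4-)
    n=0
    case $triad in r??) n=$((n + 4)) ;; esac
    case $triad in ?w?) n=$((n + 2)) ;; esac
    case $triad in ??x) n=$((n + 1)) ;; esac
    out="$out$n"
  done
  echo "$out"
)

# Usage on a cluster (requires the hdfs command):
#   perm_to_octal "$(hdfs dfs -ls -d /tmp/gla | awk '{print $1}')"
```

If the result is not 733, the permission can be corrected with hdfs dfs -chmod 733 /tmp/gla.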

Creating temporary directories

Create the directory specified by hive.server2.logging.operation.log.location on the local file system if it does not already exist.

$ ls -alt /tmp/gla/operation_logs
ls: cannot access /tmp/gla/operation_logs: No such file or directory
$ mkdir -p /tmp/gla/operation_logs

Running Metastore

Metastore uses the port specified by the environment variable HIVE3_METASTORE_PORT in env.sh. Make sure that the port is not in use.

$ grep HIVE3_METASTORE_PORT env.sh
HIVE3_METASTORE_PORT=9830
$ ss -anpt | grep -E "LISTEN.*:9830"
$
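
The grep pattern above also matches longer port numbers that begin with 9830 (for example, 98301). A slightly stricter check can be sketched as follows, assuming the usual ss layout in which each line starts with the socket state and the local address is followed by whitespace. The function name port_listening is hypothetical.

```shell
# port_listening PORT  (hypothetical helper, not part of MR3)
# Reads `ss -ant`-style output on stdin and succeeds if some socket is in
# LISTEN state on exactly PORT, so that 9830 does not match 98301.
port_listening() {
  grep -Eq "^LISTEN[[:space:]].*:$1[[:space:]]"
}

# Usage, when ss is available:
#   if ss -ant | port_listening 9830; then echo "port 9830 is in use"; fi
```

If the check succeeds, either stop the process using the port or choose a different value for HIVE3_METASTORE_PORT in env.sh.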

Run Metastore with a Derby database using the --cluster option, and initialize it using the --init-schema option.

$ hive/metastore-service.sh start --cluster --hivesrc3 --init-schema

After a while, check if Metastore has successfully started.

$ tail /data2/gla/mr3-run/hive/metastore-service-result/hive-mr3--2022-08-19-10-13-25-3b7a616f/out-metastore.txt 


Initialization script completed
schemaTool completed
...
2022-08-19 10:13:34: Starting Hive Metastore Server
...
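
Rather than running tail repeatedly, the user can poll the output file until the start-up message appears. The following is a minimal sketch, assuming that the message "Starting Hive Metastore Server" shown above indicates a successful start. The function name wait_for_line is hypothetical, and the path of out-metastore.txt differs for every run.

```shell
# wait_for_line FILE TEXT TIMEOUT_SECONDS  (hypothetical helper, not part of MR3)
# Checks FILE once per second until it contains TEXT as a fixed string.
# Succeeds as soon as TEXT is found; fails after TIMEOUT_SECONDS.
wait_for_line() (
  remaining=$3
  while :; do
    if [ -f "$1" ] && grep -Fq -- "$2" "$1"; then
      exit 0
    fi
    if [ "$remaining" -le 0 ]; then
      exit 1
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
)

# Usage, where OUT is the path of out-metastore.txt for the current run:
#   wait_for_line "$OUT" "Starting Hive Metastore Server" 60 && echo "Metastore is up"
```

A timeout of 0 checks the file exactly once, which is equivalent to a single grep.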

When restarting Metastore, omit the --init-schema option so that existing Hive databases are reused. For example, the user can stop Metastore and restart it as follows. (If the stop command fails to kill Metastore for some reason, kill the process manually with kill -9.)

$ hive/metastore-service.sh stop --cluster --hivesrc3
$ hive/metastore-service.sh start --cluster --hivesrc3

The log file for Metastore is found under /tmp/gla.

$ ls /tmp/gla/hive.log
/tmp/gla/hive.log

Running HiveServer2

Run HiveServer2 using the --cluster option. To run MR3 DAGAppMaster in LocalProcess mode, add the --amprocess option, as we do in our example.

$ hive/hiveserver2-service.sh start --cluster --hivesrc3 --amprocess

After a while, check if HiveServer2 has successfully started by inspecting its log file.

$ grep -e "New MR3Session created" /data2/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3--2022-08-19-10-31-54-304bbf48/hive-logs/hive.log 
2022-08-19T01:32:17,493  INFO [main] session.MR3SessionManagerImpl: New MR3Session created: 1f99f10c-3b28-4890-a81a-a57950dd692c, gla
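
The session ID in this log line is useful later for locating the corresponding YARN application. As a sketch, it can be extracted as follows, assuming the log line keeps the format shown above (the ID followed by a comma and the user name). The function name mr3_session_id is hypothetical.

```shell
# mr3_session_id  (hypothetical helper, not part of MR3)
# Reads hive.log content on stdin and prints the ID of the most recently
# created MR3Session, or nothing if no such line is present.
mr3_session_id() {
  sed -n 's/.*New MR3Session created: \([^,]*\),.*/\1/p' | tail -n 1
}

# Usage, where LOG is the path of hive.log for the current run:
#   mr3_session_id < "$LOG"
```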

The user can find a new YARN application of type mr3 in the list of applications.

$ yarn application -list
...
application_1660836356025_0003	1f99f10c-3b28-4890-a81a-a57950dd692c	                 mr3	       gla	   default	           RUNNING	         UNDEFINED	             0%	                                N/A
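
In the sample output, the name of the application is the session ID reported in hive.log. Based on that observation, the application ID can be picked out of the listing as sketched below. The sketch assumes that the columns are separated by tab characters and padded with spaces, as in the output above. The function name mr3_app_id is hypothetical.

```shell
# mr3_app_id SESSION_ID  (hypothetical helper, not part of MR3)
# Reads `yarn application -list` output on stdin and prints the ID of each
# application of type mr3 whose name equals SESSION_ID.
# Assumes tab-separated columns: ID, name, type, ...
mr3_app_id() {
  awk -F'\t' -v name="$1" '{
    id = $1; app = $2; type = $3
    gsub(/^ +| +$/, "", id)      # strip padding around each column
    gsub(/^ +| +$/, "", app)
    gsub(/^ +| +$/, "", type)
    if (id ~ /^application_/ && app == name && type == "mr3") print id
  }'
}

# Usage on a cluster (requires the yarn command):
#   yarn application -list | mr3_app_id 1f99f10c-3b28-4890-a81a-a57950dd692c
```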

The user can also find the process for MR3 DAGAppMaster.

$ ps -ef | grep DAGAppMaster | grep mr3
gla      166193 165372 28 10:32 pts/0    00:00:10 /usr/jdk64/jdk1.8.0_112/jre/bin/java -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -Xmx8192m -Xms4096m -Dlog4j.configurationFile=mr3-container-log4j2.properties -Dmr3.root.logger=INFO -Dyarn.app.container.log.dir=/data2/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3--2022-08-19-10-31-54-304bbf48/application_1660836356025_0003 -Dsun.nio.ch.bugLevel='' com.datamonad.mr3.master.DAGAppMaster --session

The user can find the log file for MR3 DAGAppMaster.

$ ls /data2/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3--2022-08-19-10-31-54-304bbf48/application_1660836356025_0003/run.log 
/data2/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3--2022-08-19-10-31-54-304bbf48/application_1660836356025_0003/run.log

Running queries

Download a sample dataset.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv

Run Beeline using the --cluster option.

$ hive/run-beeline.sh --cluster --hivesrc3

Use the default database.

0: jdbc:hive2://blue0:9832/> use default;

Create a table called pokemon.

0: jdbc:hive2://blue0:9832/> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");

Import the sample dataset.

0: jdbc:hive2://blue0:9832/> load data local inpath './pokemon.csv' INTO table pokemon; 

Execute queries.

0: jdbc:hive2://blue0:9832/> select avg(HP) from pokemon;
...

0: jdbc:hive2://blue0:9832/> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...

0: jdbc:hive2://blue0:9832/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...

Exit Beeline. The warehouse directory on HDFS now has two subdirectories.

$ hdfs dfs -ls /user/gla/warehouse/
Found 2 items
drwxr-xr-x   - gla hdfs          0 2022-08-19 01:34 /user/gla/warehouse/pokemon
drwxr-xr-x   - gla hdfs          0 2022-08-19 01:35 /user/gla/warehouse/pokemon1

Stopping HiveServer2 and Metastore

Stop HiveServer2. MR3 DAGAppMaster also stops.

$ hive/hiveserver2-service.sh stop --cluster --hivesrc3

Stop Metastore.

$ hive/metastore-service.sh stop --cluster --hivesrc3

The user can check if Metastore has successfully stopped by reading its log file.

$ tail -n3 /tmp/gla/hive.log
/************************************************************
SHUTDOWN_MSG: Shutting down HiveMetaStore at blue0/192.168.10.101
************************************************************/