On a Local Machine

This page shows how to operate Hive on MR3 on a single machine. All the components of Hive on MR3, such as Metastore, HiveServer2, and MR3 DAGAppMaster, run on the same machine. By following the instructions, the user will learn:

how to install and configure Hive on MR3 on a single machine
how to start and stop Metastore with a Derby database
how to start and stop HiveServer2
how to create Beeline connections and send queries to HiveServer2
the difference between LocalThread mode and LocalProcess mode

This scenario has the following prerequisite:

Java 8 or Java 17 is available.

Note that the user does not need a working Hadoop installation.
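To confirm the Java prerequisite before starting, the following sketch extracts the major version from the output of java -version. The helper name java_major is hypothetical and not part of MR3; it assumes the usual first line of the form `... version "1.8.0_242"` or `... version "17.0.2"`.

```shell
#!/usr/bin/env bash
# java_major TEXT: print the major Java version found in `java -version` output.
# Handles the legacy scheme ("1.8.0_242" -> 8) and the modern one ("17.0.2" -> 17).
java_major() {
  local ver
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*) ver=${ver#1.}; printf '%s\n' "${ver%%.*}" ;;
    *)   printf '%s\n' "${ver%%[.-]*}" ;;
  esac
}

# `java -version` prints to stderr, so redirect it to stdout.
if command -v java >/dev/null 2>&1; then
  major=$(java_major "$(java -version 2>&1)")
  case "$major" in
    8|17) echo "Java $major found" ;;
    *)    echo "Java '$major' found, but this page assumes Java 8 or Java 17" ;;
  esac
else
  echo "java not found on PATH"
fi
```

If java is not on PATH, JAVA_HOME can still be set explicitly in env.sh as described later on this page.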

This scenario should take less than 30 minutes to complete, not including the time for downloading a Hadoop binary distribution and an MR3 release.

If you have any questions, please visit the MR3 Google Group or join MR3 Slack.

Installation

Download a binary distribution of Hadoop 3 and uncompress it.

Hive 3 on MR3

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
$ gunzip -c hadoop-3.3.1.tar.gz | tar xvf -

Hive 4 on MR3

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
$ gunzip -c hadoop-3.3.6.tar.gz | tar xvf -

Download a pre-built MR3 release and uncompress it. We rename the new directory to mr3-run and change the working directory. Renaming the new directory is not strictly necessary, but it is recommended because the sample configuration file hive-site.xml included in the MR3 release uses the same directory name.

Hive 3 on MR3, Java 8

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.11-hive3.1.3-k8s.tar.gz | tar xvf -
$ mv hivemr3-1.11-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

Hive 3 on MR3, Java 17

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-java17-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.11-java17-hive3.1.3-k8s.tar.gz | tar xvf -
$ mv hivemr3-1.11-java17-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

Hive 4 on MR3, Java 17

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-java17-hive4.0.0-k8s.tar.gz
$ gunzip -c hivemr3-1.11-java17-hive4.0.0-k8s.tar.gz | tar xvf -
$ mv hivemr3-1.11-java17-hive4.0.0-k8s/ mr3-run
$ cd mr3-run/
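The three variants above differ only in the name of the tarball. As a rough sketch, the choice can be captured in a small helper that maps a Java version and a Hive major version to the file name listed on this page. The function name mr3_release_name is hypothetical, and only the three combinations shown above are covered.

```shell
# mr3_release_name JAVA_VERSION HIVE_MAJOR: print the tarball name used on this page.
mr3_release_name() {
  case "$1:$2" in
    8:3)  echo "hivemr3-1.11-hive3.1.3-k8s.tar.gz" ;;
    17:3) echo "hivemr3-1.11-java17-hive3.1.3-k8s.tar.gz" ;;
    17:4) echo "hivemr3-1.11-java17-hive4.0.0-k8s.tar.gz" ;;
    *)    echo "no release listed on this page for Java $1 / Hive $2" >&2
          return 1 ;;
  esac
}

# Usage (downloads, unpacks, and renames the directory as described above):
#   name=$(mr3_release_name 17 4) || exit 1
#   wget "https://github.com/mr3project/mr3-release/releases/download/v1.11/$name"
#   gunzip -c "$name" | tar xvf -
#   mv "${name%.tar.gz}" mr3-run
```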

Configuring env.sh

Open env.sh and set JAVA_HOME and PATH if necessary. Set HADOOP_HOME to the Hadoop installation directory created earlier.

Hive 3 on MR3, Java 8

$ vi env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/data1/gla/hadoop-3.3.1

USE_JAVA_17=false

Hive 3 on MR3, Java 17

$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk17   # Java 17
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/data1/gla/hadoop-3.3.1

USE_JAVA_17=true

Hive 4 on MR3, Java 17

$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk17   # Java 17
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/data1/gla/hadoop-3.3.6

USE_JAVA_17=true
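After editing env.sh, a quick check that both directories exist can catch typos early. The sketch below assumes that env.sh can be sourced on its own from the mr3-run directory; the function name check_env is hypothetical.

```shell
# check_env: verify that JAVA_HOME and HADOOP_HOME point to usable directories.
check_env() {
  local rc=0
  if [ -z "${JAVA_HOME:-}" ] || [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME does not contain bin/java: ${JAVA_HOME:-<unset>}" >&2
    rc=1
  fi
  if [ -z "${HADOOP_HOME:-}" ] || [ ! -d "$HADOOP_HOME" ]; then
    echo "HADOOP_HOME is not a directory: ${HADOOP_HOME:-<unset>}" >&2
    rc=1
  fi
  return "$rc"
}

# Usage (from the mr3-run directory):
#   source ./env.sh && check_env && echo "env.sh looks consistent"
```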

For Java 8 only

Update the configuration key mr3.am.launch.cmd-opts in conf/local/mr3/mr3-site.xml.

add -XX:+AggressiveOpts for performance.
remove --add-opens java.base/java.net=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.time=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED ... (which are Java 17 options).

Update the configuration key mr3.am.launch.env in conf/local/mr3/mr3-site.xml.

remove JAVA_HOME=/home/hive/jdk17/.

For Java 17 only

Update the configuration key mr3.am.launch.env in conf/local/mr3/mr3-site.xml.

set JAVA_HOME=/home/hive/jdk17/ to point to the installation directory of Java 17 on every worker node.

Configuring Hive on MR3

In env.sh, set the following environment variables to adjust the memory size (in MB) to be allocated to each component:

HIVE_METASTORE_HEAPSIZE specifies the memory size for Metastore.
HIVE_SERVER2_HEAPSIZE specifies the memory size for HiveServer2.
HIVE_CLIENT_HEAPSIZE specifies the memory size of Beeline (and also HiveCLI).
MR3_AM_HEAPSIZE specifies the memory size of MR3 DAGAppMaster.

Since MR3 DAGAppMaster is to run as a thread inside HiveServer2, MR3_AM_HEAPSIZE should be strictly smaller than HIVE_SERVER2_HEAPSIZE. By default, we use the following values.

HIVE_METASTORE_HEAPSIZE=4096
HIVE_SERVER2_HEAPSIZE=16384
HIVE_CLIENT_HEAPSIZE=1024
MR3_AM_HEAPSIZE=10240

Open conf/local/mr3/mr3-site.xml and set the configuration keys mr3.am.local.resourcescheduler.max.memory.mb and mr3.am.local.resourcescheduler.max.cpu.cores, which determine the memory size (in MB) and the number of cores to be allocated to all ContainerWorkers. Since all ContainerWorkers are to run inside MR3 DAGAppMaster, mr3.am.local.resourcescheduler.max.memory.mb should be strictly smaller than MR3_AM_HEAPSIZE in env.sh. In contrast, mr3.am.local.resourcescheduler.max.cpu.cores specifies virtual resources and can be set to any value.

$ vi conf/local/mr3/mr3-site.xml

<property>
  <name>mr3.am.local.resourcescheduler.max.memory.mb</name>
  <value>8192</value>
</property>

<property>
  <name>mr3.am.local.resourcescheduler.max.cpu.cores</name>
  <value>4</value>
</property>
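The two memory constraints stated so far form a chain: ContainerWorker memory < MR3_AM_HEAPSIZE < HIVE_SERVER2_HEAPSIZE. The sketch below checks that chain. Both function names are hypothetical, and xml_value is a simple line-oriented reader that assumes <name> and <value> sit on adjacent lines as in the sample files; it is not a general XML parser.

```shell
# xml_value FILE KEY: print the <value> that follows <name>KEY</name>.
xml_value() {
  awk -v key="$2" '
    index($0, "<name>" key "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, ""); print; exit
    }
  ' "$1"
}

# check_memory_chain WORKER_MB AM_MB HS2_MB: succeed only if
# WORKER_MB < AM_MB < HS2_MB, as required when DAGAppMaster runs inside HiveServer2.
check_memory_chain() {
  [ "$1" -lt "$2" ] && [ "$2" -lt "$3" ]
}

# Usage (from the mr3-run directory, after editing env.sh and mr3-site.xml):
#   source ./env.sh
#   worker_mb=$(xml_value conf/local/mr3/mr3-site.xml mr3.am.local.resourcescheduler.max.memory.mb)
#   check_memory_chain "$worker_mb" "$MR3_AM_HEAPSIZE" "$HIVE_SERVER2_HEAPSIZE" \
#     || echo "memory settings are inconsistent"
```

With the default values on this page (8192, 10240, 16384), the check succeeds.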

Open conf/local/hive3/hive-site.xml (for Hive 3 on MR3) or conf/local/hive4/hive-site.xml (for Hive 4 on MR3) and set the following four configuration keys according to the current user name (instead of the default user hive) and the working directory (instead of the default directory /home/hive).

Hive 3 on MR3

$ vi conf/local/hive3/hive-site.xml 

<property>
  <name>hive.users.in.admin.role</name>
  <value>gla</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>/data1/gla/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-common-3.1.3.jar,/data1/gla/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-server-3.1.3.jar,/data1/gla/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-tez-3.1.3.jar</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/gla</value>
</property>

<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/gla/operation_logs</value>
</property>

Hive 4 on MR3

$ vi conf/local/hive4/hive-site.xml 

<property>
  <name>hive.users.in.admin.role</name>
  <value>gla</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>/data1/gla/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/hive-llap-common-4.0.0.jar,/data1/gla/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/hive-llap-server-4.0.0.jar,/data1/gla/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/hive-llap-tez-4.0.0.jar,/data1/gla/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/hive-iceberg-handler-4.0.0.jar,/data1/gla/mr3-run/hive/hivejar/apache-hive-4.0.0-bin/lib/log4j-1.2-api-2.18.0.jar</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/gla</value>
</property>

<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/gla/operation_logs</value>
</property>

In hive-site.xml, the following configuration keys specify the resources to be allocated to a Map Task, a Reduce Task, and a ContainerWorker. By default, we allocate 2GB and a single core to each of them.

<property>
  <name>hive.mr3.map.task.memory.mb</name>
  <value>2048</value>
</property>

<property>
  <name>hive.mr3.map.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.reduce.task.memory.mb</name>
  <value>2048</value>
</property>

<property>
  <name>hive.mr3.reduce.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>2048</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.vcores</name>
  <value>1</value>
</property>

When updating these configuration keys, we should meet the following requirements:

hive.mr3.map.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
hive.mr3.map.task.vcores <= hive.mr3.all-in-one.containergroup.vcores
hive.mr3.reduce.task.memory.mb <= hive.mr3.all-in-one.containergroup.memory.mb
hive.mr3.reduce.task.vcores <= hive.mr3.all-in-one.containergroup.vcores
hive.mr3.all-in-one.containergroup.memory.mb <= mr3.am.local.resourcescheduler.max.memory.mb
hive.mr3.all-in-one.containergroup.vcores <= mr3.am.local.resourcescheduler.max.cpu.cores
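The six requirements above can be checked mechanically before starting HiveServer2. The following is a minimal sketch with hypothetical function names; the values are passed in by hand here rather than read from the configuration files.

```shell
# check_le NAME_A VALUE_A NAME_B VALUE_B: report a violation of VALUE_A <= VALUE_B.
check_le() {
  if [ "$2" -le "$4" ]; then
    return 0
  fi
  echo "violated: $1 ($2) <= $3 ($4)" >&2
  return 1
}

# check_task_resources MAP_MB MAP_VC REDUCE_MB REDUCE_VC GROUP_MB GROUP_VC MAX_MB MAX_CORES
check_task_resources() {
  local rc=0
  check_le hive.mr3.map.task.memory.mb "$1" hive.mr3.all-in-one.containergroup.memory.mb "$5" || rc=1
  check_le hive.mr3.map.task.vcores "$2" hive.mr3.all-in-one.containergroup.vcores "$6" || rc=1
  check_le hive.mr3.reduce.task.memory.mb "$3" hive.mr3.all-in-one.containergroup.memory.mb "$5" || rc=1
  check_le hive.mr3.reduce.task.vcores "$4" hive.mr3.all-in-one.containergroup.vcores "$6" || rc=1
  check_le hive.mr3.all-in-one.containergroup.memory.mb "$5" mr3.am.local.resourcescheduler.max.memory.mb "$7" || rc=1
  check_le hive.mr3.all-in-one.containergroup.vcores "$6" mr3.am.local.resourcescheduler.max.cpu.cores "$8" || rc=1
  return "$rc"
}

# Default values from this page: 2048 MB / 1 core per task and per ContainerWorker,
# with 8192 MB and 4 cores available to all ContainerWorkers.
check_task_resources 2048 1 2048 1 2048 1 8192 4 && echo "resource settings are consistent"
```

Raising hive.mr3.map.task.memory.mb to 4096 without also raising the ContainerWorker size would make the first check fail.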

Creating temporary directories

Create a new directory specified by hive.exec.scratchdir and set its permission to 733. Make sure that /tmp has enough free space.

$ ls /tmp/gla
ls: cannot access /tmp/gla: No such file or directory
$ mkdir -p /tmp/gla
$ chmod 733 /tmp/gla

Create a new directory specified by hive.server2.logging.operation.log.location.

$ ls -alt /tmp/gla/operation_logs
ls: cannot access /tmp/gla/operation_logs: No such file or directory
$ mkdir -p /tmp/gla/operation_logs

Running Metastore

Run Metastore with a Derby database using the --local option, and initialize its schema using the --init-schema option.

$ hive/metastore-service.sh start --local --init-schema

After a while, check if Metastore has successfully started.

$ more /data1/gla/mr3-run/hive/metastore-service-result/hive-mr3-91c8fdd-2022-12-14-16-42-19-946ee117/out-metastore.txt 
...
Metastore connection URL:	 jdbc:derby:;databaseName=/data1/gla/mr3-run/hive/hive-local-data/metastore5/hive3mr3;create=true
Metastore Connection Driver :	 org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:	 APP
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.derby.sql
...

Initialization script completed
schemaTool completed
...
2022-12-14 16:42:29: Starting Hive Metastore Server

Check the log file for Metastore:

$ ls /tmp/gla/hive.log
/tmp/gla/hive.log

Running HiveServer2

Run HiveServer2 using the --local option.

$ hive/hiveserver2-service.sh start --local

After a while, check if HiveServer2 has successfully started.

$ grep "New MR3Session created" /data1/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3-91c8fdd-2022-12-14-16-43-58-8d94537e/hive-logs/hive.log 
2022-12-14T16:44:07,491  INFO [main] session.MR3SessionManagerImpl: New MR3Session created: 47a6018e-5e93-44df-966a-d7a3341f4d71, gla
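Instead of checking the log by hand "after a while", one can poll it until the expected line appears. The sketch below is one way to do this, with a hypothetical helper wait_for_log. It works equally for Metastore (pattern "Starting Hive Metastore Server" in out-metastore.txt) and HiveServer2 (pattern "New MR3Session created" in hive.log), assuming the log locations shown above.

```shell
# wait_for_log FILE PATTERN [TIMEOUT_SECONDS]: poll once per second until a line
# matching PATTERN appears in FILE; fail if TIMEOUT_SECONDS (default 120) elapse.
wait_for_log() {
  local file="$1" pattern="$2" timeout="${3:-120}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ -f "$file" ] && grep -q -- "$pattern" "$file"; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Usage: the result directory name contains a timestamp and differs on every run,
# so pick the most recently modified one.
#   dir=$(ls -td hive/hiveserver2-service-result/*/ | head -n 1)
#   wait_for_log "${dir}hive-logs/hive.log" "New MR3Session created" 300 \
#     && echo "HiveServer2 is ready"
```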

Running Beeline

Download a sample dataset.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv

Run Beeline using the --local option.

$ hive/run-beeline.sh --local

# Running Beeline using Hive-MR3 (3.1.3) #

...
Connecting to jdbc:hive2://gold7:9832/;;
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://gold7:9832/>

Use the default database.

0: jdbc:hive2://gold7:9832/> show databases;
...
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (1.544 seconds)

0: jdbc:hive2://gold7:9832/> use default;
...
No rows affected (0.035 seconds)

Create a table called pokemon.

0: jdbc:hive2://gold7:9832/> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");
...
No rows affected (1.126 seconds)

Import the sample dataset.

0: jdbc:hive2://gold7:9832/> load data local inpath './pokemon.csv' INTO table pokemon;
...
No rows affected (0.59 seconds)

Execute queries.

0: jdbc:hive2://gold7:9832/> select avg(HP) from pokemon;
...

0: jdbc:hive2://gold7:9832/> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...

0: jdbc:hive2://gold7:9832/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...
+------+-------------+
| _c0  | power_rate  |
+------+-------------+
| 363  | strong      |
| 336  | weak        |
| 108  | moderate    |
+------+-------------+
3 rows selected (0.568 seconds)
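As a sanity check outside Hive, the result of select avg(HP) can be compared against a direct computation on the CSV file. In the sketch below, avg_hp is a hypothetical helper. It relies on HP being the sixth column, as in the CREATE TABLE statement above, and skips the header line just as skip.header.line.count does. Unlike Hive, it treats a non-numeric HP field as 0 rather than NULL, so the two results can differ if the file contains malformed rows.

```shell
# avg_hp FILE: average of the 6th comma-separated field (HP), skipping the header.
avg_hp() {
  awk -F',' '
    NR > 1 && $6 != "" { sum += $6; n++ }
    END { if (n > 0) printf "%.4f\n", sum / n }
  ' "$1"
}

# Usage (in the directory where pokemon.csv was downloaded):
#   avg_hp ./pokemon.csv
```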

The user can find the directory for Metastore and the warehouse directory under hive/hive-local-data/.

$ ls hive/hive-local-data/
metastore5  warehouse
$ ls hive/hive-local-data/warehouse/
pokemon  pokemon1

Stopping HiveServer2

Stop HiveServer2.

$ hive/hiveserver2-service.sh stop --local

Stopping HiveServer2 automatically stops MR3 DAGAppMaster as well because it runs as a thread inside HiveServer2, that is, in LocalThread mode. Note, however, that Metastore is still running.

Using LocalProcess mode for MR3 DAGAppMaster

In LocalProcess mode, MR3 DAGAppMaster runs as a separate process rather than a thread inside HiveServer2. Hence HiveServer2 does not need additional resources for accommodating MR3 DAGAppMaster. Open env.sh and adjust the memory size for HiveServer2.

$ vi env.sh

HIVE_SERVER2_HEAPSIZE=8192

Open conf/local/mr3/mr3-site.xml and set the configuration key mr3.master.mode to local-process.

$ vi conf/local/mr3/mr3-site.xml 

<property>
  <name>mr3.master.mode</name>
  <value>local-process</value>
</property>

Run HiveServer2 using the --amprocess option in addition to --local.

$ hive/hiveserver2-service.sh start --local --amprocess

After a while, the user can find the log file for MR3 DAGAppMaster.

$ ls /data1/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3-91c8fdd-2022-12-14-16-47-00-0c3081d1/*/run.log
/data1/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3-91c8fdd-2022-12-14-16-47-00-0c3081d1/application_25621671004028547_0001/run.log

The user can also find the process for MR3 DAGAppMaster.

$ ps -ef | grep DAGAppMaster | grep mr3
gla       2712  2562 14 16:47 pts/0    00:00:05 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/jre/bin/java -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Xmx20480m -Xms10240m -Dlog4j.configurationFile=mr3-container-log4j2.properties -Dmr3.root.logger=INFO -Dyarn.app.container.log.dir=/data1/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3-91c8fdd-2022-12-14-16-47-00-0c3081d1/application_25621671004028547_0001 -Dsun.nio.ch.bugLevel='' com.datamonad.mr3.master.DAGAppMaster --session

Run Beeline and send queries to HiveServer2.

$ hive/run-beeline.sh --local