This page shows how to operate Hive on MR3 in standalone mode using a MySQL database for Metastore. Metastore, HiveServer2, and MR3 DAGAppMaster will be running on a master node, whereas all MR3 ContainerWorkers will be running on worker nodes. By following these instructions, the user will learn:

  1. how to install and configure Hive on MR3 in standalone mode
  2. how to start and stop Metastore with a MySQL compatible database
  3. how to start and stop HiveServer2

This scenario has the following prerequisites:

  • Java 17 is available. We assume that Java 17 is installed in the same directory on every node.
  • The user has access to a distributed storage such as HDFS or S3. In our example, we use S3.
  • The user has access to the MySQL server with all privileges.

In our example, we use a MySQL server for Metastore, but Postgres and MS SQL can also be used.

This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release.

To ask questions, visit the MR3 Google Group or join MR3 Slack.

Overview

Hive on MR3 uses the following components: Metastore, HiveServer2, MR3 DAGAppMaster, and MR3 ContainerWorkers. In our example, we use a cluster of 5 nodes. orange1 is the master node where Metastore, HiveServer2, and MR3 DAGAppMaster will be running, and orange2 to orange5 are worker nodes where MR3 ContainerWorkers will be running.

192.168.10.1  orange1   # master node
192.168.10.2  orange2   # worker node
192.168.10.3  orange3   # worker node
192.168.10.4  orange4   # worker node
192.168.10.5  orange5   # worker node

The MR3 distribution for standalone mode uses Java 17. In our example, Java 17 is installed in the directory /home/hive/jdk17/.
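
To confirm that Java 17 is indeed installed in the same directory on every node, a quick check along these lines may help (a sketch assuming passwordless SSH and the hostnames of this example):

$ for i in {1..5}; do ssh orange$i "/home/hive/jdk17/bin/java -version"; done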

Hive on MR3 requires three types of storage:

  • Data source such as HDFS or S3
  • Distributed storage for storing transient data
  • Local directories on worker nodes for storing intermediate data

In our example, we use S3 both as the data source and as the distributed storage for transient data. For the data source, we use the S3 bucket s3a://hivemr3/warehouse; for storing transient data, we use the S3 bucket s3a://hivemr3/scratch.
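
Both locations reside in the single S3 bucket hivemr3. If the bucket does not exist yet, it can be created in advance, e.g., with the AWS CLI (a sketch assuming the AWS CLI is installed and credentials are configured; for S3-compatible storage, add --endpoint-url pointing to the storage server):

$ aws s3 mb s3://hivemr3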

We require that every worker node has an identical set of local directories for storing intermediate data. In our example, worker nodes use a local directory /data1/k8s.
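
For instance, the local directory can be created on every worker node as follows (a sketch assuming passwordless SSH; the directory must be writable by the user running ContainerWorkers):

$ for i in {2..5}; do ssh orange$i "mkdir -p /data1/k8s"; done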

Accessing MySQL

In our example, we assume that a MySQL user root has access to a MySQL database with all privileges. Later we will configure Metastore so that it connects to the MySQL server using the user name root. The MySQL server may run on any node, not necessarily on the node where Metastore or HiveServer2 will be running.

$ mysql -h 192.168.10.1 -u root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 214981
Server version: 5.7.40-0ubuntu0.18.04.1 (Ubuntu)
...
mysql> SHOW GRANTS FOR 'root'@'%';
+-------------------------------------------------------------+
| Grants for root@%                                           |
+-------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' WITH GRANT OPTION |
+-------------------------------------------------------------+
1 row in set (0.00 sec)
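
If such a user does not exist yet, a database administrator can create one along these lines (standard MySQL syntax; the password passwd is a placeholder to be replaced):

mysql> CREATE USER 'root'@'%' IDENTIFIED BY 'passwd';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' WITH GRANT OPTION;
mysql> FLUSH PRIVILEGES;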

Installation

Download an MR3 distribution for standalone mode on the master node and uncompress it.

A pre-built MR3 release is intended for evaluating Hive on MR3 with a limit on the aggregate capacity of MR3 workers. For running Hive on MR3 in production, contact us.

$ hostname
orange1
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-java17-hive3.1.3-process.tar.gz
$ gunzip -c hivemr3-1.11-java17-hive3.1.3-process.tar.gz | tar xvf -

Copy the MR3 distribution to every worker node and uncompress it. In our example, passwordless connection to worker nodes is already set up.

$ for i in {2..5}; do scp hivemr3-1.11-java17-hive3.1.3-process.tar.gz hive@orange$i:/home/hive/; done
$ for i in {2..5}; do ssh orange$i "cd /home/hive/; gunzip -c hivemr3-1.11-java17-hive3.1.3-process.tar.gz | tar xvf -;"; done

Change to the directory process/hive under the installation.

$ cd hivemr3-1.11-java17-hive3.1.3-process/process/hive

Setting environment variables

The first step in configuring Hive on MR3 is to set environment variables in env.sh. env.sh is read not only by HiveServer2 (running on the master node) but also by ContainerWorkers (running on worker nodes).

$ vi env.sh

export JAVA_HOME=/home/hive/jdk17/
  • Set JAVA_HOME to the installation directory of Java, which must be valid on every node.
$ vi env.sh

HIVE_DATABASE_HOST=192.168.10.1
HIVE_DATABASE_NAME=hive3mr3

HIVE_WAREHOUSE_DIR=s3a://hivemr3/warehouse
HIVE_SCRATCH_DIR=s3a://hivemr3/scratch
  • Set HIVE_DATABASE_HOST to the IP address of the MySQL server for Metastore. In our example, the MySQL server is running at the IP address 192.168.10.1.
  • Set HIVE_DATABASE_NAME to the database name for Metastore in the MySQL server.
  • Set HIVE_WAREHOUSE_DIR to the S3 bucket (or the HDFS directory) storing the warehouse. Note that for S3, the prefix must be s3a, not s3.
  • Set HIVE_SCRATCH_DIR to the S3 bucket (or the HDFS directory) for storing transient data. If an NFS volume is mounted in the same directory on every node, HIVE_SCRATCH_DIR may be set to the mount point (e.g., /home/nfs/hivemr3).
$ vi env.sh

HIVE_METASTORE_HEAPSIZE=16384
HIVE_METASTORE_DB_TYPE=mysql

HIVE_SERVER2_HEAPSIZE=36864
MR3_AM_HEAPSIZE=32768
  • HIVE_METASTORE_HEAPSIZE specifies the memory size (in MB) of Metastore.
  • Set HIVE_METASTORE_DB_TYPE to the database type for Metastore, which is used as an argument to schematool (mysql for MySQL, postgres for Postgres, mssql for MS SQL).
  • HIVE_SERVER2_HEAPSIZE specifies the memory size (in MB) of HiveServer2.
  • MR3_AM_HEAPSIZE specifies the memory size (in MB) of DAGAppMaster.

The following environment variables are read by ContainerWorkers (which implies that env.sh will be copied to worker nodes later).

$ vi env.sh

export PROCESS_CONTAINER_WORKER_SECRET=worker-secret
export PROCESS_CONTAINER_WORKER_SERVER_HOST=192.168.10.1

export PROCESS_CONTAINER_WORKER_MEMORY_MB=32768
export PROCESS_CONTAINER_WORKER_CPU_CORES=8
export PROCESS_CONTAINER_WORKER_MEMORY_XMX=26214

export PROCESS_CONTAINER_WORKER_LOCAL_DIRS=/data1/k8s
  • PROCESS_CONTAINER_WORKER_SECRET specifies a string to be used as a secret for communicating with DAGAppMaster.
  • PROCESS_CONTAINER_WORKER_SERVER_HOST specifies the IP address of DAGAppMaster. In our example, it is set to the IP address of the master node orange1.
  • PROCESS_CONTAINER_WORKER_MEMORY_MB specifies the memory size (in MB) to be assumed by each ContainerWorker. That is, a ContainerWorker assumes that it may use as much memory as specified by PROCESS_CONTAINER_WORKER_MEMORY_MB.
  • PROCESS_CONTAINER_WORKER_CPU_CORES specifies the number of cores to be assumed by each ContainerWorker. That is, a ContainerWorker assumes that it may use as many cores as specified by PROCESS_CONTAINER_WORKER_CPU_CORES.
  • PROCESS_CONTAINER_WORKER_MEMORY_XMX specifies the argument for the Java -Xmx option. That is, it specifies the size of heap memory to be allocated to each ContainerWorker. The user should set PROCESS_CONTAINER_WORKER_MEMORY_XMX to a fraction (e.g., 0.8) of PROCESS_CONTAINER_WORKER_MEMORY_MB. In our example, each ContainerWorker starts with the Java option -Xmx26214 (see the arithmetic after this list).
  • PROCESS_CONTAINER_WORKER_LOCAL_DIRS should be set to a (comma-separated) list of local directories for storing intermediate data in ContainerWorkers.
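
For example, with PROCESS_CONTAINER_WORKER_MEMORY_MB=32768, a fraction of 0.8 yields 32768 × 0.8 ≈ 26214, which is the value of PROCESS_CONTAINER_WORKER_MEMORY_XMX used above.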

Configuring S3 (optional)

For accessing S3-compatible storage, additional configuration keys should be set in conf/core-site.xml. Open conf/core-site.xml and set configuration keys for S3. The configuration key fs.s3a.endpoint should be set to point to the storage server.

$ vi conf/core-site.xml

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.amazonaws.auth.EnvironmentVariableCredentialsProvider</value>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>http://192.168.10.100:9000</value>
</property>

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>

The user may need to adjust the parameters for accessing S3 to avoid the error SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool. For more details, see Troubleshooting.

When using the class EnvironmentVariableCredentialsProvider to read AWS credentials, the two environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in env.sh should be set to the access key ID and the secret access key, respectively.

$ vi env.sh

export AWS_ACCESS_KEY_ID=_your_aws_access_key_id_
export AWS_SECRET_ACCESS_KEY=_your_aws_secret_secret_key_
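
Before proceeding, it may be worth sanity-checking the credentials and the endpoint, e.g., with the AWS CLI if it is installed (a sketch; the AWS CLI is not part of the MR3 distribution):

$ aws s3 ls s3://hivemr3/ --endpoint-url http://192.168.10.100:9000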

Distributing env.sh to worker nodes

Since env.sh is also read by ContainerWorkers, copy it to the same directory under the installation on every worker node.

$ for i in {2..5}; do scp env.sh hive@orange$i:/home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive; done
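
To double-check that every worker node ends up with an identical env.sh, one can compare checksums (a sketch using the hostnames of this example):

$ for i in {2..5}; do ssh orange$i "md5sum /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/env.sh"; done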

We do not need to copy configuration files such as hive-site.xml to worker nodes because MR3 DAGAppMaster passes them directly to ContainerWorkers.

Resources for mappers and reducers

Adjust the resources to be allocated to each mapper, reducer, and ContainerWorker by updating conf/hive-site.xml. In particular, the configuration keys hive.mr3.all-in-one.containergroup.memory.mb and hive.mr3.all-in-one.containergroup.vcores should be adjusted to match PROCESS_CONTAINER_WORKER_MEMORY_MB and PROCESS_CONTAINER_WORKER_CPU_CORES in env.sh. hive.mr3.use.daemon.shufflehandler can be set to the same value as hive.mr3.all-in-one.containergroup.vcores.

$ vi conf/hive-site.xml

<property>
  <name>hive.mr3.map.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.map.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.reduce.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.reduce.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>32768</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.vcores</name>
  <value>8</value>
</property>

<property>
  <name>hive.mr3.use.daemon.shufflehandler</name>
  <value>8</value>
</property>
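
With these settings, each ContainerWorker of 32768 MB and 8 cores can run up to 32768 / 4096 = 8 concurrent tasks, with each task taking 4096 MB and 1 vcore.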

For more details, see Performance Tuning.

Configuring Metastore

Open conf/hive-site.xml and update the configuration for Metastore as necessary. Below we list some of the configuration keys that the user should check. The two configuration keys javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword should match the user name and password of the MySQL server for Metastore. For simplicity, we disable security on the Metastore side.

$ vi conf/hive-site.xml

<property>
  <name>hive.metastore.db.type</name>
  <value>MYSQL</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>passwd</value>
</property>

<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value></value>
</property>
<property>
  <name>metastore.pre.event.listeners</name>
  <value></value>
</property>

Configuring HiveServer2

Check the configuration for authentication and authorization:

$ vi conf/hive-site.xml

<property>
  <name>hive.security.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator</value>
</property>

<property>
  <name>hive.security.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory</value>
</property>

Running Metastore and HiveServer2

The user can start Metastore by executing the script hive/metastore-service.sh. When using MySQL for Metastore, the script automatically downloads a MySQL connector from https://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-8.0.28.tar.gz. After Metastore starts, the user can start HiveServer2 by executing the script hive/hiveserver2-service.sh.

$ hive/metastore-service.sh start --init-schema
...
Output directory: /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/run-result/metastore/hivemr3-2024-01-05-17-15-43-f1ee0af8
Starting Metastore...

$ hive/hiveserver2-service.sh start --localprocess
...
Output directory: /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/run-result/hiveserver2/hivemr3-2024-01-05-17-15-47-0a8ef05e
Starting HiveServer2...

  • Use the --init-schema option to initialize the database schema when running Metastore for the first time. Omit it in order to reuse existing Hive databases.
  • Use the --localprocess option to run DAGAppMaster as a separate process in LocalProcess mode. Without it, DAGAppMaster runs as a thread inside the HiveServer2 process.
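
To confirm that HiveServer2 is up before connecting a client, one can check whether its Thrift port (9852 in this example, as used by Beeline below) is listening (a sketch; ss is a standard Linux utility, not part of the scripts):

$ ss -tln | grep 9852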

After a while, a DAGAppMaster process is created, and the user can inspect its log.

$ tail -f /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/run-result/hiveserver2/am-local-log-dir/application_221851685348155540_0001/run.log
...
2024-01-05 17:20:00,380 [DAGAppMaster-1-6] INFO  HeartbeatHandler$ [] - Timeout check in HeartbeatHandler:Container

Running ContainerWorkers

In standalone mode, the user should launch ContainerWorkers manually. For example, the user can create a script that starts a ContainerWorker on every worker node.

$ vi run.sh

#!/bin/bash

for i in {2..5}; do
  ssh orange$i "cd /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/; ./run-worker.sh;" &
done
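
Then make the script executable and run it on the master node:

$ chmod +x run.sh
$ ./run.sh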

ContainerWorkers are not registered with DAGAppMaster immediately after they start. Instead, DAGAppMaster registers them once the first query is submitted, as shown in its log:

$ tail /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/run-result/hiveserver2/am-local-log-dir/application_221851685348155540_0001/run.log
...
2024-01-05 17:53:26,523 [IPC Server handler 0 on default port 19666] INFO  ProcessCommunicatorServerClient$ [] - tryRegisterProcess(): 23ceb70e-ac97-4eee-b254-f8bd021e343a, 192.168.10.2, 32768MB, 8 cores


Running Beeline

The user may use any client program to connect to HiveServer2. In our example, we run Beeline included in the installation.

$ hive/run-beeline.sh
...
# Running Beeline using Hive-MR3 #
...
Connecting to jdbc:hive2://orange1:9852/;;;
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://orange1:9852/>

Now the user can send queries to HiveServer2 normally.

Stopping Hive on MR3

In standalone mode, the user is responsible for starting and terminating all of Metastore, HiveServer2, DAGAppMaster, and ContainerWorkers. Thus the user should terminate these processes manually with the kill command.
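
For example, JVM processes can be located with jps and terminated by PID (a sketch; match the processes to Metastore, HiveServer2, DAGAppMaster, and ContainerWorkers by their main class names):

$ jps -l                                             # list JVM processes on the master node
$ kill <pid>                                         # terminate each process by its PID
$ for i in {2..5}; do ssh orange$i "jps -l"; done    # locate ContainerWorker JVMs on worker nodes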