This page shows how to operate Hive on MR3 in standalone mode using a MySQL database for Metastore. Metastore, HiveServer2, and MR3 DAGAppMaster will be running on a master node whereas all MR3 ContainerWorkers will be running on worker nodes. By following the instruction, the user will learn:
- how to install and configure Hive on MR3 in standalone mode
- how to start and stop Metastore with a MySQL compatible database
- how to start and stop HiveServer2
This scenario has the following prerequisites:
- Java 17 is available. We assume that Java 17 is installed in the same directory on every node.
- The user has access to a distributed storage such as HDFS or S3. In our example, we use S3.
- The user has access to the MySQL server with all privileges.
In our example, we use a MySQL server for Metastore, but Postgres and MS SQL are also okay.
This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release.
For asking any questions, please visit MR3 Google Group or join MR3 Slack.
Overview
Hive on MR3 uses the following components: Metastore, HiveServer2, MR3 DAGAppMaster, and MR3 ContainerWorkers.
In our example, we use a cluster of 5 nodes.
orange1
is the master node where Metastore, HiveServer2, and MR3 DAGAppMaster will be running,
and orange2
to orange5
are worker nodes where MR3 ContainerWorkers will be running.
192.168.10.1 orange1 # master node
192.168.10.2 orange2 # worker node
192.168.10.3 orange3 # worker node
192.168.10.4 orange4 # worker node
192.168.10.5 orange5 # worker node
The MR3 distribution for standalone mode uses Java 17.
In our example, Java 17 is installed in the directory /home/hive/jdk17/
.
Hive on MR3 requires three types of storage:
- Data source such as HDFS or S3
- Distributed storage for storing transient data
- Local directories on worker nodes for storing intermediate data
In our example, we use S3 for both data source and distributed storage for storing transient data.
For data source, we use the S3 bucket s3a://hivemr3/warehouse
.
For storing transient data, we use the S3 bucket s3a://hivemr3/scratch
.
We require that every worker node has an identical set of local directories
for storing intermediate data.
In our example, worker nodes use a local directory /data1/k8s
.
Accessing MySQL
In our example, we assume that a MySQL user root
has access to a MySQL database with all privileges.
Later we will configure Metastore so that it connects to the MySQL server using the user name root
.
The MySQL server may run on any node, not necessarily on the node where Metastore or HiveServer2 will be running.
$ mysql -h 192.168.10.1 -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 214981
Server version: 5.7.40-0ubuntu0.18.04.1 (Ubuntu)
...
mysql> SHOW GRANTS FOR 'root'@'%';
+-------------------------------------------------------------+
| Grants for root@% |
+-------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' WITH GRANT OPTION |
+-------------------------------------------------------------+
1 row in set (0.00 sec)
Installation
Download an MR3 distribution for standalone mode on the master node and uncompress it.
A pre-built MR3 release is intended for evaluating Hive on MR3 with a limit on the aggregate capacity of MR3 workers. For running Hive on MR3 in production, contact us.
$ hostname
orange1
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-java17-hive3.1.3-process.tar.gz
$ gunzip -c hivemr3-1.11-java17-hive3.1.3-process.tar.gz | tar xvf -
Copy the MR3 distribution to every worker node and uncompress it. In our example, passwordless connection to worker nodes is already set up.
$ for i in {2..5}; do scp hivemr3-1.11-java17-hive3.1.3-process.tar.gz hive@orange$i:/home/hive/; done
$ for i in {2..5}; do ssh orange$i "cd /home/hive/; gunzip -c hivemr3-1.11-java17-hive3.1.3-process.tar.gz | tar xvf -;"; done
Change to the directory process/hive
under the installation.
$ cd hivemr3-1.11-java17-hive3.1.3-process/process/hive
Setting environment variables
The first step in configuring Hive on MR3 is to set environment variables in env.sh
.
env.sh
is read not only by HiveServer2 (running on the master node)
but also by ContainerWorkers (running on worker nodes).
$ vi env.sh
export JAVA_HOME=/home/hive/jdk17/
- Set
JAVA_HOME
to the install directory of Java which should be valid on every node.
$ vi env.sh
HIVE_DATABASE_HOST=192.168.10.1
HIVE_DATABASE_NAME=hive3mr3
HIVE_WAREHOUSE_DIR=s3a://hivemr3/warehouse
HIVE_SCRATCH_DIR=s3a://hivemr3/scratch
- Set
HIVE_DATABASE_HOST
to the IP address of the MySQL server for Metastore. In our example, the MySQL server is running at the IP address192.168.10.1
. - Set
HIVE_DATABASE_NAME
to the database name for Metastore in the MySQL server. - Set
HIVE_WAREHOUSE_DIR
to the S3 bucket (or the HDFS directory) storing the warehouse. Note that for using S3, we should use prefixs3a
, nots3
. - Set
HIVE_SCRATCH_DIR
to the S3 bucket (or the HDFS directory) for storing transient data. If an NFS volume is mounted in the same directory on every node,HIVE_SCRATCH_DIR
may be set to the mount point (e.g.,/home/nfs/hivemr3
).
$ vi env.sh
HIVE_METASTORE_HEAPSIZE=16384
HIVE_METASTORE_DB_TYPE=mysql
HIVE_SERVER2_HEAPSIZE=36864
MR3_AM_HEAPSIZE=32768
HIVE_METASTORE_HEAPSIZE
specifies the memory size (in MB) of Metastore.- Set
HIVE_METASTORE_DB_TYPE
to the database type for Metastore which is used as an argument toschematool
(mysql
for MYSQL,postgres
for Postgres,mssql
for MS SQL). HIVE_SERVER2_HEAPSIZE
specifies the memory size (in MB) of HiveServer2.MR3_AM_HEAPSIZE
specifies the memory size (in MB) of DAGAppMaster.
The following environment variables are read by ContainerWorkers
(which implies that env.sh
will be copied to worker nodes later).
$ vi env.sh
export PROCESS_CONTAINER_WORKER_SECRET=worker-secret
export PROCESS_CONTAINER_WORKER_SERVER_HOST=192.168.10.1
export PROCESS_CONTAINER_WORKER_MEMORY_MB=32768
export PROCESS_CONTAINER_WORKER_CPU_CORES=8
export PROCESS_CONTAINER_WORKER_MEMORY_XMX=26214
export PROCESS_CONTAINER_WORKER_LOCAL_DIRS=/data1/k8s
PROCESS_CONTAINER_WORKER_SECRET
specifies a string to be used as a secret for communicating to DAGAppMaster.PROCESS_CONTAINER_WORKER_SERVER_HOST
specifies the IP address of DAGAppMaster. In our example, it is set to the IP address of the master nodeorange1
.PROCESS_CONTAINER_WORKER_MEMORY_MB
specifies the memory size (in MB) to be assumed by each ContainerWorker. That is, a ContainerWorker assumes that it may use as much memory as specified byPROCESS_CONTAINER_WORKER_MEMORY_MB
.PROCESS_CONTAINER_WORKER_CPU_CORES
specifies the number of cores to be assumed by each ContainerWorker. That is, a ContainerWorker assumes that it may use as many cores as specified byPROCESS_CONTAINER_WORKER_CPU_CORES
.PROCESS_CONTAINER_WORKER_MEMORY_XMX
specifies the argument for the Java-Xmx
option.That is, it specifies the size of heap memory to be allocated to each ContainerWorker. The user should setPROCESS_CONTAINER_WORKER_MEMORY_XMX
to a fraction (e.g., 0.8) ofPROCESS_CONTAINER_WORKER_MEMORY_MB
. In our example, each ContainerWorker starts with Java option-Xmx26214
.PROCESS_CONTAINER_WORKER_LOCAL_DIRS
should be set to a (comma-separated) list of local directories for storing intermediate data in ContainerWorkers.
Configuring S3 (optional)
For accessing S3-compatible storage,
additional configuration keys should be set in conf/core-site.xml
.
Open conf/core-site.xml
and set configuration keys for S3.
The configuration key fs.s3a.endpoint
should be set to point to the storage server.
$ vi conf/core-site.xml
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>com.amazonaws.auth.EnvironmentVariableCredentialsProvider</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>http://192.168.10.100:9000</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>
The user may need to change the parameters for accessing S3
to avoid SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
.
For more details, see Troubleshooting.
When using
the class EnvironmentVariableCredentialsProvider
to read AWS credentials,
two environment variables
AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
in env.sh
should be set to the access ID and the password, respectively.
$ vi env.sh
export AWS_ACCESS_KEY_ID=_your_aws_access_key_id_
export AWS_SECRET_ACCESS_KEY=_your_aws_secret_secret_key_
Distributing env.sh
to worker nodes
As it is also read by ContainerWorkers,
copy env.sh
to the same direction under the installation on every worker node.
$ for i in {2..5}; do scp env.sh hive@orange$i:/home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive; done
We do not need to copy configuration files such as hive-site.xml
to worker nodes
because they are passed directly from MR3 DAGAppMaster.
Resources for mappers and reducers
Change resources to be allocated to each mapper, reducer, and ContainerWorker by updating conf/hive-site.xml
.
In particular,
the configuration keys hive.mr3.all-in-one.containergroup.memory.mb
and hive.mr3.all-in-one.containergroup.vcores
should be adjusted to match
PROCESS_CONTAINER_WORKER_MEMORY_MB
and PROCESS_CONTAINER_WORKER_CPU_CORES
in env.sh
.
hive.mr3.use.daemon.shufflehandler
can be set to the same value as hive.mr3.all-in-one.containergroup.vcores
.
$ vi conf/hive-site.xml
<property>
<name>hive.mr3.map.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.map.task.vcores</name>
<value>1</value>
</property>
<property>
<name>hive.mr3.reduce.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.reduce.task.vcores</name>
<value>1</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.memory.mb</name>
<value>32768</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.vcores</name>
<value>8</value>
</property>
<property>
<name>hive.mr3.use.daemon.shufflehandler</name>
<value>8</value>
</property>
For more details, see Performance Tuning.
Configuring Metastore
Open conf/hive-site.xml
and update configurations for Metastore as necessary.
Below we list some of configuration keys that the user should check.
The two configuration keys javax.jdo.option.ConnectionUserName
and javax.jdo.option.ConnectionPassword
should match
the user name and password of the MySQL server for Metastore.
For simplicity, we disable security on the Metastore side.
$ vi conf/hive-site.xml
<property>
<name>hive.metastore.db.type</name>
<value>MYSQL</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>passwd</value>
</property>
<property>
<name>hive.metastore.pre.event.listeners</name>
<value></value>
</property>
<property>
<name>metastore.pre.event.listeners</name>
<value></value>
</property>
Configuring HiveServer2
Check the configuration for authentication and authorization:
$ vi conf/hive-site.xml
<property>
<name>hive.security.authenticator.manager</name>
<value>org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator</value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory</value>
</property>
Running Metastore and HiveServer2
The user can start Metastore by executing the script hive/metastore-service.sh
.
When using MySQL for Metastore,
the script automatically downloads a MySQL connector
from https://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-8.0.28.tar.gz
.
After Metastore starts,
the user can start HiveServer2 by executing the script hive/hiveserver2-service.sh
.
$ hive/metastore-service.sh start --init-schema
...
Output directory: /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/run-result/metastore/hivemr3-2024-01-05-17-15-43-f1ee0af8
Starting Metastore...
$ hive/hiveserver2-service.sh start --localprocess
...
Output directory: /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/run-result/hiveserver2/hivemr3-2024-01-05-17-15-47-0a8ef05e
Starting HiveServer2...
- Use
--init-schema
option to initialize the database schema when running Metastore for the first time. Do not use it in order to reuse existing Hive databases. - Use
--localprocess
option to run DAGAppMaster as a separate process in LocalProcess mode. Without it, DAGAppMaster runs as a thread inside the HiveServer2 process.
After a while, a DAGAppMaster process is created and the user can find its log.
$ tail -f /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/run-result/hiveserver2/am-local-log-dir/application_221851685348155540_0001/run.log
...
2024-01-05 17:20:00,380 [DAGAppMaster-1-6] INFO HeartbeatHandler$ [] - Timeout check in HeartbeatHandler:Container
Running ContainerWorkers
In standalone mode, the user should execute ContainerWorkers manually. For example, the user can create a script for executing ContainerWorkers.
$ vi run.sh
#!/bin/bash
$ for i in {2..5}; do
ssh orange$i "cd /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/; ./run-worker.sh;" &
done
ContainerWorkers are not immediately registered in DAGAppMaster, as shown in the log of DAGAppMaster:
$ tail /home/hive/hivemr3-1.11-java17-hive3.1.3-process/process/hive/hive/run-result/hiveserver2/am-local-log-dir/application_221851685348155540_0001/run.log
...
2024-01-05 17:53:26,523 [IPC Server handler 0 on default port 19666] INFO ProcessCommunicatorServerClient$ [] - tryRegisterProcess(): 23ceb70e-ac97-4eee-b254-f8bd021e343a, 192.168.10.2, 32768MB, 8 cores
Instead DAGAppMaster registers ContainerWorkers after the first query is submitted.
Running Beeline
The user may use any client program to connect to HiveServer2. In our example, we run Beeline included in the installation.
$ hive/run-beeline.sh
...
# Running Beeline using Hive-MR3 #
...
Connecting to jdbc:hive2://orange1:9852/;;;
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://orange1:9852/>
Now the user can send queries to HiveServer2 normally.
Stopping Hive on MR3
In standalone mode,
the user is responsible for starting and terminating
all of Metastore, HiveServer, DAGAppMaster, and ContainerWorkers.
Thus manually terminate processes with the command kill
.