This page shows how to operate Hive on MR3 in a non-secure Hadoop cluster without Kerberos. The same user, not necessarily an administrator of the Hadoop cluster, will run all the components of Hive on MR3, such as Metastore, HiveServer2, and Beeline. For running Metastore, we will use a Derby database which is included in the MR3 release. By following the instruction, the user will learn:
- how to install and configure Hive on MR3 in a non-secure Hadoop cluster without Kerberos
- how to start and stop Metastore with a Derby database
- how to start and stop HiveServer2
- how to create Beeline connections and send queries to HiveServer2
- how to generate TPC-DS datasets and execute Hive queries in the TPC-DS benchmark
- how to run multiple Map/Reduce Tasks concurrently in a single ContainerWorker
This scenario has the following prerequisites:
- Java 1.8 is available.
- A non-secure Hadoop cluster 2.7 or higher is available.
- The user has access to the home directory and
/tmp
directory on HDFS. mvn
,gcc
, andjavac
should be available in order to generate TPC-DS datasets.
This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release, generating a TPC-DS dataset, loading Hive tables, and computing column statistics. This page has been tested with MR3 release 1.0 on Hadoop 2.7 (HDP 2.6.4) and Hadoop 3.1 (HDP 3.0.1).
Installation
Download a pre-built MR3 release and uncompress it.
We choose the pre-built MR3 release based on Hive 3.1.2 which corresponds to --hivesrc3
option to be used later.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/hivemr3-1.0-hive3.1.2.tar.gz
$ gunzip -c hivemr3-1.0-hive3.1.2.tar.gz | tar xvf -;
Rename the new directory to mr3-run
and change the working directory.
Renaming the new directory is not strictly necessary,
but it is recommended because the sample configuration file hive-site.xml
included in the MR3 release uses the same directory name.
$ mv hivemr3-1.0-hive3.1.2/ mr3-run
$ cd mr3-run/
Configuring
Open env.sh
and set JAVA_HOME
and PATH
if necessary.
Set HADOOP_HOME
to the Hadoop installation directory.
$ vi env.sh
export JAVA_HOME=/usr/apps/java/default
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/hdp/2.6.4.0-91/hadoop
Set the following environment variables to adjust the memory size (in MB) to be allocated to each component:
HIVE_METASTORE_HEAPSIZE
specifies the memory size for Metastore.HIVE_SERVER2_HEAPSIZE
specifies the memory size for HiveServer2.HIVE_CLIENT_HEAPSIZE
specifies the memory size of HiveCLI (hive
command) and Beeline (beeline
command).MR3_AM_HEAPSIZE
specifies the memory size of MR3 DAGAppMaster.
In our example, we use the following values.
We increase the value for HIVE_CLIENT_HEAPSIZE
to 16384 (16GB) because we will run hive
command when generating TPC-DS datasets.
HIVE_METASTORE_HEAPSIZE=4096
HIVE_SERVER2_HEAPSIZE=16384
HIVE_CLIENT_HEAPSIZE=16384
MR3_AM_HEAPSIZE=10240
HIVE3_HDFS_WAREHOUSE
specifies the warehouse directory on HDFS.
Update it to use the current user name.
HIVE3_HDFS_WAREHOUSE=/user/gla/warehouse
Open conf/cluster/hive3/hive-site.xml
and set the following four configuration keys
according to the current user name (instead of the default user hive
) and the working directory
(instead of the default directory /home/hive
).
$ vi conf/cluster/hive3/hive-site.xml
<property>
<name>hive.users.in.admin.role</name>
<value>gla</value>
</property>
<property>
<name>hive.aux.jars.path</name>
<value>/home/gla/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-common-3.1.2.jar,/home/gla/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-server-3.1.2.jar,/home/gla/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-tez-3.1.2.jar</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/gla</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/gla/operation_logs</value>
</property>
The following configuration keys specify the resource to be allocated to a Map Task, a Reduce Task, or a ContainerWorker. By default, we allocate 4GB and a single core to a Map Task, a Reduce Task, and a ContainerWorker.
<property>
<name>hive.mr3.map.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.map.task.vcores</name>
<value>1</value>
</property>
<property>
<name>hive.mr3.reduce.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.reduce.task.vcores</name>
<value>1</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.vcores</name>
<value>1</value>
</property>
When updating these configuration keys, we should meet the following requirements:
hive.mr3.map.task.memory.mb
<=hive.mr3.all-in-one.containergroup.memory.mb
hive.mr3.map.task.vcores
<=hive.mr3.all-in-one.containergroup.vcores
hive.mr3.reduce.task.memory.mb
<=hive.mr3.all-in-one.containergroup.memory.mb
hive.mr3.reduce.task.vcores
<=hive.mr3.all-in-one.containergroup.vcores
For simplicity,
disable impersonation by setting hive.server2.enable.doAs
to false.
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
Creating directories on HDFS
Create the warehouse directory specified in env.sh
.
$ grep HIVE3_HDFS_WAREHOUSE env.sh
HIVE3_HDFS_WAREHOUSE=/user/gla/warehouse
$ hdfs dfs -mkdir -p /user/gla/warehouse
Create a directory for storing MR3 and Tez jar files.
$ hdfs dfs -mkdir /user/gla/lib
Load MR3 jar files.
$ mr3/upload-hdfslib-mr3.sh
Load Tez jar files.
$ tez/upload-hdfslib-tez.sh
Make sure that /tmp/gla
does NOT exist on HDFS.
$ hdfs dfs -ls /tmp/gla
ls: `/tmp/gla': No such file or directory
If the directory already exists (e.g., when running Hive on MR3 for the second time), make sure that its permission is set to 733. HiveServer2 automatically creates a new directory with permission 733 if it does not exist.
$ hdfs dfs -ls /tmp/ | grep gla
drwx-wx-wx - gla hdfs 0 2020-02-03 00:05 /tmp/gla
Creating temporary directories
Create a new directory specified by hive.server2.logging.operation.log.location
.
$ ls -alt /tmp/gla/operation_logs
ls: cannot access /tmp/gla/operation_logs: No such file or directory
$ mkdir -p /tmp/gla/operation_logs
Running Metastore
Metastore uses the port specified by the environment variable HIVE3_METASTORE_PORT
in env.sh
.
Make sure that the port is not in use.
$ grep HIVE3_METASTORE_PORT env.sh
HIVE3_METASTORE_PORT=9830
$ ss -anpt | grep -E "LISTEN.*:9830"
$
Run Metastore with a Derby database using --cluster
option, and initialize it using --init-schema
option.
$ hive/metastore-service.sh start --cluster --hivesrc3 --init-schema
After a while, check if Metastore has successfully started.
$ tail /home/gla/mr3-run/hive/metastore-service-result/hive-mr3--2020-02-02--23-59-55-6775428c/out-metastore.txt
Initialization script completed
schemaTool completed
2020-02-03 00:00:04: Starting Hive Metastore Server
...
When the user restarts Metastore, do not use --init-schema
option in order to reuse existing Hive databases.
For example, the user can kill Metastore and restart it as follows.
(If the following command fails to kill Metastore for some reason, manually kill with kill -9
.)
$ hive/metastore-service.sh stop --cluster --hivesrc3
$ hive/metastore-service.sh start --cluster --hivesrc3
The log file for Metastore is found under /tmp/gla
.
$ ls /tmp/gla/hive.log
/tmp/gla/hive.log
Running HiveServer2
Run HiveServer2 using --cluster
option.
In order to use LocalProcess mode for MR3 DAGAppMaster, use --amprocess
option.
In our example, we use --amprocess
option.
$ hive/hiveserver2-service.sh start --cluster --hivesrc3 --amprocess
After a while, check if HiveServer2 has successfully started by inspecting its log file.
$ grep -e "New MR3Session created" /home/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3--2020-02-03--00-07-12-5c0c29fb/hive-logs/hive.log
2020-02-03T00:07:31,805 INFO [main] session.MR3SessionManagerImpl: New MR3Session created: 76cc9479-0b74-4d17-a7f2-64d671f803d4, gla
The user can find a new Yarn application of type mr3
submitted by the user.
$ yarn application -list
20/02/03 00:08:06 INFO client.RMProxy: Connecting to ResourceManager at gold0/10.1.90.9:8050
20/02/03 00:08:06 INFO client.AHSProxy: Connecting to Application History server at gold0/10.1.90.9:10200
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1548163665378_0142 76cc9479-0b74-4d17-a7f2-64d671f803d4 mr3 gla default RUNNING UNDEFINED 0% N/A
The user can also find the process for MR3 DAGAppMaster.
$ ps -ef | grep DAGAppMaster | grep mr3
gla 6708 6528 10 00:07 pts/2 00:00:11 /usr/jdk64/jdk1.8.0_112/jre/bin/java -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -Xmx8192m -Dlog4j.configurationFile=mr3-container-log4j2.properties -Dlog4j.configuratorClass=com.datamonad.mr3.MR3Log4jConfigurator -Dmr3.root.logger=INFO -Dyarn.app.container.log.dir=/home/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3--2020-02-03--00-07-12-5c0c29fb/application_1548163665378_0142 -Dsun.nio.ch.bugLevel='' com.datamonad.mr3.master.DAGAppMaster --session
The user can find the log file for MR3 DAGAppMaster.
$ ls /home/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3--2020-02-03--00-07-12-5c0c29fb/application_1548163665378_0142/run.log
/home/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3--2020-02-03--00-07-12-5c0c29fb/application_1548163665378_0142/run.log
Running queries
Download a sample dataset.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/pokemon.csv
Run Beeline using --cluster
option.
$ hive/run-beeline.sh --cluster --hivesrc3
Use the default database.
0: jdbc:hive2://gold0:9832/> use default;
Create a table called pokemon
.
0: jdbc:hive2://gold0:9832/> CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");
Import the sample dataset.
0: jdbc:hive2://gold0:9832/> load data local inpath './pokemon.csv' INTO table pokemon;
Execute queries.
0: jdbc:hive2://gold0:9832/> select avg(HP) from pokemon;
...
0: jdbc:hive2://gold0:9832/> create table pokemon1 as select *, IF(HP>160.0,'strong',IF(HP>140.0,'moderate','weak')) AS power_rate from pokemon;
...
0: jdbc:hive2://gold0:9832/> select COUNT(name), power_rate from pokemon1 group by power_rate;
...
Exit Beeline. The warehouse directory on HDFS has now two sub-directories.
$ hdfs dfs -ls /user/gla/warehouse/
Found 2 items
drwxr-xr-x - gla hdfs 0 2020-02-03 00:40 /user/gla/warehouse/pokemon
drwxr-xr-x - gla hdfs 0 2020-02-03 00:41 /user/gla/warehouse/pokemon1
Stop HiveServer2. MR3 DAGAppMaster also stops.
$ hive/hiveserver2-service.sh stop --cluster --hivesrc3
Generating a TPC-DS dataset
Open env.sh
and set HIVE_DS_SCALE_FACTOR
to specify a scale factor for the TPC-DS benchmark.
In our example, we use 100 for a dataset of 100GB.
$ vi env.sh
HIVE_DS_SCALE_FACTOR=100
The script for generating TPC-DS datasets creates sub-directories under /tmp/tpcds-generate/
on HDFS.
Make sure that the directory does not exist.
$ hdfs dfs -ls /tmp/tpcds-generate
ls: `/tmp/tpcds-generate': No such file or directory
It it already exists, make sure that the user has write permission so that the script can create sub-directories.
If a sub-directory /tmp/tpcds-generate/100
(corresponding to a scale factor of 100) already exists,
it should be created by the same script.
Generate a TPC-DS dataset and load Hive tables. Depending on the scale factor and the size of the cluster, it may take up to a few hours.
$ hive/gen-tpcds.sh --cluster --hivesrc3
The raw data is created in the directory /tmp/tpcds-generate/100
on HDFS.
$ hdfs dfs -du -h /tmp/tpcds-generate
96.3 G /tmp/tpcds-generate/100
The warehouse directory on HDFS now has two new sub-directories.
/user/gla/warehouse/tpcds_text_100.db
is empty because it corresponds to external Hive tables pointing to the raw data in /tmp/tpcds-generate/100
.
$ hdfs dfs -du -h /user/gla/warehouse
44.2 K /user/gla/warehouse/pokemon
36.9 K /user/gla/warehouse/pokemon1
30.1 G /user/gla/warehouse/tpcds_bin_partitioned_orc_100.db
0 /user/gla/warehouse/tpcds_text_100.db
Running TPC-DS queries
Run HiveServer2 using --cluster
option as well as --amprocess
option.
$ hive/hiveserver2-service.sh start --cluster --hivesrc3 --amprocess
Run Beeline using --cluster
option.
$ hive/run-beeline.sh --cluster --hivesrc3
The user can now find two new databases available.
0: jdbc:hive2://gold0:9832/> show databases;
INFO : Compiling command(queryId=gla_20190507030245_8d1e5560-f4c4-4901-8153-f01759aedfc9): show databases
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=gla_20190507030245_8d1e5560-f4c4-4901-8153-f01759aedfc9); Time taken: 1.487 seconds
INFO : Executing command(queryId=gla_20190507030245_8d1e5560-f4c4-4901-8153-f01759aedfc9): show databases
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=gla_20190507030245_8d1e5560-f4c4-4901-8153-f01759aedfc9); Time taken: 0.158 seconds
INFO : OK
+--------------------------------+
| database_name |
+--------------------------------+
| default |
| tpcds_bin_partitioned_orc_100 |
| tpcds_text_100 |
+--------------------------------+
3 rows selected (2.004 seconds)
Use tpcds_bin_partitioned_orc_100
.
0: jdbc:hive2://gold0:9832/> use tpcds_bin_partitioned_orc_100;
Execute queries.
The complete set of 99 TPC-DS queries is found under hive/benchmarks/hive-testbench/sample-queries-tpcds-hive2
.
Most of the queries should run okay.
There are, however, a few queries that may fail occasionally.
As an example, try query 4 (hive/benchmarks/hive-testbench/sample-queries-tpcds-hive2/query4.sql
).
The execution may fail with java.lang.OutOfMemoryError
, even with the mechanism of query re-execution in Hive 3.
This is because column statistics of Hive tables are not available yet.
Compute column statistics of Hive tables. Depending on the scale factor and the size of the cluster, it may take a long time (but much less than generating the TPC-DS dataset).
0: jdbc:hive2://gold0:9832/> analyze table call_center compute statistics for columns; analyze table catalog_page compute statistics for columns; analyze table catalog_returns compute statistics for columns; analyze table catalog_sales compute statistics for columns; analyze table customer compute statistics for columns; analyze table customer_address compute statistics for columns; analyze table customer_demographics compute statistics for columns; analyze table date_dim compute statistics for columns; analyze table household_demographics compute statistics for columns; analyze table income_band compute statistics for columns; analyze table inventory compute statistics for columns; analyze table item compute statistics for columns; analyze table promotion compute statistics for columns; analyze table reason compute statistics for columns; analyze table ship_mode compute statistics for columns; analyze table store compute statistics for columns; analyze table store_returns compute statistics for columns; analyze table store_sales compute statistics for columns; analyze table time_dim compute statistics for columns; analyze table warehouse compute statistics for columns; analyze table web_page compute statistics for columns; analyze table web_returns compute statistics for columns; analyze table web_sales compute statistics for columns; analyze table web_site compute statistics for columns;
...
Execute queries. With column statistics available, every query should run more reliably.
Stop HiveServer2.
$ hive/hiveserver2-service.sh stop --cluster --hivesrc3
Running multiple Map/Reduce Tasks concurrently in a single ContainerWorker
With the default configuration, a ContainerWorker can run one Map or Reduce Task at a time. By allocating more resources, however, we can run multiple Map and Reduce Tasks concurrently inside the same ContainerWorker.
Open conf/cluster/hive3/hive-site.xml
.
Increase the memory size and the number of cores to be allocated to each ContainerWorker.
In our example, we increase the memory size to 80GB for each ContainerWorker.
Make sure that the configuration key hive.mr3.container.combine.taskattempts
is set to true.
$ vi conf/cluster/hive3/hive-site.xml
<property>
<name>hive.mr3.container.combine.taskattempts</name>
<value>true</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.memory.mb</name>
<value>81920</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.vcores</name>
<value>20</value>
</property>
Run HiveServer2 using --cluster
option as well as --amprocess
option.
$ hive/hiveserver2-service.sh start --cluster --hivesrc3 --amprocess
Run Beeline using --cluster
option.
$ hive/run-beeline.sh --cluster --hivesrc3
Use tpcds_bin_partitioned_orc_100
.
0: jdbc:hive2://gold0:9832/> use tpcds_bin_partitioned_orc_100;
Execute queries. The user can check that Yarn has created much fewer containers.
Stopping HiveServer2 and Metastore
Stop HiveServer2.
$ hive/hiveserver2-service.sh stop --cluster --hivesrc3
Stop Metastore.
$ hive/metastore-service.sh stop --cluster --hivesrc3
The user can check if Metastore has successfully stopped by reading its log file.
$ tail -n3 /tmp/gla/hive.log
/************************************************************
SHUTDOWN_MSG: Shutting down HiveMetaStore at gold0/10.1.90.9
************************************************************/