This page shows how to operate Hive on MR3 on Amazon EMR with external Hive tables created from S3 buckets. Currently Amazon EMR supports only Hive 2.3.6, but we will use an MR3 release based on Hive 3.1.2 so that the user can take advantage of the superior performance of Hive 3. By following these instructions, the user will learn:
- how to install and configure Hive on MR3 on Amazon EMR
- how to start and stop Metastore with a Derby database
- how to start and stop HiveServer2
- how to load Hive tables from S3 buckets
- how to create Beeline connections and send queries to HiveServer2
This scenario has the following prerequisites:
- The user can create an EMR cluster and connect to the master node via ssh.
- The user has access to an S3 bucket containing the dataset for loading Hive tables.
- The user knows the table format corresponding to the dataset in the S3 bucket.
This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release and loading Hive tables. This page has been tested with MR3 release 1.0 on Amazon EMR 5.23.0 with m4.xlarge nodes.
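For reference, connecting to the master node typically looks like the following, assuming a key pair file mykey.pem and the master public DNS name shown in the EMR console (both are placeholders):

$ ssh -i mykey.pem hadoop@<master-public-dns>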
Installation
Create an EMR cluster. Since downloading an MR3 release and loading Hive tables can take a long time, we recommend starting the EMR cluster with just the master node and a single core node. After loading Hive tables, the user can attach more core/task nodes depending on the size of the dataset.
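For those who prefer the AWS CLI to the console, here is a minimal sketch for creating such a cluster; the cluster name and key pair name are placeholders, and --instance-count 2 yields one master node and one core node:

$ aws emr create-cluster --name "hive-mr3" \
    --release-label emr-5.23.0 \
    --applications Name=Hadoop \
    --ec2-attributes KeyName=mykey \
    --instance-type m4.xlarge \
    --instance-count 2 \
    --use-default-roles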
Connect to the master node. Download the pre-built MR3 release based on Hive 3 (corresponding to the --hivesrc3 option to be used later) and built specifically for Amazon EMR (its name ends with -emr.tar.gz), and uncompress it. Currently Hive on MR3 based on Hive 2 does not support Amazon EMR.
[hadoop@ip-172-31-19-1 ~]$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/hivemr3-1.0-hive3.1.2-emr.tar.gz
[hadoop@ip-172-31-19-1 ~]$ gunzip -c hivemr3-1.0-hive3.1.2-emr.tar.gz | tar xvf -
Rename the new directory to mr3-run and change the working directory.
[hadoop@ip-172-31-19-1 ~]$ mv hivemr3-1.0-hive3.1.2-emr/ mr3-run
[hadoop@ip-172-31-19-1 ~]$ cd mr3-run/
Creating directories on HDFS
Create the warehouse directory specified in env.sh.
[hadoop@ip-172-31-19-1 mr3-run]$ grep HIVE3_HDFS_WAREHOUSE env.sh
HIVE3_HDFS_WAREHOUSE=/user/hadoop/warehouse
[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -mkdir -p /user/hadoop/warehouse
Create a directory for storing MR3 and Tez jar files.
[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -mkdir /user/hadoop/lib
Upload the MR3 jar files to HDFS.
[hadoop@ip-172-31-19-1 mr3-run]$ mr3/upload-hdfslib-mr3.sh
Upload the Tez jar files to HDFS.
[hadoop@ip-172-31-19-1 mr3-run]$ tez/upload-hdfslib-tez.sh
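To verify the uploads, list the directory; the exact subdirectory layout may differ across releases:

[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -ls -R /user/hadoop/lib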
Open conf/cluster/hive3/hive-site.xml and check the following four configuration keys in order to make sure that an MR3 release built for Amazon EMR has been downloaded.
[hadoop@ip-172-31-19-1 mr3-run]$ vi conf/cluster/hive3/hive-site.xml
<property>
  <name>hive.users.in.admin.role</name>
  <value>hadoop</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>/home/hadoop/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-common-3.1.2.jar,/home/hadoop/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-server-3.1.2.jar,/home/hadoop/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-tez-3.1.2.jar</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hadoop</value>
</property>

<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/hadoop/operation_logs</value>
</property>
Make sure that /tmp/hadoop, the directory specified by hive.exec.scratchdir, does NOT exist on HDFS.
[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -ls /tmp/hadoop
ls: `/tmp/hadoop': No such file or directory
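If the directory already exists, e.g., left over from a previous run, remove it first:

[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -rm -r /tmp/hadoop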
Creating temporary directories
Create the directory specified by hive.server2.logging.operation.log.location on the local file system.
[hadoop@ip-172-31-19-1 mr3-run]$ mkdir -p /tmp/hadoop/operation_logs
Running Metastore
Run Metastore with a Derby database using the --cluster option, and initialize its schema using the --init-schema option.
[hadoop@ip-172-31-19-1 mr3-run]$ hive/metastore-service.sh start --cluster --hivesrc3 --init-schema
After a while, check if Metastore has successfully started.
[hadoop@ip-172-31-19-1 mr3-run]$ tail /home/hadoop/mr3-run/hive/metastore-service-result/hive-mr3--2020-02-03--14-24-21-c5ffbac7/out-metastore.txt
Initialization script completed
schemaTool completed
2020-02-03 14:24:29: Starting Hive Metastore Server
...
The log file for Metastore is found under /tmp/hadoop.
[hadoop@ip-172-31-19-1 mr3-run]$ ls /tmp/hadoop/hive.log
/tmp/hadoop/hive.log
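If Metastore fails to start, a quick first check is to search this log for errors (a generic sketch; the relevant messages vary):

[hadoop@ip-172-31-19-1 mr3-run]$ grep -i -e "error" -e "exception" /tmp/hadoop/hive.log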
Running HiveServer2
Run HiveServer2 using the --cluster option.
[hadoop@ip-172-31-19-1 mr3-run]$ hive/hiveserver2-service.sh start --cluster --hivesrc3
After a while, check if HiveServer2 has successfully started.
[hadoop@ip-172-31-19-1 mr3-run]$ grep -e "New MR3Session created" /home/hadoop/mr3-run/hive/hiveserver2-service-result/hive-mr3--2020-02-03--14-44-03-eea0e605/hive-logs/hive.log
2020-02-03T14:44:26,488 INFO [main] session.MR3SessionManagerImpl: New MR3Session created: 585280e3-9247-468b-9cc5-0b37e439098c, hadoop
The user can find a new YARN application.
[hadoop@ip-172-31-19-1 mr3-run]$ yarn application -list
20/02/03 14:45:15 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-19-1.ap-northeast-2.compute.internal/172.31.19.1:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1556891135763_0001 585280e3-9247-468b-9cc5-0b37e439098c mr3 hadoop default RUNNING UNDEFINED 0% N/A
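The application can be inspected further with its application ID:

[hadoop@ip-172-31-19-1 mr3-run]$ yarn application -status application_1556891135763_0001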
Loading Hive tables from an S3 bucket
In our example, we have prepared an S3 bucket containing a dataset in ORC format for the TPC-DS benchmark with a scale factor of 10 (about 10GB).
[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -ls s3://xxxxx-partitioned-10-orc/warehouse
Found 1 items
-rw-rw-rw- 1 hadoop hadoop 2132 2020-02-02 12:15 s3://xxxxx-partitioned-10-orc/warehouse/000000_0
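The same bucket can also be inspected with the AWS CLI:

[hadoop@ip-172-31-19-1 mr3-run]$ aws s3 ls s3://xxxxx-partitioned-10-orc/warehouse/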
A template SQL script for loading the dataset is included in the MR3 release. Substitute the name of the database and the S3 bucket for the placeholders DB and LOCATION in the template.
[hadoop@ip-172-31-19-1 mr3-run]$ sed -i -e 's/DB/tpcds_partitioned_10_orc/g' hive/benchmarks/hive-testbench/ddl-tpcds/load_external_tables_template.sql
[hadoop@ip-172-31-19-1 mr3-run]$ sed -i -e 's/LOCATION/xxxxx-partitioned-10-orc/g' hive/benchmarks/hive-testbench/ddl-tpcds/load_external_tables_template.sql
Load Hive tables. Depending on the size of the dataset, this can take a long time.
[hadoop@ip-172-31-19-1 mr3-run]$ hive/run-beeline.sh --cluster --hivesrc3 -f /home/hadoop/mr3-run/hive/benchmarks/hive-testbench/ddl-tpcds/load_external_tables_template.sql
After loading Hive tables, the user can attach more core/task nodes. In our example, we use three core nodes; the hdfs dfsadmin report shows the total cluster capacity followed by one line per DataNode.
[hadoop@ip-172-31-19-1 ~]$ hdfs dfsadmin -report | grep 'Configured Capacity'
Configured Capacity: 186735083520 (173.91 GB)
Configured Capacity: 62245027840 (57.97 GB)
Configured Capacity: 62245027840 (57.97 GB)
Configured Capacity: 62245027840 (57.97 GB)
Running Beeline
Run Beeline using the --cluster option.
[hadoop@ip-172-31-19-1 mr3-run]$ hive/run-beeline.sh --cluster --hivesrc3
...
Connecting to jdbc:hive2://ip-172-31-19-1:9832/;;
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
Use the new database.
0: jdbc:hive2://ip-172-31-19-1:9832/> show databases;
...
+---------------------------+
| database_name |
+---------------------------+
| default |
| tpcds_partitioned_10_orc |
+---------------------------+
2 rows selected (0.138 seconds)
0: jdbc:hive2://ip-172-31-19-1:9832/> use tpcds_partitioned_10_orc;
...
No rows affected (0.028 seconds)
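As a quick sanity check, list the tables created in the previous step (the TPC-DS schema defines 24 tables):

0: jdbc:hive2://ip-172-31-19-1:9832/> show tables;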
Execute queries. The complete set of 99 TPC-DS queries is found under hive/benchmarks/hive-testbench/sample-queries-tpcds-hive3.
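For example, a sample query can be executed from the Beeline prompt with the !run command (the file name query12.sql is for illustration):

0: jdbc:hive2://ip-172-31-19-1:9832/> !run /home/hadoop/mr3-run/hive/benchmarks/hive-testbench/sample-queries-tpcds-hive3/query12.sql

While a query runs, Beeline reports the progress of every Vertex, as in the following output: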
INFO : Map 1: -/- Map 5: 2/2 Map 6: 0(+1)/1 Reducer 2: 0/10 Reducer 3: 0/6 Reducer 4: 0/1
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 llap Running 1 0 1 0 0 0
Map 5 .......... llap Succeeded 2 2 0 0 0 0
Map 6 .......... llap Succeeded 1 1 0 0 0 0
Reducer 2 llap Ready 10 0 0 10 0 0
Reducer 3 llap Inited 6 0 0 6 0 0
Reducer 4 llap Inited 1 0 0 1 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/06 [===>>-----------------------] 14% ELAPSED TIME: 8.70 s
----------------------------------------------------------------------------------------------
Stopping HiveServer2
Stop HiveServer2.
[hadoop@ip-172-31-19-1 mr3-run]$ hive/hiveserver2-service.sh stop --cluster --hivesrc3
After a while, no YARN application is found.
[hadoop@ip-172-31-19-1 mr3-run]$ yarn application -list
20/02/03 15:19:04 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-19-1.ap-northeast-2.compute.internal/172.31.19.1:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
Stopping Metastore
Stop Metastore.
[hadoop@ip-172-31-28-236 mr3-run]$ hive/metastore-service.sh stop --cluster --hivesrc3
The user can check if Metastore has successfully stopped by reading its log file.
[hadoop@ip-172-31-28-236 mr3-run]$ tail -n3 /tmp/hadoop/hive.log
/************************************************************
SHUTDOWN_MSG: Shutting down HiveMetaStore at ip-172-31-28-236.ap-northeast-2.compute.internal/172.31.28.236
************************************************************/