This page shows how to operate Hive on MR3 on Amazon EMR with external Hive tables created from S3 buckets. Currently Amazon EMR supports only Hive 2.3.6, but we will use an MR3 release based on Hive 3.1.2 so that the user can take advantage of the superior performance of Hive 3. By following these instructions, the user will learn:

  1. how to install and configure Hive on MR3 on Amazon EMR
  2. how to start and stop Metastore with a Derby database
  3. how to start and stop HiveServer2
  4. how to load Hive tables from S3 buckets
  5. how to create Beeline connections and send queries to HiveServer2

This scenario has the following prerequisites:

  1. The user can create an EMR cluster and connect to the master node via ssh.
  2. The user has access to an S3 bucket containing the dataset for loading Hive tables.
  3. The user knows the table format corresponding to the dataset in the S3 bucket.

This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release and loading Hive tables. This page has been tested with MR3 release 1.0 on Amazon EMR 5.23.0 with m4.xlarge nodes.

Installation

Create an EMR cluster. Since downloading an MR3 release and loading Hive tables can take a long time, we recommend that the user start the EMR cluster with the master node and a single core node. After loading Hive tables, the user can attach more core/task nodes depending on the size of the dataset.

Connect to the master node. Download the pre-built MR3 release based on Hive 3 (corresponding to the --hivesrc3 option to be used later) built specifically for Amazon EMR (its name ends with -emr.tar.gz), and uncompress it. Currently Hive on MR3 based on Hive 2 does not support Amazon EMR.

[hadoop@ip-172-31-19-1 ~]$ wget https://github.com/mr3project/mr3-release/releases/download/v1.0/hivemr3-1.0-hive3.1.2-emr.tar.gz
[hadoop@ip-172-31-19-1 ~]$ gunzip -c hivemr3-1.0-hive3.1.2-emr.tar.gz | tar xvf -

Rename the new directory to mr3-run and change the working directory.

[hadoop@ip-172-31-19-1 ~]$ mv hivemr3-1.0-hive3.1.2-emr/ mr3-run
[hadoop@ip-172-31-19-1 ~]$ cd mr3-run/

Creating directories on HDFS

Create the warehouse directory specified in env.sh.

[hadoop@ip-172-31-19-1 mr3-run]$ grep HIVE3_HDFS_WAREHOUSE env.sh
HIVE3_HDFS_WAREHOUSE=/user/hadoop/warehouse
[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -mkdir -p /user/hadoop/warehouse
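For scripting, the warehouse path can be read out of env.sh rather than copied by hand. A minimal sketch, assuming env.sh assigns HIVE3_HDFS_WAREHOUSE on a single unquoted line as shown above:

```shell
#!/bin/sh
# Read the warehouse path from env.sh.
# Assumes a line of the form HIVE3_HDFS_WAREHOUSE=/user/hadoop/warehouse.
warehouse=$(grep '^HIVE3_HDFS_WAREHOUSE=' env.sh 2>/dev/null | cut -d= -f2)
echo "warehouse directory: ${warehouse}"
# Then create it on HDFS (requires a running cluster):
#   hdfs dfs -mkdir -p "${warehouse}"
```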

Create a directory for storing MR3 and Tez jar files.

[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -mkdir /user/hadoop/lib

Upload the MR3 jar files to HDFS.

[hadoop@ip-172-31-19-1 mr3-run]$ mr3/upload-hdfslib-mr3.sh

Upload the Tez jar files to HDFS.

[hadoop@ip-172-31-19-1 mr3-run]$ tez/upload-hdfslib-tez.sh

Open conf/cluster/hive3/hive-site.xml and check the following four configuration keys to make sure that an MR3 release built for Amazon EMR has been downloaded.

[hadoop@ip-172-31-19-1 mr3-run]$ vi conf/cluster/hive3/hive-site.xml 

<property>
  <name>hive.users.in.admin.role</name>
  <value>hadoop</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>/home/hadoop/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-common-3.1.2.jar,/home/hadoop/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-server-3.1.2.jar,/home/hadoop/mr3-run/hive/hivejar/apache-hive-3.1.2-bin/lib/hive-llap-tez-3.1.2.jar</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hadoop</value>
</property>

<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/hadoop/operation_logs</value>
</property>

Make sure that /tmp/hadoop does NOT exist on HDFS.

[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -ls /tmp/hadoop
ls: `/tmp/hadoop': No such file or directory
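The four configuration keys can also be checked in one pass from the command line. A rough sketch, assuming each <name> element is immediately followed by its <value> on the next line, as in the hive-site.xml shipped with the release:

```shell
#!/bin/sh
# Print the value of each configuration key to be verified.
for key in hive.users.in.admin.role hive.aux.jars.path \
           hive.exec.scratchdir hive.server2.logging.operation.log.location; do
  value=$(grep -A1 "<name>${key}</name>" conf/cluster/hive3/hive-site.xml 2>/dev/null \
          | sed -n 's/.*<value>\(.*\)<\/value>.*/\1/p')
  echo "${key} = ${value}"
done
```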

Creating temporary directories

Create the directory specified by hive.server2.logging.operation.log.location on the local file system.

[hadoop@ip-172-31-19-1 mr3-run]$ mkdir -p /tmp/hadoop/operation_logs

Running Metastore

Run Metastore with a Derby database using the --cluster option, and initialize its schema using the --init-schema option (necessary only when starting Metastore for the first time).

[hadoop@ip-172-31-19-1 mr3-run]$ hive/metastore-service.sh start --cluster --hivesrc3 --init-schema

After a while, check if Metastore has successfully started.

[hadoop@ip-172-31-19-1 mr3-run]$ tail /home/hadoop/mr3-run/hive/metastore-service-result/hive-mr3--2020-02-03--14-24-21-c5ffbac7/out-metastore.txt 


Initialization script completed
schemaTool completed
2020-02-03 14:24:29: Starting Hive Metastore Server
...

The log file for Metastore is found under /tmp/hadoop.

[hadoop@ip-172-31-19-1 mr3-run]$ ls /tmp/hadoop/hive.log 
/tmp/hadoop/hive.log
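Rather than checking "after a while" by hand, startup can be detected by polling the log for the startup message. A hypothetical helper (the marker string matches the output shown above; adjust the timeout to taste):

```shell
#!/bin/sh
# Poll a log file until a marker line appears, or give up after a timeout
# (in seconds). Returns 0 on success, 1 on timeout.
wait_for_log() {
  logfile=$1; marker=$2; timeout=${3:-60}
  i=0
  while [ "$i" -lt "$timeout" ]; do
    if grep -q "$marker" "$logfile" 2>/dev/null; then
      echo "found: $marker"; return 0
    fi
    sleep 1; i=$((i + 1))
  done
  echo "timed out waiting for: $marker" >&2; return 1
}

# Example: wait up to 120s for Metastore to report startup:
#   wait_for_log /tmp/hadoop/hive.log "Starting Hive Metastore Server" 120
```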

Running HiveServer2

Run HiveServer2 using the --cluster option.

[hadoop@ip-172-31-19-1 mr3-run]$ hive/hiveserver2-service.sh start --cluster --hivesrc3

After a while, check if HiveServer2 has successfully started.

[hadoop@ip-172-31-19-1 mr3-run]$ grep -e "New MR3Session created" /home/hadoop/mr3-run/hive/hiveserver2-service-result/hive-mr3--2020-02-03--14-44-03-eea0e605/hive-logs/hive.log 
2020-02-03T14:44:26,488  INFO [main] session.MR3SessionManagerImpl: New MR3Session created: 585280e3-9247-468b-9cc5-0b37e439098c, hadoop

The user can find a new YARN application.

[hadoop@ip-172-31-19-1 mr3-run]$ yarn application -list
20/02/03 14:45:15 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-19-1.ap-northeast-2.compute.internal/172.31.19.1:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id      Application-Name      Application-Type        User       Queue               State         Final-State         Progress                        Tracking-URL
application_1556891135763_0001  585280e3-9247-468b-9cc5-0b37e439098c                   mr3      hadoop     default             RUNNING           UNDEFINED               0%                                 N/A
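When scripting, the application id of the MR3 session can be scraped from this output. A sketch, assuming the default column layout of yarn application -list and an application name without spaces:

```shell
#!/bin/sh
# Print the application id of the running MR3 session, if any.
# In the default output, column 3 is Application-Type and column 6 is State.
app_id=$(yarn application -list 2>/dev/null \
         | awk '$3 == "mr3" && $6 == "RUNNING" { print $1 }')
echo "MR3 application: ${app_id}"
```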

Loading Hive tables from an S3 bucket

In our example, we have prepared an S3 bucket containing a dataset in ORC format for the TPC-DS benchmark with a scale factor of 10 (about 10GB).

[hadoop@ip-172-31-19-1 mr3-run]$ hdfs dfs -ls s3://xxxxx-partitioned-10-orc/warehouse
Found 1 items
-rw-rw-rw-   1 hadoop hadoop       2132 2020-02-02 12:15 s3://xxxxx-partitioned-10-orc/warehouse/000000_0

The SQL script for loading the dataset is included in the MR3 release.

[hadoop@ip-172-31-19-1 mr3-run]$ sed -i -e 's/DB/tpcds_partitioned_10_orc/g' hive/benchmarks/hive-testbench/ddl-tpcds/load_external_tables_template.sql
[hadoop@ip-172-31-19-1 mr3-run]$ sed -i -e 's/LOCATION/xxxxx-partitioned-10-orc/g' hive/benchmarks/hive-testbench/ddl-tpcds/load_external_tables_template.sql
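Note that sed -i modifies the template in place, so it can be filled in only once. A non-destructive variant (a sketch; the output file name is arbitrary) writes the result to a new file so that the template can be reused with other databases or buckets:

```shell
#!/bin/sh
# Fill in the DB and LOCATION placeholders without touching the template.
TEMPLATE=hive/benchmarks/hive-testbench/ddl-tpcds/load_external_tables_template.sql
[ -f "$TEMPLATE" ] && sed -e 's/DB/tpcds_partitioned_10_orc/g' \
    -e 's/LOCATION/xxxxx-partitioned-10-orc/g' \
    "$TEMPLATE" > load_tpcds_partitioned_10_orc.sql
```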

Load Hive tables. Depending on the size of the dataset, this step can take a long time.

[hadoop@ip-172-31-19-1 mr3-run]$ hive/run-beeline.sh --cluster --hivesrc3 -f /home/hadoop/mr3-run/hive/benchmarks/hive-testbench/ddl-tpcds/load_external_tables_template.sql 

After loading Hive tables, the user can attach more core/task nodes. In our example, we use three core nodes.

[hadoop@ip-172-31-19-1 ~]$ hdfs dfsadmin -report | grep 'Configured Capacity'
Configured Capacity: 186735083520 (173.91 GB)
Configured Capacity: 62245027840 (57.97 GB)
Configured Capacity: 62245027840 (57.97 GB)
Configured Capacity: 62245027840 (57.97 GB)

Running Beeline

Run Beeline using the --cluster option.

[hadoop@ip-172-31-19-1 mr3-run]$ hive/run-beeline.sh --cluster --hivesrc3
...
Connecting to jdbc:hive2://ip-172-31-19-1:9832/;;
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive

Use the new database.

0: jdbc:hive2://ip-172-31-19-1:9832/> show databases;
...
+---------------------------+
|       database_name       |
+---------------------------+
| default                   |
| tpcds_partitioned_10_orc  |
+---------------------------+
2 rows selected (0.138 seconds)
0: jdbc:hive2://ip-172-31-19-1:9832/> use tpcds_partitioned_10_orc;
...
No rows affected (0.028 seconds)

Execute queries. The complete set of 99 TPC-DS queries is found under hive/benchmarks/hive-testbench/sample-queries-tpcds-hive3. While a query executes, Beeline reports the progress of every vertex:

INFO  : Map 1: -/-  Map 5: 2/2  Map 6: 0(+1)/1  Reducer 2: 0/10 Reducer 3: 0/6  Reducer 4: 0/1  
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1                 llap       Running      1          0        1        0       0       0  
Map 5 ..........      llap     Succeeded      2          2        0        0       0       0  
Map 6 ..........      llap     Succeeded      1          1        0        0       0       0  
Reducer 2             llap         Ready     10          0        0       10       0       0  
Reducer 3             llap        Inited      6          0        0        6       0       0  
Reducer 4             llap        Inited      1          0        0        1       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/06  [===>>-----------------------] 14%   ELAPSED TIME: 8.70 s     
----------------------------------------------------------------------------------------------

Stopping HiveServer2

Stop HiveServer2.

[hadoop@ip-172-31-19-1 mr3-run]$ hive/hiveserver2-service.sh stop --cluster --hivesrc3

After a while, no YARN application is found.

[hadoop@ip-172-31-19-1 mr3-run]$ yarn application -list
20/02/03 15:19:04 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-19-1.ap-northeast-2.compute.internal/172.31.19.1:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
                Application-Id      Application-Name      Application-Type        User       Queue               State         Final-State         Progress                        Tracking-URL

Stopping Metastore

Stop Metastore.

[hadoop@ip-172-31-28-236 mr3-run]$ hive/metastore-service.sh stop --cluster --hivesrc3

The user can check if Metastore has successfully stopped by reading its log file.

[hadoop@ip-172-31-28-236 mr3-run]$ tail -n3 /tmp/hadoop/hive.log 
/************************************************************
SHUTDOWN_MSG: Shutting down HiveMetaStore at ip-172-31-28-236.ap-northeast-2.compute.internal/172.31.28.236
************************************************************/