This page shows how to operate Hive on MR3 in a non-secure Hortonworks HDP cluster without Kerberos. An ordinary user (not an administrator of the Hadoop cluster) will run HiveServer2. For Metastore, we will use the instance already running on HDP. By following the instructions, the user will learn:

  1. how to install and configure Hive on MR3 in a non-secure HDP cluster without Kerberos
  2. how to start and stop HiveServer2
  3. how to create Beeline connections

This scenario has the following prerequisites:

  • A non-secure HDP cluster is available.
  • The user has administrator access to Ambari of HDP.
  • The user has administrator access to HDFS.
  • The user has access to the home directory and /tmp directory on HDFS.

This scenario should take less than 30 minutes to complete, not including the time for downloading an MR3 release. The user can apply the same instructions to operate Hive on MR3 in any Hadoop cluster where Metastore is already running.

This page has been tested with MR3 release 1.7 on HDP 3.1.4.
For questions, please visit the MR3 Google Group or join MR3 Slack.

Installation

In our example, we use an ordinary user gla for installing and running HiveServer2.

Download a pre-built MR3 release compatible with Metastore of HDP on a node where the command yarn is available. HDP 3.1.4 includes Hive 3.1, so we should download hivemr3-1.7-hive3.1.3-k8s.tar.gz (which is based on Hive 3.1.3 and corresponds to the --hivesrc3 option used later).

Uncompress the pre-built MR3 release, rename the new directory to mr3-run, and change the working directory. Renaming the new directory is not strictly necessary, but it is recommended because the sample configuration file hive-site.xml included in the MR3 release uses the same directory name.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.7/hivemr3-1.7-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.7-hive3.1.3-k8s.tar.gz | tar xvf -;
$ mv hivemr3-1.7-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

Configuring env.sh

Open env.sh and set JAVA_HOME. Set HADOOP_HOME to the Hadoop installation directory.

$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export HADOOP_HOME=/usr/hdp/3.1.4.0-315/hadoop

Set the following environment variables to adjust the memory size (in MB) to be allocated to each component:

  • HIVE_SERVER2_HEAPSIZE specifies the memory size for HiveServer2.
  • HIVE_CLIENT_HEAPSIZE specifies the memory size of Beeline (beeline command).
  • MR3_AM_HEAPSIZE specifies the memory size of MR3 DAGAppMaster.

$ vi env.sh

HIVE_SERVER2_HEAPSIZE=16384
HIVE_CLIENT_HEAPSIZE=1024
MR3_AM_HEAPSIZE=10240

Open the Ambari webpage as the administrator user and find the information on Metastore: Database Name, Database Username, Database URL, Hive Database Type, JDBC Driver Class, Database Password, hive.metastore.uris, and Hive Metastore Warehouse directory.

[Figure: Ambari screenshots showing the Metastore settings]

Set the following environment variables according to the information on Metastore. Note the prefix HIVE3_, which corresponds to the --hivesrc3 option.

$ vi env.sh

HIVE3_DATABASE_HOST=blue0         # from Database URL
HIVE3_METASTORE_HOST=blue0        # from hive.metastore.uris
HIVE3_METASTORE_PORT=9083         # from hive.metastore.uris
HIVE3_METASTORE_LOCAL_PORT=9083
HIVE3_DATABASE_NAME=hivellap      # from Database Name
HIVE3_HDFS_WAREHOUSE=/warehouse/tablespace/managed/hive   # from Hive Metastore Warehouse directory
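
The host and port can be read off the hive.metastore.uris value mechanically. The following is a minimal sketch, assuming the value reported by Ambari is a single URI of the form thrift://blue0:9083 (this exact value is an assumption chosen to match the settings above; substitute whatever Ambari shows).

```shell
# Hypothetical value copied from Ambari; replace with the actual hive.metastore.uris.
METASTORE_URI="thrift://blue0:9083"

# Strip the scheme, then split host and port at the last colon.
hostport="${METASTORE_URI#thrift://}"
metastore_host="${hostport%:*}"
metastore_port="${hostport##*:}"

echo "HIVE3_METASTORE_HOST=$metastore_host"
echo "HIVE3_METASTORE_PORT=$metastore_port"
```

If hive.metastore.uris lists several comma-separated URIs, this sketch does not apply as written and one entry should be picked first.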

Set HIVE_MYSQL_DRIVER to specify the path to a MySQL connector jar file.

$ vi env.sh

HIVE_MYSQL_DRIVER=/usr/share/java/mysql-connector-java.jar

Set the following environment variables to indicate a non-secure HDP cluster.

$ vi env.sh

SECURE_MODE=false
TOKEN_RENEWAL_HDFS_ENABLED=false
TOKEN_RENEWAL_HIVE_ENABLED=false
HIVE_SERVER2_AUTHENTICATION=NONE

Configuring hive-site.xml

Open conf/tpcds/hive3/hive-site.xml and set the following configuration keys according to the information on Metastore.

$ vi conf/tpcds/hive3/hive-site.xml

<property>
  <name>hive.metastore.db.type</name>
  <value>MYSQL</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>

Set the following configuration keys according to the current user name gla (instead of the default user hive) and the working directory (instead of the default directory /home/hive).

$ vi conf/tpcds/hive3/hive-site.xml

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/gla</value>
</property>

<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/gla/operation_logs</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>/home/gla/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-common-3.1.3.jar,/home/gla/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-server-3.1.3.jar,/home/gla/mr3-run/hive/hivejar/apache-hive-3.1.3-bin/lib/hive-llap-tez-3.1.3.jar</value>
</property>

The following configuration keys specify resources to be allocated to a Map Task, a Reduce Task, and a ContainerWorker. By default, we allocate 4GB and a single core to a Map Task and a Reduce Task. A single ContainerWorker can accommodate 10 concurrent Tasks.

$ vi conf/tpcds/hive3/hive-site.xml

<property>
  <name>hive.mr3.map.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.map.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.reduce.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.reduce.task.vcores</name>
  <value>1</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>40960</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.vcores</name>
  <value>10</value>
</property>
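
The figure of 10 concurrent Tasks follows from dividing the ContainerWorker resources by the per-Task resources. The arithmetic can be sketched as follows, under the assumption that concurrency is bounded by whichever of memory and cores runs out first (the variable names are illustrative only and are not MR3 settings).

```shell
# Values taken from hive-site.xml above.
task_memory_mb=4096
task_vcores=1
worker_memory_mb=40960
worker_vcores=10

by_memory=$(( worker_memory_mb / task_memory_mb ))
by_vcores=$(( worker_vcores / task_vcores ))

# Assumption: the smaller of the two ratios limits concurrency.
if [ "$by_memory" -lt "$by_vcores" ]; then
  concurrent_tasks=$by_memory
else
  concurrent_tasks=$by_vcores
fi

echo "concurrent Tasks per ContainerWorker: $concurrent_tasks"
```

With the default values both ratios are 10, so changing the Task memory or the ContainerWorker memory alone changes how many Tasks fit.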

By default, we enable impersonation by setting hive.server2.enable.doAs to true.

$ vi conf/tpcds/hive3/hive-site.xml

<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>

Creating directories on HDFS

The administrator user (e.g., hdfs) should create a directory for user gla.

# execute as the administrator user
$ hdfs dfs -mkdir /user/gla/
$ hdfs dfs -chown gla /user/gla/

The administrator user should also check that /tmp/gla (corresponding to the configuration key hive.exec.scratchdir in hive-site.xml) does not exist on HDFS. If the directory already exists (e.g., when running Hive on MR3 for the second time), make sure that its permission is set to 733. If the directory does not exist, HiveServer2 automatically creates it with permission 733.

# execute as the administrator user
$ hdfs dfs -ls /tmp/gla
ls: `/tmp/gla': No such file or directory
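
When /tmp/gla is left over from an earlier run, the existence check and the permission fix can be combined. The following is a minimal sketch, assuming it is run as the administrator user; the function name ensure_scratchdir is hypothetical, while hdfs dfs -test and hdfs dfs -chmod are standard HDFS shell commands.

```shell
# Hypothetical helper: reset the permission of an existing scratch directory.
# If the directory is absent, do nothing and let HiveServer2 create it.
ensure_scratchdir() {
  dir="$1"
  if hdfs dfs -test -d "$dir"; then
    hdfs dfs -chmod 733 "$dir"
    echo "set permission 733 on $dir"
  else
    echo "$dir does not exist; HiveServer2 will create it"
  fi
}

# Usage (as the administrator user):
#   ensure_scratchdir /tmp/gla
```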

Then the user gla should create a directory for storing MR3 and Tez jar files.

$ printenv | grep USER
USER=gla
$ hdfs dfs -mkdir -p /user/gla/lib/

Load MR3 jar files.

$ mr3/upload-hdfslib-mr3.sh

Load Tez jar files.

$ tez/upload-hdfslib-tez.sh

Creating temporary directories

Optionally the user can create a new directory specified by hive.server2.logging.operation.log.location before starting HiveServer2.

$ ls -alt /tmp/gla/operation_logs
ls: cannot access /tmp/gla/operation_logs: No such file or directory
$ mkdir -p /tmp/gla/operation_logs

Updating core-site.xml of HDP

Open the Ambari webpage as the administrator user and set two new configuration keys hadoop.proxyuser.gla.groups and hadoop.proxyuser.gla.hosts in core-site.xml. Then restart HDFS, Yarn, and Metastore.

[Figure: Ambari screenshot showing the proxy user settings in core-site.xml]

Without this step, HiveServer2 may fail to start with the following error (which is related to HIVE-19740).

2023-06-18T15:21:00,154  WARN [main] metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 24) after 5s. getCurrentNotificationEventId
org.apache.thrift.TApplicationException: Internal error processing get_current_notificationEventId
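
After the restart, the two keys can be inspected from the command line. The following is a minimal sketch, assuming the node's client configuration has been refreshed by Ambari; the function name show_proxyuser is hypothetical, and hdfs getconf -confKey only reports what the local configuration files contain.

```shell
# Hypothetical helper: print the proxy-user settings visible in the local
# Hadoop client configuration for the given user.
show_proxyuser() {
  user="$1"
  for suffix in groups hosts; do
    key="hadoop.proxyuser.${user}.${suffix}"
    if value=$(hdfs getconf -confKey "$key" 2>/dev/null); then
      echo "$key=$value"
    else
      echo "$key is not set"
    fi
  done
}

# Usage:
#   show_proxyuser gla
```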

Running HiveServer2

Run HiveServer2 using the --tpcds and --hivesrc3 options. To use LocalProcess mode for MR3 DAGAppMaster, add the --amprocess option.

$ hive/hiveserver2-service.sh start --tpcds --hivesrc3

After a while, check if HiveServer2 has successfully started by inspecting its log file.

$ tail -f /data2/gla/mr3-run/hive/hiveserver2-service-result/hive-mr3--2023-06-19-00-28-16-8db63259/hive-logs/hive.log 

...
2023-06-18T15:28:39,389  INFO [main] server.Server: Started @13972ms
2023-06-18T15:28:39,390  INFO [main] server.HiveServer2: Web UI has started on port 10502
2023-06-18T15:28:39,390  INFO [main] server.HiveServer2: HS2 interactive HA not enabled. Starting sessions..
2023-06-18T15:28:39,390  INFO [main] http.HttpServer: Started HttpServer[hiveserver2] on port 10502
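
Instead of watching the log with tail -f, one can poll it for the startup message. The following is a minimal sketch, assuming that the line containing "Started HttpServer[hiveserver2]" marks a successful start, as in the excerpt above; the function name and the default timeout are hypothetical.

```shell
# Hypothetical helper: wait until the log reports that HiveServer2 has started.
# $1 = path to hive.log, $2 = timeout in seconds (default 120).
wait_for_hiveserver2() {
  log="$1"
  timeout="${2:-120}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -F treats the pattern as a fixed string, so [ and ] are matched literally.
    if grep -qF 'Started HttpServer[hiveserver2]' "$log" 2>/dev/null; then
      echo "HiveServer2 started"
      return 0
    fi
    sleep 1
    elapsed=$(( elapsed + 1 ))
  done
  echo "HiveServer2 did not start within ${timeout}s" >&2
  return 1
}

# Usage:
#   wait_for_hiveserver2 /path/to/hive-logs/hive.log 300
```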

The user can find a new Yarn application of type mr3 submitted by the user gla.

$ yarn application -list
...
application_1660836356025_0002	a91f2aae-6efe-43dc-87d2-1e4ac1b50b78	                 mr3	       gla	   default	           RUNNING	         UNDEFINED	             0%	                                N/A
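
On a busy cluster the listing can be long, so it may help to filter it. The following is a minimal sketch, assuming the output of yarn application -list is tab-separated with space padding, with the application type, user, and state in columns 3, 4, and 6 as in the line above; the function name mr3_apps_of is hypothetical.

```shell
# Hypothetical helper: print the IDs of running applications of type mr3
# owned by a given user, reading `yarn application -list` output on stdin.
mr3_apps_of() {
  awk -F'\t' -v user="$1" '
    {
      # Trim the space padding around every column.
      for (i = 1; i <= NF; i++) gsub(/^[ ]+|[ ]+$/, "", $i)
    }
    $3 == "mr3" && $4 == user && $6 == "RUNNING" { print $1 }
  '
}

# Usage:
#   yarn application -list | mr3_apps_of gla
```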

Running Beeline

Run Beeline using the --tpcds and --hivesrc3 options.

$ hive/run-beeline.sh --tpcds --hivesrc3

# Running Beeline using Hive-MR3 (3.1.3) #
...
Connecting to jdbc:hive2://blue0:9832/;;
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://blue0:9832/> 

The user can find all databases registered to Metastore of HDP.

0: jdbc:hive2://blue0:9832/> show databases;
...
+----------------------------------+
|          database_name           |
+----------------------------------+
| default                          |
| information_schema               |
| sys                              |
| tpcds_bin_partitioned_orc_10000  |
| tpcds_text_1000                  |
| tpcds_text_10000                 |
+----------------------------------+
6 rows selected (1.652 seconds)

Another user can also run Beeline to connect to the same HiveServer2. In the following example, another ordinary user gitlab-runner runs Beeline (where we assume that the directory hive/run-beeline-result/ is writable by user gitlab-runner).

$ printenv | grep USER
USER=gitlab-runner
$ pwd
/home/gla/mr3-run
$ hive/run-beeline.sh --tpcds --hivesrc3

# Running Beeline using Hive-MR3 (3.1.3) #
...
Connecting to jdbc:hive2://blue0:9832/;;
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://blue0:9832/> 

Stopping HiveServer2

Stop HiveServer2 as user gla.

$ hive/hiveserver2-service.sh stop --tpcds --hivesrc3