In order to install Hive on MR3 on Hadoop, download a pre-built MR3 release and uncompress it in a directory of your choice (e.g., under the user’s home directory). A pre-built MR3 release contains everything for running Hive on MR3 on Hadoop, including scripts, preset configuration files, and jar files. It suffices to install Hive on MR3 only on the master node where HiveServer2 or HiveCLI is to run, and the user does not have to install it on worker nodes. (Hive 3 is built with access to Amazon S3.)

For Java 8:

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.10/hivemr3-1.10-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.10-hive3.1.3-k8s.tar.gz | tar xvf -
$ mv hivemr3-1.10-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

For Java 17:

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.10/hivemr3-1.10-java17-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.10-java17-hive3.1.3-k8s.tar.gz | tar xvf -
$ mv hivemr3-1.10-java17-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

Then the user can run Hive on MR3 after a few additional steps. The following structure shows important files and directories in the release:

├── env.sh
├── conf
│   ├── local
│   ├── cluster
│   └── tpcds
├── hadoop
├── hive
│   ├── compile-hive.sh
│   ├── gen-tpcds.sh
│   ├── hiveserver2-service.sh
│   ├── metastore-service.sh
│   ├── run-beeline.sh
│   ├── run-hive-cli.sh
│   ├── run-tpcds.sh
│   ├── benchmarks
│   │   └── hive-testbench
│   └── hivejar
│       └── apache-hive-3.1.3-bin
├── mr3
│   ├── upload-hdfslib-mr3.sh
│   ├── mr3jar
│   ├── mr3lib
│   └── mr3-ui
└── tez
    ├── compile-tez.sh
    ├── upload-hdfslib-tez.sh
    └── tezjar
        └── tez-0.9.1.mr3.0.1

Prerequisites for running Hive on MR3 on Hadoop

In order to run Hive on MR3 on Hadoop, the following requirements should be met.

  • Java 8 or Java 17 should be available. For Java 8, we recommend update 161 or later, which enables the unlimited cryptography policy by default. Java 17 should be installed in the same directory on every node.
  • Basic Hadoop commands such as hadoop, hdfs, and yarn should be available.
  • The user should have access to their home directory and the /tmp directory on HDFS.
    • Ex. A user foo should have access to /user/foo and /tmp on HDFS.
    • Hive on MR3 stores MR3 and Tez jar files under /user/foo/lib.
  • If a directory to be specified by hive.exec.scratchdir in hive-site.xml already exists on HDFS, it must have directory permission 733, not 700.
    • Ex. If hive.exec.scratchdir in hive-site.xml specifies /tmp/hive, then either /tmp/hive should exist with directory permission 733, or it should not exist at all. HiveServer2 automatically creates the directory with permission 733 if it does not exist.
  • MySQL (or any database server supported by Metastore) should be running if the user wants to run Metastore with a MySQL database. The user should also have access to the database with a user name and a password.
  • mvn, gcc, and javac should be available in order to generate TPC-DS datasets.
  • Depending on the size of the cluster, the kernel configuration parameter SOMAXCONN (net.core.somaxconn) should be set to a sufficiently large value, e.g., 16384, on every node.
  • Depending on the size of the cluster, the user limits nofile (open files) and nproc (max user processes) reported by the command ulimit should be sufficiently large. The user can change these values by updating the file /etc/security/limits.conf or an equivalent file.

Then any user (not necessarily an administrator user) can run Hive on MR3.
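
The requirements above can be checked quickly before going further. The following is a minimal sketch, assuming a Linux node and a POSIX shell; the helper names have_cmd and report_cmds are hypothetical and not part of the MR3 release. It only reports what it finds and does not change any setting.

```shell
#!/bin/sh
# Sketch of a prerequisite check. The function names are illustrative
# and are not shipped with the MR3 release.

# have_cmd NAME: succeed if NAME is found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# report_cmds NAME...: print one "ok"/"missing" line per command.
report_cmds() {
  for c in "$@"; do
    if have_cmd "$c"; then
      echo "ok: $c"
    else
      echo "missing: $c"
    fi
  done
}

# Commands needed to run Hive on MR3 on Hadoop.
report_cmds java hadoop hdfs yarn
# Commands needed only for generating TPC-DS datasets.
report_cmds mvn gcc javac

# Kernel and user limits; judge the values against the size of the cluster.
if [ -r /proc/sys/net/core/somaxconn ]; then
  echo "net.core.somaxconn = $(cat /proc/sys/net/core/somaxconn)"
fi
echo "nofile (ulimit -n) = $(ulimit -n)"
echo "nproc (ulimit -u) = $(ulimit -u 2>/dev/null || echo unknown)"
```

Run the script on every node, since the kernel parameter and the user limits are per-node settings.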

In a Kerberos-enabled secure cluster

For running Hive on MR3 in a secure cluster with Kerberos, the user should have a principal as well as permission to get Kerberos tickets and create a keytab file. The following commands are commonly used:

kinit <your principal>      # for getting a new Kerberos ticket
ktutil                      # for creating a keytab file

In order to run Metastore and HiveServer2, the user (or the administrator user) should have access to a service keytab file. Typically the service keytab file is associated with user hive. The format of the principal in the service keytab file should be primary/instance@REALM.

  • Ex. hive/node0@MR3.COM where hive is the primary, node0 is the host where Metastore or HiveServer2 runs, and MR3.COM is the realm which is usually the domain name of the machine.

In comparison, the format of the principal in an ordinary keytab file is usually primary@REALM without an instance field.
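
To make the two principal formats concrete, here is a small sketch in POSIX shell that classifies a principal and splits it into its fields. The function names principal_kind and principal_parts are hypothetical helpers written for this illustration; they check only the shape of the string and do not contact a KDC.

```shell
# principal_kind PRINCIPAL: print "service" for primary/instance@REALM,
# "user" for primary@REALM, and "invalid" otherwise.
principal_kind() {
  case "$1" in
    */*@*) echo service ;;
    *@*)   echo user ;;
    *)     echo invalid ;;
  esac
}

# principal_parts PRINCIPAL: print the primary, instance (possibly empty),
# and realm fields of a principal.
principal_parts() {
  realm=${1##*@}        # text after the last '@'
  name=${1%@*}          # text before the last '@'
  primary=${name%%/*}   # text before the first '/'
  case "$name" in
    */*) instance=${name#*/} ;;
    *)   instance= ;;
  esac
  echo "primary=$primary instance=$instance realm=$realm"
}

principal_kind hive/node0@MR3.COM
principal_kind hive@HADOOP
principal_parts hive/node0@MR3.COM
```

The first call reports a service principal and the second an ordinary user principal, matching the two formats described above.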

In order to support impersonation in HiveServer2, Yarn should be configured to allow the user starting Metastore and HiveServer2 to impersonate. For example, in order to allow user hive to impersonate, the administrator user should add two configuration settings to core-site.xml and restart Yarn:

<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>hive,foo,bar</value>
</property>

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>red0</value>
</property> 

In this example, hive in hadoop.proxyuser.hive.groups and hadoop.proxyuser.hive.hosts denotes the user starting Metastore and HiveServer2. Thus hadoop.proxyuser.hive.groups is the key for specifying the list of groups whose members can be impersonated by user hive, and hadoop.proxyuser.hive.hosts is the key for specifying the list of nodes where user hive can impersonate.

Setting environment variables for Hive on MR3

The behavior of Hive on MR3 depends on env.sh and four configuration files (hive-site.xml, mr3-site.xml, tez-site.xml, and mapred-site.xml). hive-site.xml configures Hive, mr3-site.xml configures MR3, and tez-site.xml configures the Tez runtime. Hive reads mapred-site.xml when running Hive with the MapReduce execution engine and when generating TPC-DS data.

env.sh is a self-descriptive script located in the root directory of the installation. It contains major environment variables that should be set in every installation environment. The following environment variables should be set according to the configuration of the installation environment.

For Java 8:

$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk1.8  # Java 8
export PATH=$JAVA_HOME/bin:$PATH

USE_JAVA_17=false

For Java 17:

$ vi env.sh

export JAVA_HOME=/usr/jdk64/jdk17   # Java 17
export PATH=$JAVA_HOME/bin:$PATH

USE_JAVA_17=true

In either case, set the variables for Hadoop and security:

$ vi env.sh

export HADOOP_HOME=/usr/lib/hadoop
HDFS_LIB_DIR=/user/$USER/lib
HADOOP_HOME_LOCAL=$HADOOP_HOME
HADOOP_NATIVE_LIB=$HADOOP_HOME/lib/native

SECURE_MODE=false

USER_PRINCIPAL=hive@HADOOP
USER_KEYTAB=/home/hive/hive.keytab

  • HDFS_LIB_DIR specifies the directory on HDFS to which MR3 and Tez jar files are uploaded. Hence it is only for non-local mode.
  • HADOOP_HOME_LOCAL specifies the directory for the Hadoop installation to use in local mode in which everything runs on a single machine and does not require Yarn.
  • SECURE_MODE specifies whether the cluster is secure with Kerberos or not.
  • USER_PRINCIPAL and USER_KEYTAB specify the principal and keytab file for the user executing HiveCLI and Beeline.
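
After editing env.sh, it can help to confirm that the variables point at real locations. The following is a minimal sketch, not part of the MR3 release; the function name check_env is hypothetical, and it assumes env.sh has already been sourced into the current shell.

```shell
# check_env: verify that JAVA_HOME and HADOOP_HOME name existing directories
# and, in secure mode, that the keytab file is readable.
# Prints one line per finding and returns non-zero if anything is wrong.
check_env() {
  status=0
  for var in JAVA_HOME HADOOP_HOME; do
    eval "dir=\${$var:-}"   # indirect lookup of the variable named by $var
    if [ -z "$dir" ]; then
      echo "$var is not set"
      status=1
    elif [ ! -d "$dir" ]; then
      echo "$var=$dir is not a directory"
      status=1
    else
      echo "$var=$dir ok"
    fi
  done
  if [ "${SECURE_MODE:-false}" = "true" ] && [ ! -r "${USER_KEYTAB:-}" ]; then
    echo "USER_KEYTAB=${USER_KEYTAB:-} is not readable"
    status=1
  fi
  return "$status"
}

# Typical use, from the root directory of the installation:
#   . ./env.sh && check_env
```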

For those who want to rebuild Hive or the Tez runtime, the script also has optional environment variables that specify the directories for the Hive and Tez source code (TEZ_SRC, HIVE3_SRC, HIVE4_SRC).

For Java 8 only

Update the configuration keys mr3.am.launch.cmd-opts and mr3.container.launch.cmd-opts in conf/tpcds/mr3/mr3-site.xml.

  • add -XX:+AggressiveOpts for performance.
  • remove --add-opens java.base/java.net=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.time=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED ... (which are Java 17 options).

Update the configuration keys mr3.am.launch.env and mr3.container.launch.env in conf/tpcds/mr3/mr3-site.xml.

  • remove JAVA_HOME=/home/hive/jdk17/.

Preset configuration files

The MR3 release contains three collections of preset configuration files under directories conf/local, conf/cluster, and conf/tpcds. These configuration directories are intended for the following scenarios:

  • conf/local (default): running Hive on MR3 in local mode (in which everything runs on a single machine) with a Derby database for Metastore
  • conf/cluster: running Hive on MR3 in a cluster with a Derby database for Metastore
  • conf/tpcds: running Hive on MR3 in a cluster with a MySQL database for Metastore

Each configuration directory has the following structure:

├── hive3
│   ├── beeline-log4j2.properties
│   ├── hive-log4j2.properties
│   └── hive-site.xml
├── mapreduce
│   └── mapred-site.xml
├── mr3
│   └── mr3-site.xml
└── tez
    └── tez-site.xml

For typical use cases on a Hadoop cluster, the user can start with conf/tpcds and revise configuration files (hive-site.xml, mr3-site.xml, tez-site.xml) for performance tuning.

Every script in the MR3 release accepts one of the following options to choose a corresponding configuration directory:

--local             # Run jobs with configurations in conf/local/ (default).
--cluster           # Run jobs with configurations in conf/cluster/.
--tpcds             # Run jobs with configurations in conf/tpcds/.

A script may also accept an additional option to choose corresponding configuration files:

--hivesrc3          # Choose hive3-mr3 (based on Hive 3.1.3) (default).

For example, --tpcds --hivesrc3 chooses:

  • conf/tpcds/hive3/hive-site.xml
  • conf/tpcds/mr3/mr3-site.xml
  • conf/tpcds/tez/tez-site.xml
  • conf/tpcds/mapreduce/mapred-site.xml

In this way, the user can easily try different combinations of Hive and Tez when running Hive on MR3.
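
The mapping from an option to its configuration files can be sketched as a small shell function. This only illustrates the layout described above, under the assumption that --hivesrc3 is in effect; conf_files is a hypothetical name, and the actual scripts in the release may resolve the files differently.

```shell
# conf_files [OPTION]: print the four configuration files selected by
# --local (default), --cluster, or --tpcds, assuming --hivesrc3.
conf_files() {
  opt=${1:-"--local"}
  case "$opt" in
    --local)   dir=conf/local ;;
    --cluster) dir=conf/cluster ;;
    --tpcds)   dir=conf/tpcds ;;
    *) echo "unknown option: $opt" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "$dir/hive3/hive-site.xml" \
    "$dir/mr3/mr3-site.xml" \
    "$dir/tez/tez-site.xml" \
    "$dir/mapreduce/mapred-site.xml"
}

conf_files --tpcds
```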

Using custom configuration settings

A script in the MR3 release may accept new configuration settings as command-line options according to the following syntax:

--hiveconf <key>=<value>  # Add a configuration key/value.

The user can append as many instances of --hiveconf as necessary to the command. A configuration value specified with --hiveconf takes the highest precedence and overrides any existing value in hive-site.xml, mr3-site.xml, and tez-site.xml (not just in hive-site.xml). Hence the user can change the behavior of Hive on MR3 without modifying preset configuration files at all. (Note that the user can use --hiveconf to configure not only Hive but also MR3 and Tez.) Alternatively the user can directly modify preset configuration files to make the change permanent.
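
As an illustration of the syntax, the sketch below turns a list of key/value pairs into repeated --hiveconf options and prints the resulting command line without running it. The helper hiveconf_opts is hypothetical, and using hive/run-hive-cli.sh here assumes that this particular script accepts --hiveconf. Values containing spaces are not handled by this simple version.

```shell
# hiveconf_opts KEY=VALUE...: print one "--hiveconf KEY=VALUE" per argument.
hiveconf_opts() {
  opts=
  for kv in "$@"; do
    case "$kv" in
      ?*=*) opts="$opts --hiveconf $kv" ;;
      *) echo "expected <key>=<value>, got: $kv" >&2; return 1 ;;
    esac
  done
  echo "${opts# }"   # drop the leading space
}

# Print (do not run) a command line that overrides one setting.
echo hive/run-hive-cli.sh --tpcds \
  $(hiveconf_opts hive.exec.scratchdir=/tmp/hive)
```

The same syntax applies to keys from mr3-site.xml and tez-site.xml.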

The user may create hivemetastore-site.xml and hiveserver2-site.xml in a configuration directory for Hive (conf/???/hive3) as configuration files for Metastore and HiveServer2, respectively. Hive automatically reads these files when reading hive-site.xml. The order of precedence of the configuration files is as follows (lower to higher):

hive-site.xml → hivemetastore-site.xml → hiveserver2-site.xml → --hiveconf command-line options
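
The effect of this ordering on a single key can be modeled in a few lines of shell. This is only a toy model of the rule, not how Hive implements it: the hypothetical function effective_value takes the value found in each source, listed from lowest to highest precedence, and treats an empty string as "not set in that source".

```shell
# effective_value SITE METASTORE HIVESERVER2 CMDLINE: print the value that
# wins for one key. Arguments are ordered from lowest to highest precedence,
# and an empty argument means the key is not set in that source.
effective_value() {
  result=
  for v in "$@"; do
    if [ -n "$v" ]; then
      result=$v   # a later (higher-precedence) source overrides earlier ones
    fi
  done
  echo "$result"
}

# Key set in hive-site.xml and hiveserver2-site.xml only:
effective_value from-hive-site "" from-hiveserver2-site ""
```

Here the value from hiveserver2-site.xml is printed because no --hiveconf option overrides it.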

Uploading MR3 and Tez jar files

The last step before running Hive on MR3 is to upload MR3 and Tez jar files to HDFS. In order to run HiveServer2 or HiveCLI, the user should execute the following commands which copy all the MR3 and Tez jar files (under mr3/mr3jar and tez/tezjar) to the directory specified by HDFS_LIB_DIR in env.sh:

$ mr3/upload-hdfslib-mr3.sh
$ tez/upload-hdfslib-tez.sh

When running Hive on MR3, these jar files are registered as local resources for Hadoop jobs and automatically distributed to worker nodes (where NodeManagers are running). This step is unnecessary for running Hive on MR3 in local mode, or for running Metastore and Beeline.

To run HiveServer2 with doAs enabled (by setting hive.server2.enable.doAs to true in hive-site.xml), the user (typically the administrator user) should make the MR3 and Tez jar files readable to all end users after uploading them to HDFS. This is because every job runs as the end user who actually submits it. If the MR3 and Tez jar files are not readable to the end user, the job fails immediately because no files can be registered as local resources.
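
One way to open the uploaded jar files for reading is a recursive chmod on the HDFS directory. The sketch below is an assumption about how an administrator might do this, not a step taken from the MR3 scripts; make_jars_readable and the DRY_RUN variable are hypothetical. By default it only prints the command so that it can be reviewed first.

```shell
# make_jars_readable DIR: make everything under the HDFS directory DIR
# readable by all users. The symbolic mode a+rX adds read permission to
# files and read/traverse permission to directories.
# With DRY_RUN=1 (the default) the command is printed, not executed.
make_jars_readable() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: hdfs dfs -chmod -R a+rX $1"
  else
    hdfs dfs -chmod -R a+rX "$1"
  fi
}

# Example for a user foo whose HDFS_LIB_DIR is /user/foo/lib.
make_jars_readable /user/foo/lib
```

Set DRY_RUN=0 to apply the change, then list the directory with hdfs dfs -ls to confirm the new permissions.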