In order to install Hive on MR3 on Hadoop, download a pre-built MR3 release and uncompress it in a directory of your choice (e.g., under the user’s home directory). A pre-built MR3 release contains everything for running Hive on MR3 on Hadoop, including scripts, preset configuration files, and jar files. Hive on MR3 needs to be installed only on the master node where HiveServer2 or HiveCLI is to run; it does not have to be installed on worker nodes. Choose one of the following releases according to the Java and Hive versions to be used.
# Java 8, Hive 3.1.3
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.11-hive3.1.3-k8s.tar.gz | tar xvf -
$ mv hivemr3-1.11-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

# Java 17, Hive 3.1.3
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.11/hivemr3-1.11-java17-hive3.1.3-k8s.tar.gz
$ gunzip -c hivemr3-1.11-java17-hive3.1.3-k8s.tar.gz | tar xvf -
$ mv hivemr3-1.11-java17-hive3.1.3-k8s/ mr3-run
$ cd mr3-run/

# Java 17, Hive 4.0.1
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.12/hivemr3-1.12-java17-hive4.0.1-k8s.tar.gz
$ gunzip -c hivemr3-1.12-java17-hive4.0.1-k8s.tar.gz | tar xvf -
$ mv hivemr3-1.12-java17-hive4.0.1-k8s/ mr3-run
$ cd mr3-run/
Then the user can run Hive on MR3 after a few additional steps. The following structure shows important files and directories in the release:
├── env.sh
├── conf
│ ├── local
│ ├── cluster
│ └── tpcds
├── hadoop
├── hive
│ ├── compile-hive.sh
│ ├── gen-tpcds.sh
│ ├── hiveserver2-service.sh
│ ├── metastore-service.sh
│ ├── run-beeline.sh
│ ├── run-hive-cli.sh
│ ├── run-tpcds.sh
│ ├── benchmarks
│ │ └── hive-testbench
│ └── hivejar
│ └── apache-hive-3.1.3-bin
├── mr3
│ ├── upload-hdfslib-mr3.sh
│ ├── mr3jar
│ ├── mr3lib
│ └── mr3-ui
└── tez
├── compile-tez.sh
├── upload-hdfslib-tez.sh
└── tezjar
└── tez-0.9.1.mr3.1.0
Prerequisites for running Hive on MR3 on Hadoop
In order to run Hive on MR3 on Hadoop, the following requirements should be met.
- Java 8 or Java 17 should be available. For Java 8, we recommend Java update 161 or later, which enables the unlimited cryptography policy by default. Java 17 should be installed in the same directory on every node.
- Basic Hadoop commands such as hadoop, hdfs, and yarn should be available.
- The user should have access to his home directory and the /tmp directory on HDFS.
  - Ex. A user foo should have access to /user/foo and /tmp on HDFS.
  - Hive on MR3 stores MR3 and Tez jar files under /user/foo/lib.
- If the directory specified by hive.exec.scratchdir in hive-site.xml already exists on HDFS, it must have directory permission 733, not 700.
  - Ex. If hive.exec.scratchdir in hive-site.xml specifies /tmp/hive, either the directory /tmp/hive should exist with permission 733, or it should not exist at all. HiveServer2 automatically creates a new directory with permission 733 if it does not exist.
- MySQL (or any database server supported by Metastore) should be running if the user wants to run Metastore with a MySQL database. The user should also have access to the database with a user name and a password.
- mvn, gcc, and javac should be available in order to generate TPC-DS datasets.
- Depending on the size of the cluster, the kernel configuration parameter SOMAXCONN (net.core.somaxconn) should be set to a sufficiently large value, e.g., 16384, on every node.
- Depending on the size of the cluster, the user limits nofile (open files) and nproc (max user processes) reported by the command ulimit should be sufficiently large. The user can change these values by updating /etc/security/limits.conf or an equivalent file (see the example below).
Then any user (not necessarily an administrator user) can run Hive on MR3.
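As an illustration of the last two prerequisites, the following commands sketch how to check and raise the kernel parameter and the user limits. The values shown (16384 and 65536) and the use of limits.conf entries for all users are examples only and should be adjusted to the cluster.

# Check and set net.core.somaxconn (run on every node)
$ sysctl net.core.somaxconn
$ sudo sysctl -w net.core.somaxconn=16384

# Check the current user limits
$ ulimit -n          # nofile (open files)
$ ulimit -u          # nproc (max user processes)

# Raise the limits permanently (example values)
$ sudo tee -a /etc/security/limits.conf <<EOF
*    soft    nofile    65536
*    hard    nofile    65536
*    soft    nproc     65536
*    hard    nproc     65536
EOF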
In a Kerberos-enabled secure cluster
For running Hive on MR3 in a secure cluster with Kerberos, the user should have a principal as well as permission to get Kerberos tickets and create a keytab file. The following commands are commonly used:
kinit <your principal> # for getting a new Kerberos ticket
ktutil # for creating a keytab file
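For reference, a typical MIT Kerberos ktutil session for creating a keytab file looks like the following sketch. The principal foo@MR3.COM, the key version number, the encryption type, and the keytab path are examples only and should be adjusted to the installation environment.

$ ktutil
ktutil:  addent -password -p foo@MR3.COM -k 1 -e aes256-cts-hmac-sha1-96
Password for foo@MR3.COM:
ktutil:  wkt /home/foo/foo.keytab
ktutil:  quit
$ kinit -kt /home/foo/foo.keytab foo@MR3.COM   # verify that the keytab works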
In order to run Metastore and HiveServer2, the user (or the administrator user) should have access to a service keytab file. Typically the service keytab file is associated with user hive. The format of the principal in the service keytab file should be primary/instance@REALM.
- Ex. hive/node0@MR3.COM, where hive is the primary, node0 is the host where Metastore or HiveServer2 runs, and MR3.COM is the realm, which is usually the domain name of the machine.

In comparison, the format of the principal in an ordinary keytab file is usually primary@REALM, without an instance field.
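To check the principal format stored in a keytab file, the user can list its entries with klist. The keytab paths and the output below are illustrative only (output abbreviated).

$ klist -kt /etc/security/keytabs/hive.service.keytab
KVNO Principal
---- ------------------------------
   2 hive/node0@MR3.COM
$ klist -kt /home/foo/foo.keytab
KVNO Principal
---- ------------------------------
   1 foo@MR3.COM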
In order to support impersonation in HiveServer2, Yarn should be configured to allow the user starting Metastore and HiveServer2 to impersonate other users. For example, in order to allow user hive to impersonate, the administrator user should add two configuration settings to core-site.xml and restart Yarn:
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>hive,foo,bar</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>red0</value>
</property>
In this example, hive in hadoop.proxyuser.hive.groups and hadoop.proxyuser.hive.hosts denotes the user starting Metastore and HiveServer2. Thus hadoop.proxyuser.hive.groups is the key for specifying the list of groups whose members can be impersonated by user hive, and hadoop.proxyuser.hive.hosts is the key for specifying the list of nodes where user hive can impersonate.
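Depending on the cluster, the administrator may be able to apply new proxyuser settings without a full restart by refreshing the superuser/group mappings on the ResourceManager and NameNode; whether this is sufficient depends on which daemons read core-site.xml in the given setup.

$ yarn rmadmin -refreshSuperUserGroupsConfiguration
$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration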
Setting environment variables for Hive on MR3
The behavior of Hive on MR3 depends on env.sh and four configuration files (hive-site.xml, mr3-site.xml, tez-site.xml, and mapred-site.xml). hive-site.xml configures Hive, mr3-site.xml configures MR3, and tez-site.xml configures the Tez runtime. Hive reads mapred-site.xml when running Hive with the MapReduce execution engine and when generating TPC-DS data.

env.sh is a self-descriptive script located in the root directory of the installation. It contains the major environment variables that should be set in every installation environment.
The following environment variables should be set according to the configuration of the installation environment.
$ vi env.sh
export JAVA_HOME=/usr/jdk64/jdk1.8 # Java 8
export PATH=$JAVA_HOME/bin:$PATH
USE_JAVA_17=false
$ vi env.sh
export JAVA_HOME=/usr/jdk64/jdk17 # Java 17
export PATH=$JAVA_HOME/bin:$PATH
USE_JAVA_17=true
$ vi env.sh
export HADOOP_HOME=/usr/lib/hadoop
HDFS_LIB_DIR=/user/$USER/lib
HADOOP_HOME_LOCAL=$HADOOP_HOME
HADOOP_NATIVE_LIB=$HADOOP_HOME/lib/native
SECURE_MODE=false
USER_PRINCIPAL=hive@HADOOP
USER_KEYTAB=/home/hive/hive.keytab
- HDFS_LIB_DIR specifies the directory on HDFS to which MR3 and Tez jar files are uploaded. Hence it is used only in non-local mode.
- HADOOP_HOME_LOCAL specifies the directory of the Hadoop installation to use in local mode, in which everything runs on a single machine and Yarn is not required.
- SECURE_MODE specifies whether the cluster is secure with Kerberos or not.
- USER_PRINCIPAL and USER_KEYTAB specify the principal and keytab file for the user executing HiveCLI and Beeline.
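For example, in a Kerberos-enabled cluster the same variables in env.sh might be set as follows; the principal, realm, and keytab path are illustrative only.

$ vi env.sh

SECURE_MODE=true
USER_PRINCIPAL=foo@MR3.COM
USER_KEYTAB=/home/foo/foo.keytab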
For those who want to rebuild Hive or the Tez runtime, the script also has optional environment variables that specify the directories for Hive and Tez source code (TEZ_SRC, HIVE3_SRC, HIVE4_SRC).
For Java 8 only

Update the configuration keys mr3.am.launch.cmd-opts and mr3.container.launch.cmd-opts in conf/tpcds/mr3/mr3-site.xml.
- Add -XX:+AggressiveOpts for performance.
- Remove --add-opens java.base/java.net=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.time=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED ... (which are Java 17 options).

Update the configuration keys mr3.am.launch.env and mr3.container.launch.env in conf/tpcds/mr3/mr3-site.xml.
- Remove JAVA_HOME=/home/hive/jdk17/.
For Java 17 only

Update the configuration keys mr3.am.launch.env and mr3.container.launch.env in conf/tpcds/mr3/mr3-site.xml.
- Set JAVA_HOME=/home/hive/jdk17/ to point to the installation directory of Java 17 on every worker node.

In order to execute Metastore and HiveServer2 with Java 17, JAVA_HOME in hadoop-env.sh in the Hadoop configuration directory should also be set to point to the installation directory of Java 17.
$ vi /etc/hadoop/conf/hadoop-env.sh
JAVA_HOME=/home/hive/jdk17/
Preset configuration files
The MR3 release contains three collections of preset configuration files under the directories conf/local, conf/cluster, and conf/tpcds. These configuration directories are intended for the following scenarios:
- conf/local (default): running Hive on MR3 in local mode (in which everything runs on a single machine) with a Derby database for Metastore
- conf/cluster: running Hive on MR3 in a cluster with a Derby database for Metastore
- conf/tpcds (for production use): running Hive on MR3 in a cluster with a MySQL database for Metastore
Each configuration directory has the following structure:
├── hive3
│ ├── beeline-log4j2.properties
│ ├── hive-log4j2.properties
│ └── hive-site.xml
├── mapreduce
│ └── mapred-site.xml
├── mr3
│ └── mr3-site.xml
└── tez
└── tez-site.xml
For typical use cases on a Hadoop cluster, the user can start with conf/tpcds and revise the configuration files (hive-site.xml, mr3-site.xml, tez-site.xml) for performance tuning.
Every script in the MR3 release accepts one of the following options to choose a corresponding configuration directory:
--local # Use configurations in conf/local/ (default).
--cluster # Use configurations in conf/cluster/.
--tpcds # Use configurations in conf/tpcds/ (for production use).
A script may also accept an additional option to choose corresponding configuration files:
--hivesrc3 # Choose hive3 (based on Hive 3.1.3).
--hivesrc4 # Choose hive4 (based on Hive 4.0.0).
For example, --tpcds --hivesrc3 chooses the following configuration files:
conf/tpcds/hive3/hive-site.xml
conf/tpcds/mr3/mr3-site.xml
conf/tpcds/tez/tez-site.xml
conf/tpcds/mapreduce/mapred-site.xml
In this way, the user can easily try different combinations of Hive and Tez when running Hive on MR3.
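For instance, using the scripts shown in the directory structure above, HiveServer2 and Beeline could be invoked with a particular combination of options as follows. This is only a sketch; it assumes that hiveserver2-service.sh accepts a start argument in this installation.

$ hive/hiveserver2-service.sh start --tpcds --hivesrc3
$ hive/run-beeline.sh --tpcds --hivesrc3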
Using custom configuration settings
A script in the MR3 release may accept new configuration settings as command-line options according to the following syntax:
--hiveconf <key>=<value> # Add a configuration key/value.
The user can append as many instances of --hiveconf as necessary to the command. A configuration value specified with --hiveconf takes the highest precedence and overrides any existing value in hive-site.xml, mr3-site.xml, and tez-site.xml (not just in hive-site.xml). Hence the user can change the behavior of Hive on MR3 without modifying the preset configuration files at all. (Note that the user can use --hiveconf to configure not only Hive but also MR3 and Tez.) Alternatively the user can directly modify the preset configuration files to make the change permanent.
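For example, the following command overrides one Hive key and one MR3 key for a single run. The keys and values are illustrative only; any valid Hive, MR3, or Tez key can be passed in the same way.

$ hive/hiveserver2-service.sh start --tpcds --hivesrc3 \
    --hiveconf hive.exec.reducers.max=96 \
    --hiveconf mr3.am.resource.memory.mb=8192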
The user may create hivemetastore-site.xml and hiveserver2-site.xml in a configuration directory for Hive (conf/???/hive3) as configuration files for Metastore and HiveServer2, respectively. Hive automatically reads these files when reading hive-site.xml. The order of precedence of the configuration files is as follows (lower to higher):

hive-site.xml → hivemetastore-site.xml → hiveserver2-site.xml → --hiveconf command-line options
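As an illustration, the user might create hiveserver2-site.xml under conf/tpcds/hive3 with a setting intended only for HiveServer2. The key shown below is just an example of a standard Hive key, not a required setting.

$ cat > conf/tpcds/hive3/hiveserver2-site.xml <<EOF
<configuration>
  <property>
    <name>hive.server2.idle.session.timeout</name>
    <value>3600000</value>
  </property>
</configuration>
EOF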
Uploading MR3 and Tez jar files
The last step before running Hive on MR3 is to upload the MR3 and Tez jar files to HDFS. In order to run HiveServer2 or HiveCLI, the user should execute the following commands, which copy all the MR3 and Tez jar files (under mr3/mr3jar and tez/tezjar) to the directory specified by HDFS_LIB_DIR in env.sh:
$ mr3/upload-hdfslib-mr3.sh
$ tez/upload-hdfslib-tez.sh
When running Hive on MR3, these jar files are registered as local resources for Hadoop jobs and automatically distributed to worker nodes (where NodeManagers are running). This step is unnecessary for running Hive on MR3 in local mode, or for running Metastore and Beeline.
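After running the two scripts, the user can verify the upload by listing the directory specified by HDFS_LIB_DIR (assumed here to be /user/$USER/lib, as in the env.sh example above):

$ hdfs dfs -ls /user/$USER/lib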
To run HiveServer2 with doAs enabled (by setting hive.server2.enable.doAs to true in hive-site.xml), the user (typically the administrator user) should make the MR3 and Tez jar files readable to all end users after uploading them to HDFS. This is because every job runs under the end user who actually submits it. If the MR3 and Tez jar files are not readable to the end user, the job fails immediately because no files can be registered as local resources.
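For example, assuming HDFS_LIB_DIR is /user/hive/lib, the administrator user might make the uploaded jar files readable to all end users as follows:

$ hdfs dfs -chmod -R 755 /user/hive/lib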