The MR3 release includes scripts that help the user test Hive on MR3 using the TPC-DS benchmark, the de facto industry-standard benchmark for measuring the performance of big data systems such as Hive. It contains a script for generating TPC-DS datasets and another for running Hive on MR3. The scripts should also be useful for comparing the performance of MR3 against other systems.
Before generating a TPC-DS dataset, check the following:
- `HADOOP_HOME` in `env.sh` should be set. For local mode, `HADOOP_HOME_LOCAL` should be set.
- The script for generating TPC-DS datasets stores raw data in a directory under `/tmp/tpcds-generate/` (by executing a MapReduce job). Either delete `/tmp/tpcds-generate/` or make it accessible to the user before running the script.
- The script also runs HiveCLI to fill the Hive data warehouse. For local mode, `HIVE_CLIENT_HEAPSIZE` in `env.sh` should be set to a sufficiently large value so that HiveCLI can accommodate many ContainerWorkers.
- The script runs HiveCLI with MR3 as the execution engine. Hence the user should set the following configuration keys in `hive-site.xml` appropriately (assuming the all-in-one scheme):
<property>
<name>hive.mr3.map.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.reduce.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.memory.mb</name>
<value>4096</value>
</property>
If these keys are set to values that are too small (e.g., 1024), the script may fail with an OutOfMemoryError exception:
make: *** [store_sales] Error 2
make: *** Waiting for unfinished jobs....
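As a quick sanity check before running the script, the three memory values can be read back from `hive-site.xml` and compared. The snippet below is a minimal sketch, not part of the MR3 release: the `get_conf` helper is hypothetical and assumes each `<name>` and its `<value>` appear on consecutive lines, as in the example above.

```shell
# Write a sample hive-site.xml fragment (the values from the example above)
# so that the snippet is self-contained.
cat > /tmp/hive-site-sample.xml <<'EOF'
<property>
  <name>hive.mr3.map.task.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>hive.mr3.reduce.task.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>4096</value>
</property>
EOF

# Hypothetical helper: extract the value of a configuration key.
get_conf() {
  grep -A 1 "<name>$1</name>" /tmp/hive-site-sample.xml \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

MAP_MB=$(get_conf hive.mr3.map.task.memory.mb)
REDUCE_MB=$(get_conf hive.mr3.reduce.task.memory.mb)
GROUP_MB=$(get_conf hive.mr3.all-in-one.containergroup.memory.mb)

# In the all-in-one scheme, tasks run inside a single ContainerGroup,
# so each task's memory must not exceed the ContainerGroup's memory.
if [ "$MAP_MB" -le "$GROUP_MB" ] && [ "$REDUCE_MB" -le "$GROUP_MB" ]; then
  echo "memory settings look consistent"
else
  echo "task memory exceeds containergroup memory"
fi
```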
Now we illustrate how to generate a TPC-DS dataset and run the TPC-DS benchmark with an example. We assume that Hive on MR3 runs in a secure cluster with Kerberos and that Metastore uses a MySQL database.
- Metastore should already be running before generating TPC-DS datasets.
- HiveServer2 should be running before executing Hive queries with Beeline.
Generating a TPC-DS dataset
Before generating a TPC-DS dataset, set the following environment variables in `env.sh`:
HIVE_DS_FORMAT=orc
HIVE_DS_SCALE_FACTOR=2
- `HIVE_DS_FORMAT` specifies the format of the TPC-DS dataset eventually to be stored in the Hive data warehouse. Three options are available: `orc` for ORC format (default), `textfile` for text format, and `rcfile` for RC format.
- `HIVE_DS_SCALE_FACTOR` specifies the scale factor of the TPC-DS dataset in GB. For example, a scale factor of 1000 generates a dataset of 1000GB.
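As a small illustration of how the scale factor is used, the directory where the generation script later stores raw data can be derived from it (a sketch; only the `/tmp/tpcds-generate/` location and the GB interpretation of the scale factor are documented behavior):

```shell
# Values from env.sh in this example.
HIVE_DS_FORMAT=orc
HIVE_DS_SCALE_FACTOR=2

# gen-tpcds.sh stores raw data under /tmp/tpcds-generate/<scale factor>,
# so a scale factor of 2 (roughly 2GB of raw data) ends up here:
RAW_DATA_DIR=/tmp/tpcds-generate/${HIVE_DS_SCALE_FACTOR}
echo "$RAW_DATA_DIR"     # prints /tmp/tpcds-generate/2
```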
In non-local mode, the user should have write permission on the directory for the Hive data warehouse (on HDFS), which is specified by `HIVE3_HDFS_WAREHOUSE` or `HIVE4_HDFS_WAREHOUSE` in `env.sh`:
HIVE3_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
HIVE4_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
This is not a problem if the same user generates a TPC-DS dataset after starting Metastore.
The user should also have write permission on the database for Metastore, which is specified by `HIVE3_DATABASE_NAME` or `HIVE4_DATABASE_NAME` in `env.sh`:
HIVE3_DATABASE_NAME=hive3mr3
HIVE4_DATABASE_NAME=hive4mr3
This is not a problem if the same user generates a TPC-DS dataset after starting Metastore because the user name and the password are set in the same configuration file `hive-site.xml`:
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hivemr3</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
`hive/gen-tpcds.sh` is the script to run for generating TPC-DS datasets:
--local # Run jobs with configurations in conf/local/ (default).
--cluster # Run jobs with configurations in conf/cluster/.
--tpcds # Run jobs with configurations in conf/tpcds/.
--hivesrc3 # Choose hive3-mr3 (based on Hive 3.1.3)
--hivesrc4 # Choose hive4-mr3 (based on Hive 4.0.0, default)
Note that the user should specify which version of Hive to use. Here is an example of the command for generating a TPC-DS dataset:
$ hive/gen-tpcds.sh --tpcds --hivesrc3
The script first executes a MapReduce job to generate raw data in the directory `/tmp/tpcds-generate/[HIVE_DS_SCALE_FACTOR]` (such as `/tmp/tpcds-generate/1000`). Then it fills the data warehouse and populates the database by executing Hive queries with MR3 as the execution engine. After the script completes successfully, the directory for the data warehouse contains a new directory such as `tpcds_bin_partitioned_orc_10000.db`. At that point the user may delete the directory for raw data.
Depending on the scale factor, running `gen-tpcds.sh` may fail, especially when populating the database after generating raw data. In such a case, the user can retry after adjusting several configuration parameters in `hive/benchmarks/hive-testbench/settings/load-partitioned.sql` as well as `hive-site.xml`. Here are a few examples:
- The user can increase the value for the configuration key `hive.exec.reducers.max` to create more reducers. It should not, however, exceed 4095 in order to prevent the error `java.lang.IllegalArgumentException: bucketId out of range`.
- The user can increase the value for the configuration key `hive.mr3.am.task.max.failed.attempts` (e.g., to 5) so that queries can recover from fetch failures.
- If the script fails with too many fetch failures after generating raw data, the user can reduce network traffic by increasing the value for the configuration key `tez.runtime.shuffle.connect.timeout` (e.g., to 17500).
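For reference, these adjustments can be expressed in `hive-site.xml` along the following lines. The values below are only the illustrative ones suggested above; tune them for the actual cluster and workload.

```xml
<property>
  <name>hive.exec.reducers.max</name>
  <value>4095</value>
</property>
<property>
  <name>hive.mr3.am.task.max.failed.attempts</name>
  <value>5</value>
</property>
<property>
  <name>tez.runtime.shuffle.connect.timeout</name>
  <value>17500</value>
</property>
```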
When running `hive/gen-tpcds.sh` with the option `--hivesrc4` for Hive 4 on MR3, the user should explicitly set the configuration key `hive.metastore.warehouse.external.dir` to `/tmp/tpcds-generate` in `hive-site.xml`.
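In `hive-site.xml`, this setting looks as follows (the key and value are those stated above):

```xml
<property>
  <name>hive.metastore.warehouse.external.dir</name>
  <value>/tmp/tpcds-generate</value>
</property>
```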
Running Hive on MR3 using the TPC-DS benchmark
The user can run `hive/run-tpcds.sh` to test Hive on MR3 using the TPC-DS benchmark in various scenarios. The script runs either Beeline or HiveCLI to execute Hive queries in the TPC-DS benchmark. Here are a few important options for `hive/run-tpcds.sh`:
-q <num> # Run a single TPC-DS query where num=[1..99].
-n <num> # Specify the number of times that the query runs.
--beeline # Execute queries using Beeline.
--session # Execute all queries in the same MR3 session.
--amprocess # Run the MR3 DAGAppMaster in LocalProcess mode.
--hiveconf <key>=<value> # Add a configuration key/value.
- With `--beeline`, the script starts a new Beeline connection which attempts to connect to HiveServer2 running at the address specified in `env.sh`. By default, the script runs HiveCLI.
- With `--session`, the script executes a sequence of queries with a single Beeline connection or HiveCLI session. The sequence of queries to be executed is determined by the following suboptions:
  --querystart <num> # Specify the starting TPC-DS query.
  --queryend <num>   # Specify the last TPC-DS query.
  --queryskip <num>  # Skip a query.
  --repeat <num>     # Repeat the same sequence of queries.
- With `--amprocess`, if the script runs HiveCLI, the MR3 DAGAppMaster runs in LocalProcess mode. Hence a new process starts on the same machine where the script is run. This option is ignored if `--beeline` is used, in which case HiveServer2 starts a DAGAppMaster. (Currently `--amprocess` cannot be used in a secure cluster with Kerberos; see the documentation on LocalProcess mode.)
- The user can append as many instances of `--hiveconf` as necessary to the command.
Here are a few examples of running the script:
# run query 12 with a HiveCLI session
$ hive/run-tpcds.sh --local --hivesrc3 -q 12
# run query 12 five times with a new HiveCLI session for each run
$ hive/run-tpcds.sh --local --hivesrc3 -q 12 -n 5
# run query 12 five times with a new HiveCLI session for each run
# where the MR3 DAGAppMaster runs in LocalProcess mode
$ hive/run-tpcds.sh --tpcds --hivesrc3 -q 12 -n 5 --amprocess
# run queries 12 to 66 with a single HiveCLI session
$ hive/run-tpcds.sh --tpcds --hivesrc3 --session --querystart 12 --queryend 66
# run queries 12 to 66 with a single HiveCLI session
# whose MR3 DAGAppMaster runs in LocalProcess mode
$ hive/run-tpcds.sh --tpcds --hivesrc3 --session --querystart 12 --queryend 66 --amprocess
# run queries 12 to 66, skipping query 64, with a single Beeline connection
$ hive/run-tpcds.sh --tpcds --hivesrc3 --beeline --session --querystart 12 --queryend 66 --queryskip 64
# repeat five times running queries 12 to 66 with a single Beeline connection
$ hive/run-tpcds.sh --tpcds --hivesrc3 --beeline --session --querystart 12 --queryend 66 --repeat 5
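The `--session` suboptions amount to iterating over a range of query numbers. The sketch below illustrates the query sequence implied by the last two examples; it is an illustration of the option semantics, not code taken from `run-tpcds.sh`.

```shell
# Sequence implied by: --querystart 12 --queryend 66 --queryskip 64 --repeat 2
QUERY_START=12
QUERY_END=66
QUERY_SKIP=64
REPEAT=2

COUNT=0
r=1
while [ "$r" -le "$REPEAT" ]; do
  q=$QUERY_START
  while [ "$q" -le "$QUERY_END" ]; do
    if [ "$q" -ne "$QUERY_SKIP" ]; then
      # The real script would execute TPC-DS query $q here.
      COUNT=$((COUNT + 1))
    fi
    q=$((q + 1))
  done
  r=$((r + 1))
done
echo "$COUNT queries executed"   # 54 queries per pass x 2 passes = 108
```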
Sample queries in the TPC-DS benchmark
The script executes Hive queries drawn from the directory `hive/benchmarks/hive-testbench/sample-queries-tpcds-hive2`, which contains the complete set of 99 TPC-DS queries. The user can switch the directory by editing `hive/run-tpcds.sh`:
if [[ $HIVE_SRC_TYPE = 3 ]]; then
TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive2
fi
if [[ $HIVE_SRC_TYPE = 4 ]]; then
TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive2
fi
The TPC-DS benchmark in the MR3 release is originally from https://github.com/hortonworks/hive-testbench.