The MR3 release includes scripts that help the user test Hive on MR3 using the TPC-DS benchmark, the de facto industry-standard benchmark for measuring the performance of big data systems such as Hive. It contains a script for generating TPC-DS datasets and another for running Hive on MR3. The scripts should also be useful for comparing the performance of MR3 against other systems.
Before generating a TPC-DS dataset, check the following:
- `HADOOP_HOME` in `env.sh` should be set. For local mode, `HADOOP_HOME_LOCAL` should be set.
- The script for generating TPC-DS datasets stores raw data in a directory under `/tmp/tpcds-generate/` (by executing a MapReduce job). Either delete `/tmp/tpcds-generate/` or make it accessible to the user before running the script.
- The script also runs HiveCLI to fill the Hive data warehouse. For local mode, `HIVE_CLIENT_HEAPSIZE` in `env.sh` should be set to a sufficiently large value so that HiveCLI can accommodate many ContainerWorkers.
- The script runs HiveCLI with MR3 as the execution engine. Hence the user should set the following configuration keys in `hive-site.xml` appropriately (assuming the all-in-one scheme):
<property>
<name>hive.mr3.map.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.reduce.task.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>hive.mr3.all-in-one.containergroup.memory.mb</name>
<value>4096</value>
</property>
If these keys are set to values that are too small (e.g., 1024), the script may fail with an OutOfMemoryError exception:
make: *** [store_sales] Error 2
make: *** Waiting for unfinished jobs....
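As a quick sanity check before running the script, the three memory values can be read back from `hive-site.xml` and compared. The snippet below is a minimal sketch, not part of the MR3 release: the `get_conf` helper is hypothetical and assumes each `<name>` and its `<value>` appear on consecutive lines, as in the example above.

```shell
# Write a sample hive-site.xml fragment (the values from the example above)
# so that the snippet is self-contained.
cat > /tmp/hive-site-sample.xml <<'EOF'
<property>
  <name>hive.mr3.map.task.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>hive.mr3.reduce.task.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>4096</value>
</property>
EOF

# Hypothetical helper: extract the value of a configuration key.
get_conf() {
  grep -A 1 "<name>$1</name>" /tmp/hive-site-sample.xml \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

MAP_MB=$(get_conf hive.mr3.map.task.memory.mb)
REDUCE_MB=$(get_conf hive.mr3.reduce.task.memory.mb)
GROUP_MB=$(get_conf hive.mr3.all-in-one.containergroup.memory.mb)

# In the all-in-one scheme, tasks run inside a single ContainerGroup,
# so each task's memory must not exceed the ContainerGroup's memory.
if [ "$MAP_MB" -le "$GROUP_MB" ] && [ "$REDUCE_MB" -le "$GROUP_MB" ]; then
  echo "memory settings look consistent"
else
  echo "task memory exceeds containergroup memory"
fi
```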
Now we illustrate how to generate a TPC-DS dataset and run the TPC-DS benchmark with an example. We assume that Hive on MR3 runs in a secure cluster with Kerberos and that Metastore uses a MySQL database.
- Metastore should already be running before generating TPC-DS datasets.
- HiveServer2 should be running before executing Hive queries with Beeline.
Generating a TPC-DS dataset
Before generating a TPC-DS dataset, set the following environment variables in `env.sh`:
HIVE_DS_FORMAT=orc
HIVE_DS_SCALE_FACTOR=2
- `HIVE_DS_FORMAT` specifies the format of the TPC-DS dataset eventually to be stored in the Hive data warehouse. Three options are available: `orc` for ORC format (default), `textfile` for text format, and `rcfile` for RC format.
- `HIVE_DS_SCALE_FACTOR` specifies the scale factor of the TPC-DS dataset in GB. For example, a scale factor of 1000 generates a dataset of 1000GB.
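As a small illustration of how the scale factor is used, the directory where the generation script later stores raw data can be derived from it (a sketch; only the `/tmp/tpcds-generate/` location and the GB interpretation of the scale factor are documented behavior):

```shell
# Values from env.sh in this example.
HIVE_DS_FORMAT=orc
HIVE_DS_SCALE_FACTOR=2

# gen-tpcds.sh stores raw data under /tmp/tpcds-generate/<scale factor>,
# so a scale factor of 2 (roughly 2GB of raw data) ends up here:
RAW_DATA_DIR=/tmp/tpcds-generate/${HIVE_DS_SCALE_FACTOR}
echo "$RAW_DATA_DIR"     # prints /tmp/tpcds-generate/2
```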
In non-local mode, the user should have write permission on the directory for the Hive data warehouse (on HDFS), which is specified by `HIVE3_HDFS_WAREHOUSE` or `HIVE4_HDFS_WAREHOUSE` in `env.sh`:
HIVE3_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
HIVE4_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
This is not a problem if the same user generates a TPC-DS dataset after starting Metastore.
The user should also have write permission on the database for Metastore, which is specified by `HIVE3_DATABASE_NAME` or `HIVE4_DATABASE_NAME` in `env.sh`:
HIVE3_DATABASE_NAME=hive3mr3
HIVE4_DATABASE_NAME=hive4mr3
This is not a problem if the same user generates a TPC-DS dataset after starting Metastore because the user name and the password are set in the same configuration file `hive-site.xml`:
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hivemr3</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
`hive/gen-tpcds.sh` is the script to run for generating TPC-DS datasets:
--local # Run jobs with configurations in conf/local/ (default).
--cluster # Run jobs with configurations in conf/cluster/.
--tpcds # Run jobs with configurations in conf/tpcds/.
--hivesrc3 # Choose hive3-mr3 (based on Hive 3.1.3)
--hivesrc4 # Choose hive4-mr3 (based on Hive 4.0.0, default)
Note that the user should specify which version of Hive to use. Here is an example of the command for generating a TPC-DS dataset:
$ hive/gen-tpcds.sh --tpcds --hivesrc3
The script first executes a MapReduce job to generate raw data in the directory `/tmp/tpcds-generate/[HIVE_DS_SCALE_FACTOR]` (such as `/tmp/tpcds-generate/1000`). Then it fills the data warehouse and populates the database by executing Hive queries with MR3 as the execution engine. After the script completes successfully, the directory for the data warehouse contains a new directory such as `tpcds_bin_partitioned_orc_10000.db`. At that point the user may delete the directory for raw data.
Depending on the scale factor, running `gen-tpcds.sh` may fail, especially when populating the database after generating raw data. In such a case, the user can retry after adjusting several configuration parameters in `hive/benchmarks/hive-testbench/settings/load-partitioned.sql` as well as `hive-site.xml`. Here are a few examples:
- The user can increase the value for the configuration key `hive.exec.reducers.max` to create more reducers. It should not, however, exceed 4095 in order to prevent the error `java.lang.IllegalArgumentException: bucketId out of range`.
- The user can increase the value for the configuration key `hive.mr3.am.task.max.failed.attempts` (e.g., to 5) so that queries can recover from fetch failures.
- If the script fails with too many fetch failures after generating raw data, the user can reduce network traffic by increasing the value for the configuration key `tez.runtime.shuffle.connect.timeout` (e.g., to 17500).
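For reference, these adjustments can be expressed in `hive-site.xml` along the following lines. The values below are only the illustrative ones suggested above; tune them for the actual cluster and workload.

```xml
<property>
  <name>hive.exec.reducers.max</name>
  <value>4095</value>
</property>
<property>
  <name>hive.mr3.am.task.max.failed.attempts</name>
  <value>5</value>
</property>
<property>
  <name>tez.runtime.shuffle.connect.timeout</name>
  <value>17500</value>
</property>
```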
When running `hive/gen-tpcds.sh` with the option `--hivesrc4` for Hive 4 on MR3, the user should explicitly set the configuration key `hive.metastore.warehouse.external.dir` to `/tmp/tpcds-generate` in `hive-site.xml`.
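In `hive-site.xml`, this setting looks as follows (the key and value are those stated above):

```xml
<property>
  <name>hive.metastore.warehouse.external.dir</name>
  <value>/tmp/tpcds-generate</value>
</property>
```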
Running Hive on MR3 using the TPC-DS benchmark
The user can run `hive/run-tpcds.sh` to test Hive on MR3 using the TPC-DS benchmark in various scenarios. The script runs either Beeline or HiveCLI to execute Hive queries in the TPC-DS benchmark. Here are a few important options for `hive/run-tpcds.sh`:
-q <num> # Run a single TPC-DS query where num=[1..99].
-n <num> # Specify the number of times that the query runs.
--beeline # Execute queries using Beeline.
--session # Execute all queries in the same MR3 session.
--amprocess # Run the MR3 DAGAppMaster in LocalProcess mode.
--hiveconf <key>=<value> # Add a configuration key/value.
- With `--beeline`, the script starts a new Beeline connection which attempts to connect to HiveServer2 running at the address specified in `env.sh`. By default, the script runs HiveCLI.
- With `--session`, the script executes a sequence of queries with a single Beeline connection or HiveCLI session. The sequence of queries to be executed is determined by the following suboptions:
  --querystart <num> # Specify the starting TPC-DS query.
  --queryend <num>   # Specify the last TPC-DS query.
  --queryskip <num>  # Skip a query.
  --repeat <num>     # Repeat the same sequence of queries.
- With `--amprocess`, if the script runs HiveCLI, the MR3 DAGAppMaster runs in LocalProcess mode. Hence a new process starts on the same machine where the script is run. This option is ignored if `--beeline` is used, in which case HiveServer2 starts a DAGAppMaster. (Currently `--amprocess` cannot be used in a secure cluster with Kerberos; see the documentation on LocalProcess mode.)
- The user can append as many instances of `--hiveconf` as necessary to the command.
Here are a few examples of running the script:
# run query 12 with a HiveCLI session
$ hive/run-tpcds.sh --local --hivesrc3 -q 12
# run query 12 five times with a new HiveCLI session for each run
$ hive/run-tpcds.sh --local --hivesrc3 -q 12 -n 5
# run query 12 five times with a new HiveCLI session for each run
# where the MR3 DAGAppMaster runs in LocalProcess mode
$ hive/run-tpcds.sh --tpcds --hivesrc3 -q 12 -n 5 --amprocess
# run queries 12 to 66 with a single HiveCLI session
$ hive/run-tpcds.sh --tpcds --hivesrc3 --session --querystart 12 --queryend 66
# run queries 12 to 66 with a single HiveCLI session
# whose MR3 DAGAppMaster runs in LocalProcess mode
$ hive/run-tpcds.sh --tpcds --hivesrc3 --session --querystart 12 --queryend 66 --amprocess
# run queries 12 to 66, skipping query 64, with a single Beeline connection
$ hive/run-tpcds.sh --tpcds --hivesrc3 --beeline --session --querystart 12 --queryend 66 --queryskip 64
# repeat five times running queries 12 to 66 with a single Beeline connection
$ hive/run-tpcds.sh --tpcds --hivesrc3 --beeline --session --querystart 12 --queryend 66 --repeat 5
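The `--session` suboptions amount to iterating over a range of query numbers. The sketch below illustrates the query sequence implied by the last two examples; it is an illustration of the option semantics, not code taken from `run-tpcds.sh`.

```shell
# Sequence implied by: --querystart 12 --queryend 66 --queryskip 64 --repeat 2
QUERY_START=12
QUERY_END=66
QUERY_SKIP=64
REPEAT=2

COUNT=0
r=1
while [ "$r" -le "$REPEAT" ]; do
  q=$QUERY_START
  while [ "$q" -le "$QUERY_END" ]; do
    if [ "$q" -ne "$QUERY_SKIP" ]; then
      # The real script would execute TPC-DS query $q here.
      COUNT=$((COUNT + 1))
    fi
    q=$((q + 1))
  done
  r=$((r + 1))
done
echo "$COUNT queries executed"   # 54 queries per pass x 2 passes = 108
```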
Sample queries in the TPC-DS benchmark
The script executes Hive queries drawn from the directory `hive/benchmarks/hive-testbench/sample-queries-tpcds-hive2`, which contains the complete set of 99 TPC-DS queries. The user can switch the directory by editing `hive/run-tpcds.sh`:
if [[ $HIVE_SRC_TYPE = 3 ]]; then
TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive2
fi
if [[ $HIVE_SRC_TYPE = 4 ]]; then
TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive2
fi
The TPC-DS benchmark in the MR3 release is originally from https://github.com/hortonworks/hive-testbench.