To install Spark on MR3, download an MR3 release for Spark and uncompress it. An MR3 release includes pre-built jar files of Spark-MR3 and MR3. We rename the new directory to mr3-run.

$ wget https://github.com/mr3project/mr3-release/releases/download/v1.3/spark-mr3-1.3-spark3.0.3.tar.gz
$ gunzip -c spark-mr3-1.3-spark3.0.3.tar.gz | tar xvf -
$ mv spark-mr3-1.3-spark3.0.3 mr3-run
$ cd mr3-run

The following structure shows important files and directories in the release:

# script for configuring Spark on MR3
|-- env.sh

# pre-compiled MR3
`-- mr3
    `-- mr3lib
        `-- mr3-spark-1.0-assembly.jar

# scripts for populating and cleaning directories for Docker images
|-- build-k8s-spark.sh
|-- clean-k8s-spark.sh

# scripts and resources for running Spark on MR3 on Kubernetes
`-- kubernetes
    |-- build-spark.sh
    |-- config-run.sh
    |-- run-spark-setup.sh
    |-- spark
    |   |-- Dockerfile
    |   |-- env.sh
    |   |-- conf
    |   |   |-- mr3-site.xml
    |   |   |-- spark-defaults.conf
    |   |   `-- spark-env.sh
    |   `-- spark
    |       |-- run-spark-shell.sh
    |       |-- run-spark-submit.sh
    |       |-- run-master.sh
    |       `-- run-worker.sh
    `-- spark-yaml
        |-- cluster-role.yaml
        |-- driver-service.yaml
        |-- master-role.yaml
        |-- master-service-account.yaml
        |-- mr3-service.yaml
        |-- prometheus-service.yaml
        |-- spark-role.yaml
        |-- spark-service-account.yaml
        |-- spark-submit.yaml
        |-- workdir-pv.yaml
        |-- workdir-pvc.yaml
        |-- worker-role.yaml
        `-- worker-service-account.yaml

# configuration directories for Spark on MR3 on Hadoop
`-- conf
    |-- local
    |-- cluster
    `-- tpcds

# pre-compiled Spark-MR3 and scripts for running Spark on MR3 on Hadoop
`-- spark
    |-- compile-spark.sh
    |-- upload-hdfslib-spark.sh
    |-- run-spark-shell.sh
    |-- run-spark-submit.sh
    `-- sparkjar
        `-- sparkmr3
            `-- spark-mr3-3.0.3-assembly.jar

Setting environment variables for Spark on MR3

env.sh is a self-descriptive script located in the root directory of the installation. It contains major environment variables that should be set in every installation environment.

Running Spark on MR3 requires a Spark release (on both Kubernetes and Hadoop). The user can download a pre-built release of Spark from the Apache Spark website. The following environment variables should be set in env.sh according to the configuration of the installation environment:

$ vi env.sh

SPARK_JARS_DIR=~/spark/assembly/target/scala-2.12/jars
export SPARK_HOME=~/spark
  • SPARK_JARS_DIR specifies the directory containing the Spark jar files in the Spark installation. The jar files in this directory are copied to the Docker image for Spark on MR3. See the example after this list.
  • SPARK_HOME specifies the directory of the Spark installation. Spark on MR3 needs the scripts in the Spark installation (e.g., bin/spark-shell and bin/spark-submit).
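
For example, with a pre-built release of Spark 3.0.3 downloaded from the Apache archive (the URL and file name below are an assumption; adjust them to the Spark version actually in use), the jar files reside under jars/ rather than under assembly/target/scala-2.12/jars/, which is the layout of a source build:

$ wget https://archive.apache.org/dist/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz
$ tar xzf spark-3.0.3-bin-hadoop3.2.tgz -C ~
$ vi env.sh

SPARK_JARS_DIR=~/spark-3.0.3-bin-hadoop3.2/jars
export SPARK_HOME=~/spark-3.0.3-bin-hadoop3.2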

Setting up for Spark on MR3 on Hadoop (for Hadoop only)

To run Spark on MR3 on Hadoop, the following environment variables should be set in env.sh.

$ vi env.sh

export HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
HDFS_LIB_DIR=/user/$USER/lib
HADOOP_HOME_LOCAL=$HADOOP_HOME
HADOOP_NATIVE_LIB=$HADOOP_HOME_LOCAL/lib/native

SECURE_MODE=false

USER_PRINCIPAL=spark@HADOOP
USER_KEYTAB=/home/spark/spark.keytab

MR3_TEZ_ENABLED=false
MR3_SPARK_ENABLED=true
  • HDFS_LIB_DIR specifies the directory on HDFS to which the MR3 jar files are uploaded. It is used only in non-local mode.
  • HADOOP_HOME_LOCAL specifies the directory of the Hadoop installation to use in local mode, in which everything runs on a single machine and Yarn is not required.
  • SECURE_MODE specifies whether or not the cluster is secured with Kerberos.
  • USER_PRINCIPAL and USER_KEYTAB specify the principal and keytab file for the user executing Spark.
  • MR3_TEZ_ENABLED and MR3_SPARK_ENABLED specify which internal runtime (Tez or Spark) to use in MR3.
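
For a cluster secured with Kerberos, the user would instead set SECURE_MODE=true and provide a valid principal and keytab. As a quick sanity check (a sketch, using the example values from the env.sh fragment above), the keytab can be tested with kinit:

$ kinit -kt /home/spark/spark.keytab spark@HADOOP
$ klist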

Then the user should upload all the jar files (of MR3, Spark-MR3, and Spark) to HDFS.

$ mr3/upload-hdfslib-mr3.sh
$ spark/upload-hdfslib-spark.sh
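
To check that the upload has succeeded, the user can list the directory specified by HDFS_LIB_DIR:

$ hdfs dfs -ls /user/$USER/lib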

Building Spark-MR3 (Optional)

To build Spark-MR3 from the source code, the following environment variables should be set in env.sh.

$ vi env.sh

SPARK_MR3_SRC=~/spark-mr3
SPARK_MR3_REV=3.0.3
  • SPARK_MR3_SRC specifies the directory containing the source code of Spark-MR3. The user can clone the GitHub repository (https://github.com/mr3project/spark-mr3.git) to obtain the source code.
  • SPARK_MR3_REV specifies the version of Spark-MR3 (e.g., 3.0.3 for running Spark 3.0.3 on MR3).
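
For example, the user might prepare the source code and set the variables as follows (using the repository mentioned above; the target directory ~/spark-mr3 is just an example):

$ git clone https://github.com/mr3project/spark-mr3.git ~/spark-mr3
$ vi env.sh

SPARK_MR3_SRC=~/spark-mr3
SPARK_MR3_REV=3.0.3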

Then execute spark/compile-spark.sh in the MR3 release.

$ spark/compile-spark.sh
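
If the build succeeds, a new assembly jar file should appear under spark/sparkjar/sparkmr3/, with a name that follows SPARK_MR3_REV:

$ ls spark/sparkjar/sparkmr3/
spark-mr3-3.0.3-assembly.jar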

Building a Docker image (for Kubernetes only)

The user can build a Docker image for running Spark on MR3 on Kubernetes. (We assume that the user can execute the docker command to build Docker images.) The first step is to collect all the necessary files in the directory kubernetes/spark by executing build-k8s-spark.sh, which copies the script and jar files from the Spark installation (specified by SPARK_HOME in env.sh).

$ ./clean-k8s-spark.sh
$ ./build-k8s-spark.sh
$ ls kubernetes/spark/mr3/mr3lib/       # MR3 jar file
mr3-spark-assembly.jar
$ ls kubernetes/spark/spark/sparkmr3/   # Spark-MR3 jar file
spark-mr3-assembly.jar
$ ls kubernetes/spark/spark/bin/        # Spark scripts
...
$ ls kubernetes/spark/spark/jars/       # Spark jar files
...

Next, the user should set two environment variables in kubernetes/spark/env.sh (not env.sh in the installation directory):

$ vi kubernetes/spark/env.sh
DOCKER_SPARK_IMG=10.1.90.9:5000/spark3:latest
SPARK_DOCKER_USER=root
  • DOCKER_SPARK_IMG is the full name of the Docker image, including a tag. It specifies the name of the Docker image for running Spark on MR3 and may include the address of a running Docker server.
  • SPARK_DOCKER_USER should match the user specified in kubernetes/spark/Dockerfile (which is root by default).
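
To check which user the Dockerfile specifies, one can inspect it directly (a minimal check; root is the default):

$ grep USER kubernetes/spark/Dockerfile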

The last step is to build a Docker image from the Dockerfile in the directory kubernetes/spark/ by executing kubernetes/build-spark.sh.

$ kubernetes/build-spark.sh
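
If DOCKER_SPARK_IMG includes the address of a private Docker registry (as in the example above), the image typically needs to be pushed after building so that Kubernetes nodes can pull it. This is a sketch; depending on its configuration, build-spark.sh may already push the image:

$ docker push 10.1.90.9:5000/spark3:latest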