Installing a pre-built MR3 release
Download an MR3 release for Spark and uncompress it. An MR3 release includes pre-built jar files of Spark-MR3 and MR3. We rename the new directory to mr3-run.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.5/spark-mr3-1.5-spark3.2.2.tar.gz
$ gunzip -c spark-mr3-1.5-spark3.2.2.tar.gz | tar xvf -
$ mv spark-mr3-1.5-spark3.2.2 mr3-run
$ cd mr3-run
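After unpacking, it can be useful to confirm that the release extracted cleanly. The helper below is our own sketch, not part of the release; it only checks for the top-level entries described in the structure that follows.

```shell
# Optional sanity check (our own helper, not part of the release):
# confirm that the unpacked directory contains the expected top-level entries.
check_release_layout() {
  for entry in env.sh mr3 kubernetes conf spark; do
    # Each expected file or directory must exist at the top level.
    [ -e "$1/$entry" ] || { echo "missing: $entry"; return 1; }
  done
  echo "release layout ok"
}
```

For example, `check_release_layout ~/mr3-run` prints `release layout ok` when the extraction succeeded.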
The following structure shows important files and directories in the release:
# script for configuring Spark on MR3
|-- env.sh
# pre-compiled MR3
|-- mr3
|   `-- mr3lib
|       `-- mr3-spark-1.0-assembly.jar
# scripts for populating and cleaning directories for Docker images
|-- build-k8s-spark.sh
|-- clean-k8s-spark.sh
# scripts and resources for running Spark on MR3 on Kubernetes
|-- kubernetes
|   |-- build-spark.sh
|   |-- config-run.sh
|   |-- run-spark-setup.sh
|   |-- spark
|   |   |-- Dockerfile
|   |   |-- env.sh
|   |   |-- conf
|   |   |   |-- core-site.xml
|   |   |   |-- hive-site.xml
|   |   |   |-- mr3-site.xml
|   |   |   |-- spark-defaults.conf
|   |   |   `-- spark-env.sh
|   |   `-- spark
|   |       |-- master-control.sh
|   |       |-- run-spark-shell.sh
|   |       |-- run-spark-submit.sh
|   |       |-- run-master.sh
|   |       `-- run-worker.sh
|   `-- spark-yaml
|       |-- cluster-role.yaml
|       |-- driver-service.yaml
|       |-- master-role.yaml
|       |-- master-service-account.yaml
|       |-- mr3-service.yaml
|       |-- spark-role.yaml
|       |-- spark-run.yaml
|       |-- spark-service-account.yaml
|       |-- workdir-pv.yaml
|       |-- workdir-pvc.yaml
|       |-- worker-role.yaml
|       `-- worker-service-account.yaml
# configuration directories for Spark on MR3 on Hadoop
|-- conf
|   |-- local
|   |-- cluster
|   `-- tpcds
# pre-compiled Spark-MR3 and scripts for running Spark on MR3 on Hadoop
`-- spark
    |-- compile-spark.sh
    |-- upload-hdfslib-spark.sh
    |-- run-spark-shell.sh
    |-- run-spark-submit.sh
    `-- sparkjar
        `-- sparkmr3
            `-- spark-mr3-3.2.2-assembly.jar
Setting environment variables
env.sh is a self-descriptive script located in the root directory of the installation. It contains major environment variables that should be set in every installation environment.

Running Spark on MR3 requires a Spark release (on both Kubernetes and Hadoop). The user can download a pre-built release of Spark from the Spark webpage. The following environment variables should be set in env.sh according to the configuration of the installation environment:
$ vi env.sh
SPARK_JARS_DIR=~/spark/jars
export SPARK_HOME=~/spark
SPARK_JARS_DIR specifies the directory containing Spark jar files in the Spark installation. The jar files in this directory are copied to the Docker image for Spark on MR3.

SPARK_HOME specifies the directory of the Spark installation. Spark on MR3 needs the scripts in the Spark installation (e.g., bin/spark-shell and bin/spark-submit).
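Since both variables point at an existing Spark installation, a quick check before building Docker images can catch a mistyped path early. The function below is a hypothetical helper, not part of the release; it assumes only the standard Spark layout (launcher scripts under bin/, jars under the jars directory).

```shell
# Hypothetical helper (not part of the release): sanity-check the two
# variables before building Docker images or running jobs.
check_spark_env() {
  spark_home=$1; spark_jars_dir=$2
  # SPARK_HOME must provide the launcher scripts that Spark on MR3 invokes.
  [ -x "$spark_home/bin/spark-submit" ] || { echo "missing $spark_home/bin/spark-submit"; return 1; }
  # SPARK_JARS_DIR must hold the jars that are copied into the Docker image.
  ls "$spark_jars_dir"/*.jar >/dev/null 2>&1 || { echo "no jars in $spark_jars_dir"; return 1; }
  echo "spark environment looks sane"
}
```

A typical invocation would be `check_spark_env "$SPARK_HOME" "$SPARK_JARS_DIR"` after sourcing env.sh.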
Building Spark on MR3 (optional)
To build Spark-MR3 from the source code, the following environment variables should be set in env.sh.
$ vi env.sh
SPARK_MR3_SRC=~/spark-mr3
SPARK_MR3_REV=3.2.2
SPARK_MR3_SRC specifies the directory containing the source code of Spark-MR3. The user can clone the GitHub repository (https://github.com/mr3project/spark-mr3.git) to obtain the source code.

SPARK_MR3_REV specifies the version of Spark-MR3 (e.g., 3.2.2 for running Spark 3.2.2 on MR3).
Then execute spark/compile-spark.sh in the MR3 release.
$ spark/compile-spark.sh
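After the build, one way to confirm it succeeded is to look for the assembly jar. The sketch below is our own assumption based on the release layout shown earlier, where the pre-built jar lives under spark/sparkjar/sparkmr3/; adjust the path if compile-spark.sh writes its output elsewhere.

```shell
# Hypothetical post-build check: look for a Spark-MR3 assembly jar in the
# given directory (assumed to be spark/sparkjar/sparkmr3/ per the layout above).
verify_assembly() {
  if ls "$1"/spark-mr3-*-assembly.jar >/dev/null 2>&1; then
    echo "assembly jar found"
  else
    echo "assembly jar missing"
  fi
}
```

For example, `verify_assembly spark/sparkjar/sparkmr3` should report that the jar was found once the build completes.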