To install Spark on MR3, download an MR3 release for Spark and uncompress it.
An MR3 release includes pre-built jar files of Spark-MR3 and MR3.
We rename the new directory to mr3-run.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.3/spark-mr3-1.3-spark3.0.3.tar.gz
$ gunzip -c spark-mr3-1.3-spark3.0.3.tar.gz | tar xvf -
$ mv spark-mr3-1.3-spark3.0.3 mr3-run
$ cd mr3-run
The following structure shows important files and directories in the release:
# script for configuring Spark on MR3
|-- env.sh
# pre-compiled MR3
|-- mr3
|   `-- mr3lib
|       `-- mr3-spark-1.0-assembly.jar
# scripts for populating and cleaning directories for Docker images
|-- build-k8s-spark.sh
|-- clean-k8s-spark.sh
# scripts and resources for running Spark on MR3 on Kubernetes
|-- kubernetes
|   |-- build-spark.sh
|   |-- config-run.sh
|   |-- run-spark-setup.sh
|   |-- spark
|   |   |-- Dockerfile
|   |   |-- env.sh
|   |   |-- conf
|   |   |   |-- mr3-site.xml
|   |   |   |-- spark-defaults.conf
|   |   |   `-- spark-env.sh
|   |   `-- spark
|   |       |-- run-spark-shell.sh
|   |       |-- run-spark-submit.sh
|   |       |-- run-master.sh
|   |       `-- run-worker.sh
|   `-- spark-yaml
|       |-- cluster-role.yaml
|       |-- driver-service.yaml
|       |-- master-role.yaml
|       |-- master-service-account.yaml
|       |-- mr3-service.yaml
|       |-- prometheus-service.yaml
|       |-- spark-role.yaml
|       |-- spark-service-account.yaml
|       |-- spark-submit.yaml
|       |-- workdir-pv.yaml
|       |-- workdir-pvc.yaml
|       |-- worker-role.yaml
|       `-- worker-service-account.yaml
# configuration directories for Spark on MR3 on Hadoop
|-- conf
|   |-- local
|   |-- cluster
|   `-- tpcds
# pre-compiled Spark-MR3 and scripts for running Spark on MR3 on Hadoop
`-- spark
    |-- compile-spark.sh
    |-- upload-hdfslib-spark.sh
    |-- run-spark-shell.sh
    |-- run-spark-submit.sh
    `-- sparkjar
        `-- sparkmr3
            `-- spark-mr3-3.0.3-assembly.jar
Setting environment variables for Spark on MR3
env.sh is a self-descriptive script located in the root directory of the installation. It contains the major environment variables that should be set in every installation environment.
Running Spark on MR3 requires a Spark release, both on Kubernetes and on Hadoop. The user can download a pre-built release of Spark from the Spark website.
The following environment variables should be set in env.sh
according to the configuration of the installation environment:
$ vi env.sh
SPARK_JARS_DIR=~/spark/assembly/target/scala-2.12/jars
export SPARK_HOME=~/spark
SPARK_JARS_DIR specifies the directory containing the Spark jar files in the Spark installation. The jar files in this directory are copied to the Docker image for Spark on MR3.
SPARK_HOME specifies the directory of the Spark installation. Spark on MR3 needs the scripts in the Spark installation (e.g., bin/spark-shell and bin/spark-submit).
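For example, with a pre-built Spark 3.0.3 release downloaded from the Apache archive, both variables point into the unpacked directory. Note that in a binary distribution the jar files reside under jars/, whereas the path assembly/target/scala-2.12/jars shown above applies to a Spark built from source. The download URL and directory names below are illustrative only.
$ wget https://archive.apache.org/dist/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
$ tar -xzf spark-3.0.3-bin-hadoop2.7.tgz -C ~
$ vi env.sh
SPARK_JARS_DIR=~/spark-3.0.3-bin-hadoop2.7/jars
export SPARK_HOME=~/spark-3.0.3-bin-hadoop2.7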
Setting up for Spark on MR3 on Hadoop (for Hadoop only)
To run Spark on MR3 on Hadoop, the following environment variables should be set in env.sh.
$ vi env.sh
export HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
HDFS_LIB_DIR=/user/$USER/lib
HADOOP_HOME_LOCAL=$HADOOP_HOME
HADOOP_NATIVE_LIB=$HADOOP_HOME_LOCAL/lib/native
SECURE_MODE=false
USER_PRINCIPAL=spark@HADOOP
USER_KEYTAB=/home/spark/spark.keytab
MR3_TEZ_ENABLED=false
MR3_SPARK_ENABLED=true
HDFS_LIB_DIR specifies the directory on HDFS to which MR3 jar files are uploaded. Hence it is relevant only for non-local mode.
HADOOP_HOME_LOCAL specifies the directory of the Hadoop installation to use in local mode, in which everything runs on a single machine and Yarn is not required.
SECURE_MODE specifies whether or not the cluster is secured with Kerberos.
USER_PRINCIPAL and USER_KEYTAB specify the principal and the keytab file for the user executing Spark.
MR3_TEZ_ENABLED and MR3_SPARK_ENABLED specify which internal runtime (Tez or Spark) MR3 should use.
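When SECURE_MODE is set to true, the user typically needs a valid Kerberos ticket before uploading files to HDFS or submitting jobs. A minimal sketch, reusing the principal and keytab from the example above:
$ kinit -kt /home/spark/spark.keytab spark@HADOOP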
Then the user should copy all the jar files (of MR3, Spark-MR3, and Spark) to HDFS.
$ mr3/upload-hdfslib-mr3.sh
$ spark/upload-hdfslib-spark.sh
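To verify the upload, the user can list the directory specified by HDFS_LIB_DIR. The exact layout underneath depends on the upload scripts, so the command below is only a sketch.
$ hdfs dfs -ls -R /user/$USER/lib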
Building Spark-MR3 (Optional)
To build Spark-MR3 from the source code, the following environment variables should be set in env.sh.
$ vi env.sh
SPARK_MR3_SRC=~/spark-mr3
SPARK_MR3_REV=3.0.3
SPARK_MR3_SRC specifies the directory containing the source code of Spark-MR3. The user can clone the GitHub repository (https://github.com/mr3project/spark-mr3.git) to obtain the source code.
SPARK_MR3_REV specifies the version of Spark-MR3 (e.g., 3.0.3 for running Spark 3.0.3 on MR3).
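If the source code is not yet available locally, it can first be cloned into the directory specified by SPARK_MR3_SRC (assuming the repository URL given above):
$ git clone https://github.com/mr3project/spark-mr3.git ~/spark-mr3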
Then execute spark/compile-spark.sh
in the MR3 release.
$ spark/compile-spark.sh
Building a Docker image (for Kubernetes only)
The user can build a Docker image for running Spark on MR3 on Kubernetes.
(We assume that the user can execute the command docker
so as to build a Docker image.)
The first step is to collect all necessary files in the directory kubernetes/spark by executing build-k8s-spark.sh, which copies the scripts and jar files from the Spark installation (specified by SPARK_HOME in env.sh).
$ clean-k8s-spark.sh
$ build-k8s-spark.sh
$ ls kubernetes/spark/mr3/mr3lib/ # MR3 jar file
mr3-spark-assembly.jar
$ ls kubernetes/spark/spark/sparkmr3/ # Spark-MR3 jar file
spark-mr3-assembly.jar
$ ls kubernetes/spark/spark/bin/ # Spark scripts
...
$ ls kubernetes/spark/spark/jars/ # Spark jar files
...
Next the user should set two environment variables in kubernetes/spark/env.sh
(not env.sh
in the installation directory):
$ vi kubernetes/spark/env.sh
DOCKER_SPARK_IMG=10.1.90.9:5000/spark3:latest
SPARK_DOCKER_USER=root
DOCKER_SPARK_IMG is the full name of the Docker image, including a tag. It specifies the name of the Docker image for running Spark on MR3, which may include the address of a running Docker server.
SPARK_DOCKER_USER should match the user specified in kubernetes/spark/Dockerfile (which is root by default).
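For example, when the image is to be pushed to Docker Hub under a personal account instead of a private registry, the settings might look as follows (the account name and tag are placeholders):
$ vi kubernetes/spark/env.sh
DOCKER_SPARK_IMG=myaccount/spark3:latest
SPARK_DOCKER_USER=root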
The last step is to build a Docker image from Dockerfile
in the directory kubernetes/spark/
by executing kubernetes/build-spark.sh.
$ kubernetes/build-spark.sh
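If DOCKER_SPARK_IMG points to a private registry or Docker Hub, the image usually needs to be pushed afterwards so that Kubernetes nodes can pull it. Whether build-spark.sh already pushes the image depends on the script, so the command below is only a sketch using the example image name above.
$ docker push 10.1.90.9:5000/spark3:latest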