Installing a pre-built MR3 release
Download an MR3 release for Spark and uncompress it. An MR3 release includes pre-built jar files of Spark-MR3 and MR3. We rename the new directory to mr3-run.
$ wget https://github.com/mr3project/mr3-release/releases/download/v1.5/spark-mr3-1.5-spark3.2.2.tar.gz
$ gunzip -c spark-mr3-1.5-spark3.2.2.tar.gz | tar xvf -
$ mv spark-mr3-1.5-spark3.2.2 mr3-run
$ cd mr3-run
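After unpacking, it can be useful to confirm that the release extracted cleanly. The helper below is our own sketch, not part of the release; it only checks for the top-level entries described in the structure that follows.

```shell
# Optional sanity check (our own helper, not part of the release):
# confirm that the unpacked directory contains the expected top-level entries.
check_release_layout() {
  for entry in env.sh mr3 kubernetes conf spark; do
    # Each expected file or directory must exist at the top level.
    [ -e "$1/$entry" ] || { echo "missing: $entry"; return 1; }
  done
  echo "release layout ok"
}
```

For example, `check_release_layout ~/mr3-run` prints `release layout ok` when the extraction succeeded.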
The following structure shows important files and directories in the release:
# script for configuring Spark on MR3
|-- env.sh
# pre-compiled MR3
|-- mr3
|   `-- mr3lib
|       `-- mr3-spark-1.0-assembly.jar
# scripts for populating and cleaning directories for Docker images
|-- build-k8s-spark.sh
|-- clean-k8s-spark.sh
# scripts and resources for running Spark on MR3 on Kubernetes
|-- kubernetes
|   |-- build-spark.sh
|   |-- config-run.sh
|   |-- run-spark-setup.sh
|   |-- spark
|   |   |-- Dockerfile
|   |   |-- env.sh
|   |   |-- conf
|   |   |   |-- core-site.xml
|   |   |   |-- hive-site.xml
|   |   |   |-- mr3-site.xml
|   |   |   |-- spark-defaults.conf
|   |   |   `-- spark-env.sh
|   |   `-- spark
|   |       |-- master-control.sh
|   |       |-- run-spark-shell.sh
|   |       |-- run-spark-submit.sh
|   |       |-- run-master.sh
|   |       `-- run-worker.sh
|   `-- spark-yaml
|       |-- cluster-role.yaml
|       |-- driver-service.yaml
|       |-- master-role.yaml
|       |-- master-service-account.yaml
|       |-- mr3-service.yaml
|       |-- spark-role.yaml
|       |-- spark-run.yaml
|       |-- spark-service-account.yaml
|       |-- workdir-pv.yaml
|       |-- workdir-pvc.yaml
|       |-- worker-role.yaml
|       `-- worker-service-account.yaml
# configuration directories for Spark on MR3 on Hadoop
|-- conf
|   |-- local
|   |-- cluster
|   `-- tpcds
# pre-compiled Spark-MR3 and scripts for running Spark on MR3 on Hadoop
`-- spark
    |-- compile-spark.sh
    |-- upload-hdfslib-spark.sh
    |-- run-spark-shell.sh
    |-- run-spark-submit.sh
    `-- sparkjar
        `-- sparkmr3
            `-- spark-mr3-3.2.2-assembly.jar
Setting environment variables
env.sh is a self-descriptive script located in the root directory of the installation. It contains major environment variables that should be set in every installation environment.

Running Spark on MR3 requires a Spark release (on both Kubernetes and Hadoop). The user can download a pre-built release of Spark from the Spark webpage. The following environment variables should be set in env.sh according to the configuration of the installation environment:
$ vi env.sh
SPARK_JARS_DIR=~/spark/jars
export SPARK_HOME=~/spark
SPARK_JARS_DIR specifies the directory containing Spark jar files in the Spark installation. The jar files in this directory are copied to the Docker image for Spark on MR3.

SPARK_HOME specifies the directory of the Spark installation. Spark on MR3 needs the scripts in the Spark installation (e.g., bin/spark-shell and bin/spark-submit).
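Since both variables point at an existing Spark installation, a quick check before building Docker images can catch a mistyped path early. The function below is a hypothetical helper, not part of the release; it assumes only the standard Spark layout (launcher scripts under bin/, jars under the jars directory).

```shell
# Hypothetical helper (not part of the release): sanity-check the two
# variables before building Docker images or running jobs.
check_spark_env() {
  spark_home=$1; spark_jars_dir=$2
  # SPARK_HOME must provide the launcher scripts that Spark on MR3 invokes.
  [ -x "$spark_home/bin/spark-submit" ] || { echo "missing $spark_home/bin/spark-submit"; return 1; }
  # SPARK_JARS_DIR must hold the jars that are copied into the Docker image.
  ls "$spark_jars_dir"/*.jar >/dev/null 2>&1 || { echo "no jars in $spark_jars_dir"; return 1; }
  echo "spark environment looks sane"
}
```

A typical invocation would be `check_spark_env "$SPARK_HOME" "$SPARK_JARS_DIR"` after sourcing env.sh.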
Building Spark on MR3 (optional)
To build Spark-MR3 from the source code, the following environment variables should be set in env.sh.
$ vi env.sh
SPARK_MR3_SRC=~/spark-mr3
SPARK_MR3_REV=3.2.2
SPARK_MR3_SRC specifies the directory containing the source code of Spark-MR3. The user can clone the GitHub repository (https://github.com/mr3project/spark-mr3.git) to obtain the source code.

SPARK_MR3_REV specifies the version of Spark-MR3 (e.g., 3.2.2 for running Spark 3.2.2 on MR3).
Then execute spark/compile-spark.sh in the MR3 release.
$ spark/compile-spark.sh
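After the build, one way to confirm it succeeded is to look for the assembly jar. The sketch below is our own assumption based on the release layout shown earlier, where the pre-built jar lives under spark/sparkjar/sparkmr3/; adjust the path if compile-spark.sh writes its output elsewhere.

```shell
# Hypothetical post-build check: look for a Spark-MR3 assembly jar in the
# given directory (assumed to be spark/sparkjar/sparkmr3/ per the layout above).
verify_assembly() {
  if ls "$1"/spark-mr3-*-assembly.jar >/dev/null 2>&1; then
    echo "assembly jar found"
  else
    echo "assembly jar missing"
  fi
}
```

For example, `verify_assembly spark/sparkjar/sparkmr3` should report that the jar was found once the build completes.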