To run Spark on MR3 with Kerberos, the user needs a principal and a keytab file containing valid Kerberos credentials. Since token renewal is managed by the Spark driver, the keytab file is not distributed to DAGAppMaster and ContainerWorker Pods. As a result, Spark on MR3 is simpler to run with Kerberos than Hive on MR3.

Spark on MR3 on Kubernetes

In our example, we make the following assumptions:

  • The principal name is spark@RED.
  • The keytab file is spark.keytab.
  • The KDC server runs on a host red0 with IP address 10.1.91.4.
  • Spark on MR3 accesses an HDFS file system hdfs://red0:8020.

Before running Spark on MR3 on Kubernetes with Kerberos, copy the keytab file spark.keytab to the directory kubernetes/spark/key and the Kerberos configuration file krb5.conf to the directory kubernetes/spark/conf.

$ ls kubernetes/spark/key
spark.keytab

$ ls kubernetes/spark/conf/krb5.conf
kubernetes/spark/conf/krb5.conf
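As a quick sanity check, the keytab can be inspected with klist from the MIT Kerberos client tools to confirm that it contains entries for the expected principal (the command only reads the file and does not contact the KDC):

```shell
# List the principals stored in the keytab, with key version numbers and timestamps.
$ klist -k -t kubernetes/spark/key/spark.keytab
```

The output should include at least one entry for spark@RED.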

Then update the file kubernetes/spark/conf/core-site.xml.

$ vi kubernetes/spark/conf/core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
  • fs.defaultFS should be set to file:///, not to an HDFS address like hdfs://red0:8020, because from the viewpoint of Spark on MR3 running in a Kubernetes cluster, the default file system is the local file system.
  • hadoop.security.authentication should be set to kerberos to enable Kerberos for authentication.

Optionally update the file kubernetes/spark/conf/spark-defaults.conf.

$ vi kubernetes/spark/conf/spark-defaults.conf

spark.kerberos.access.hadoopFileSystems=hdfs://red0:8020
  • spark.kerberos.access.hadoopFileSystems lists the Kerberized HDFS file systems that Spark on MR3 accesses, so that the Spark driver can obtain delegation tokens for them.
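If Spark jobs access more than one Kerberized file system, spark.kerberos.access.hadoopFileSystems accepts a comma-separated list. The second address below is purely illustrative:

```
spark.kerberos.access.hadoopFileSystems=hdfs://red0:8020,hdfs://blue0:8020
```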

The remaining configuration files can be updated as explained below. After updating all the configuration files, see Running Spark on MR3 on Kubernetes.

Option 1. Running the Spark driver inside Kubernetes

kubernetes/spark/env.sh

$ vi kubernetes/spark/env.sh 

CREATE_KEYTAB_SECRET=true

SPARK_KERBEROS_KEYTAB=$KEYTAB_MOUNT_DIR/spark.keytab
SPARK_KERBEROS_PRINCIPAL=spark@RED
SPARK_KERBEROS_USER=spark
  • CREATE_KEYTAB_SECRET should be set to true so that a Secret is created from the files in the directory kubernetes/spark/key.
  • SPARK_KERBEROS_KEYTAB specifies the path to the keytab file inside the Spark driver Pod.
  • SPARK_KERBEROS_PRINCIPAL specifies the principal name.
  • SPARK_KERBEROS_USER specifies the user name derived from the principal name.
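With CREATE_KEYTAB_SECRET=true, one can verify after executing the run script that the Secret holding the keytab was actually created. The namespace name below (sparkmr3) is an assumption; substitute the namespace configured in your setup:

```shell
# List Secrets in the namespace used by Spark on MR3 (namespace name is an assumption).
$ kubectl get secrets -n sparkmr3
```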

kubernetes/spark/conf/mr3-site.xml

$ vi kubernetes/spark/conf/mr3-site.xml

<property>
  <name>mr3.k8s.host.aliases</name>
  <value>red0=10.1.91.4</value>
</property>
  • mr3.k8s.host.aliases should include a mapping for the KDC server (as well as the HDFS NameNode).

kubernetes/spark-yaml/spark-run.yaml

$ vi kubernetes/spark-yaml/spark-run.yaml

spec:
  hostAliases:
  - ip: "10.1.91.4"
    hostnames:
    - "red0"
  • The spec/hostAliases field should include aliases for the hosts running the HDFS NameNode and the KDC server.
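In this example, red0 hosts both the KDC and the HDFS NameNode, so a single alias suffices. If they run on different hosts, add one entry per host; the host gold0 and its IP address below are only an illustration:

```yaml
spec:
  hostAliases:
  - ip: "10.1.91.4"
    hostnames:
    - "red0"       # KDC server
  - ip: "10.1.90.9"
    hostnames:
    - "gold0"      # hypothetical host for the HDFS NameNode
```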

Option 2. Running the Spark driver outside Kubernetes

kubernetes/spark/env.sh

$ vi kubernetes/spark/env.sh 

CREATE_KEYTAB_SECRET=false

SPARK_KERBEROS_KEYTAB=/home/spark/mr3-run/kubernetes/spark/key/spark.keytab
SPARK_KERBEROS_PRINCIPAL=spark@RED
SPARK_KERBEROS_USER=spark
  • CREATE_KEYTAB_SECRET can be set to false because the keytab file is not used by DAGAppMaster and ContainerWorker Pods.
  • SPARK_KERBEROS_KEYTAB specifies the path to the keytab file on the node where the Spark driver runs. This is because the Spark driver directly reads the keytab file on the local file system.
  • The other environment variables are set in the same way as in Option 1.

kubernetes/spark/conf/mr3-site.xml

$ vi kubernetes/spark/conf/mr3-site.xml

<property>
  <name>mr3.k8s.host.aliases</name>
  <value>red0=10.1.91.4,gold0=10.1.90.9</value>
</property>
  • mr3.k8s.host.aliases should include a mapping for the KDC server (as well as the HDFS NameNode). In the example shown above, gold0 is the node where the Spark driver is to run.

Spark on MR3 on Hadoop

In order to run Spark on MR3 on Hadoop with Kerberos, the user should update spark-defaults.conf (such as conf/cluster/spark/spark-defaults.conf).

$ vi conf/cluster/spark/spark-defaults.conf

spark.kerberos.principal=spark@RED
spark.kerberos.keytab=/home/spark/spark.keytab
spark.hadoop.dfs.namenode.kerberos.principal=nn/red0@RED
  • spark.kerberos.principal specifies the principal name.
  • spark.kerberos.keytab specifies the path to the keytab file.
  • spark.hadoop.dfs.namenode.kerberos.principal specifies the principal for the HDFS NameNode.
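Before submitting a job, the principal and keytab can be tested with kinit to make sure the KDC accepts them. This obtains a ticket-granting ticket from the keytab instead of prompting for a password:

```shell
# Obtain a TGT non-interactively using the keytab, then display the ticket cache.
$ kinit -kt /home/spark/spark.keytab spark@RED
$ klist
```

If kinit fails here, Spark on MR3 will not be able to authenticate either, so this isolates Kerberos problems from Spark configuration problems.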

Note that the environment variables in env.sh (such as SECURE_MODE, USER_PRINCIPAL, USER_KEYTAB) are irrelevant because token renewal is managed by the Spark driver. Hence even TOKEN_RENEWAL_HDFS_ENABLED can be set to false in env.sh.