To run Spark on MR3 with Kerberos, the user needs a principal name and a keytab file containing valid Kerberos credentials. Since token renewal is managed by the Spark driver, the keytab file need not be distributed to DAGAppMaster and ContainerWorker Pods. This makes Spark on MR3 simpler to run with Kerberos than Hive on MR3.

Spark on MR3 on Kubernetes

In our example, we make the following assumptions:

  • The principal name is spark@RED.
  • The keytab file is spark.keytab.
  • The KDC server runs on a host red0 with IP address 10.1.91.4.
  • Spark on MR3 accesses an HDFS file system hdfs://red0:8020.

Before running Spark on MR3 on Kubernetes with Kerberos, copy the keytab file spark.keytab to the directory kubernetes/spark/key and the Kerberos configuration file krb5.conf to the directory kubernetes/spark/conf.

$ ls kubernetes/spark/key
spark.keytab

$ ls kubernetes/spark/conf/krb5.conf
kubernetes/spark/conf/krb5.conf
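Optionally, verify that the keytab file contains the expected principal before proceeding. This is only a sanity check; the paths and principal follow the example above:

```shell
# List the principals stored in the keytab (expects spark@RED).
klist -k -t kubernetes/spark/key/spark.keytab

# Optionally obtain a ticket to confirm the keytab is valid against the KDC.
kinit -kt kubernetes/spark/key/spark.keytab spark@RED
klist
```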

Then update the file kubernetes/spark/conf/core-site.xml.

$ vi kubernetes/spark/conf/core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
  • fs.defaultFS should be set to file:///, not to an HDFS address like hdfs://red0:8020, because from the viewpoint of Spark on MR3 running inside a Kubernetes cluster, the default file system is the local file system.
  • hadoop.security.authentication should be set to kerberos to enable Kerberos for authentication.
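The krb5.conf file copied to kubernetes/spark/conf should point to the KDC server. A minimal sketch for the example realm RED is shown below; the exact contents depend on your Kerberos setup.

```
[libdefaults]
  default_realm = RED

[realms]
  RED = {
    kdc = red0
    admin_server = red0
  }
```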

The remaining configuration files can be updated as explained below. After updating all the configuration files, see Running Spark on MR3 on Kubernetes.

Option 1. Running the Spark driver inside Kubernetes

kubernetes/spark/env.sh

$ vi kubernetes/spark/env.sh 

CREATE_KEYTAB_SECRET=true
  • CREATE_KEYTAB_SECRET should be set to true so that a Secret is created from the files in the directory kubernetes/spark/key.

kubernetes/spark/conf/spark-defaults.conf

$ vi kubernetes/spark/conf/spark-defaults.conf

spark.kerberos.principal=spark@RED
spark.kerberos.keytab=/opt/mr3-run/key/spark.keytab
spark.hadoop.yarn.resourcemanager.principal=spark
spark.kerberos.access.hadoopFileSystems=hdfs://red0:8020
  • spark.kerberos.principal specifies the principal name.
  • spark.kerberos.keytab specifies the path to the keytab file inside the Spark driver Pod. Since the Secret created from kubernetes/spark/key is mounted at /opt/mr3-run/key, the directory should be set to /opt/mr3-run/key/.
  • spark.hadoop.yarn.resourcemanager.principal specifies the primary of the principal (such as spark in spark@RED).
  • spark.kerberos.access.hadoopFileSystems lists HDFS file systems that Spark on MR3 accesses.

mr3-site.xml

$ vi kubernetes/spark/conf/mr3-site.xml

<property>
  <name>mr3.k8s.host.aliases</name>
  <value>red0=10.1.91.4</value>
</property>
  • mr3.k8s.host.aliases should include a mapping for the KDC server (as well as the HDFS NameNode).

spark-yaml/spark-submit.yaml

spec:
  hostAliases:
  - ip: "10.1.91.4"
    hostnames:
    - "red0"

  containers:
  - args: [
      "--driver-java-options -Djava.security.krb5.conf=/opt/mr3-run/conf/krb5.conf",
  • The spec/hostAliases field should include aliases for the hosts running the HDFS NameNode and the KDC server.
  • The spec/containers/args field should include an argument "--driver-java-options -Djava.security.krb5.conf=/opt/mr3-run/conf/krb5.conf".
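After the driver Pod starts, one way to confirm that the host aliases took effect is to resolve red0 from inside the Pod. The Pod name below is a placeholder; use the actual name of the Spark driver Pod.

```shell
# Replace spark-driver with the actual name of the Spark driver Pod;
# the command should print 10.1.91.4 for red0 (added via hostAliases).
kubectl exec spark-driver -- getent hosts red0
```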

Option 2. Running the Spark driver outside Kubernetes

kubernetes/spark/env.sh

$ vi kubernetes/spark/env.sh 

CREATE_KEYTAB_SECRET=false
  • CREATE_KEYTAB_SECRET can be set to false because the keytab file is not used by DAGAppMaster and ContainerWorker Pods.

kubernetes/spark/conf/spark-defaults.conf

$ vi kubernetes/spark/conf/spark-defaults.conf

spark.kerberos.principal=spark@RED
spark.kerberos.keytab=/home/spark/mr3-run/kubernetes/spark/key/spark.keytab
spark.hadoop.yarn.resourcemanager.principal=spark
spark.kerberos.access.hadoopFileSystems=hdfs://red0:8020
  • spark.kerberos.keytab specifies the path to the keytab file on the node where the Spark driver runs. This is because the Spark driver directly reads the keytab file on the local file system.
  • The other configuration keys are set in the same way as in Option 1.
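Before launching the Spark driver, it may help to check on the driver node (gold0 in this example) that the keytab file is readable and that the KDC host resolves. The paths follow the example above:

```shell
# On the driver node (gold0 in this example).
ls -l /home/spark/mr3-run/kubernetes/spark/key/spark.keytab
getent hosts red0    # should resolve to 10.1.91.4
```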

mr3-site.xml

$ vi kubernetes/spark/conf/mr3-site.xml

<property>
  <name>mr3.k8s.host.aliases</name>
  <value>red0=10.1.91.4,gold0=10.1.90.9</value>
</property>
  • mr3.k8s.host.aliases should include a mapping for the KDC server (as well as the HDFS NameNode). In the example shown above, gold0 is the node where the Spark driver is to run.

Spark on MR3 on Hadoop

In order to run Spark on MR3 on Hadoop with Kerberos, the user should update spark-defaults.conf (such as conf/cluster/spark/spark-defaults.conf).

$ vi conf/cluster/spark/spark-defaults.conf

spark.kerberos.principal=spark@RED
spark.kerberos.keytab=/home/spark/spark.keytab
spark.hadoop.dfs.namenode.kerberos.principal=nn/red0@RED
  • spark.kerberos.principal specifies the principal name.
  • spark.kerberos.keytab specifies the path to the keytab file.
  • spark.hadoop.dfs.namenode.kerberos.principal specifies the principal for the HDFS NameNode.
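The NameNode principal configured above can be cross-checked against the cluster's Hadoop configuration, assuming the hdfs CLI is available on the node:

```shell
# Print the NameNode principal from the cluster configuration;
# it should match the value of spark.hadoop.dfs.namenode.kerberos.principal.
hdfs getconf -confKey dfs.namenode.kerberos.principal
```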

Note that the environment variables in env.sh (such as SECURE_MODE, USER_PRINCIPAL, USER_KEYTAB) are irrelevant because token renewal is managed by the Spark driver. Hence even TOKEN_RENEWAL_HDFS_ENABLED can be set to false in env.sh.
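For example, env.sh may leave these variables disabled. This is only a sketch based on the variable names mentioned above; since they are irrelevant here, their values do not affect Spark on MR3.

```shell
# Token renewal is handled by the Spark driver, so these can stay disabled.
SECURE_MODE=false
TOKEN_RENEWAL_HDFS_ENABLED=false
```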