The user can create a Kubernetes cluster in which Hive on MR3 runs in conjunction with additional components for managing data security and user interface. A Kubernetes cluster created with Hive on MR3 consists of the following components:

  • Key components of Hive on MR3 – HiveServer2, DAGAppMaster, and ContainerWorkers
  • Apache Timeline Server for preserving history logs
  • Apache Ranger for managing data security

Hive on MR3 requires several external components which should be set up by the administrator user before creating a Kubernetes cluster. For the case of using secure (Kerberized) HDFS as a remote data source, the following external components are usually set up by the administrator user:

  • Secure HDFS
  • Kubernetes PersistentVolume for storing transient data such as history logs created by Timeline Server and results of running queries
  • Metastore along with its MySQL database
  • KDC (Key Distribution Center) for managing Kerberos tickets
  • KMS (Key Management Server) for managing impersonation and delegation tokens
  • MySQL database for Ranger

Note that a Kubernetes PersistentVolume stores only transient data that becomes obsolete once the Kubernetes cluster is destroyed. Thus it does not serve as a remote data source for storing Hive tables. If secure HDFS is part of a secure Hadoop cluster, the administrator user can reuse the existing KDC and KMS servers. As Ranger is an optional component, the administrator user may choose not to include it in the Kubernetes cluster. For the MySQL database for Ranger, it is okay to share the same MySQL database for Metastore.

The following diagram depicts a Kubernetes cluster created with Hive on MR3 along with external components for a typical production environment in which secure HDFS is part of an existing Hadoop cluster. Later we explain how to run Metastore as a Pod (if the MySQL database for Metastore is accessible), how to use non-secure HDFS instead of secure HDFS, and how to use Amazon S3 instead of HDFS.

hive.k8s.hdfs.overview

An ordinary user can run Beeline to submit queries to HiveServer2, and check the status of queries on the web browser via MR3-UI. The administrator user can connect to Ranger to manage data security. The administrator user can also log on directly to the HiveServer2 Pod in order to manage running DAGs (e.g., kill DAGs).

This section gives details on how to create a Kubernetes cluster based on the above setting and a few variations thereof. We assume some familiarity with key concepts in Kubernetes such as Pods, Containers, Volumes, ConfigMaps, and Secrets.