Since it is agnostic to the type of data sources, Hive on MR3 can access multiple data sources simultaneously (e.g. by joining tables from two separate Hadoop clusters). The only restriction is that it must use a single KDC and a single KMS, if it uses them at all. Below we illustrate how to use a nonsecure HDFS as another remote data source in addition to an existing secure HDFS, as depicted in the following diagram. We assume that the secure HDFS runs on red0 and the nonsecure HDFS runs on gold0.

(Diagram: hive.k8s.nonsecure.hdfs)

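As the first step, we allow Hive to read from both the secure HDFS and the nonsecure HDFS by setting the configuration key ipc.client.fallback-to-simple-auth-allowed to true in mr3-run/kubernetes/conf/core-site.xml. With this setting, a Kerberos-enabled client falls back to simple authentication when it connects to a server that does not use Kerberos.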

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>

<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>

Access to the nonsecure HDFS usually fails because of impersonation issues. Assuming that HIVE_SERVER2_KERBEROS_PRINCIPAL is set to hive/red0@RED in mr3-run/kubernetes/env.sh, creating an external table from the nonsecure HDFS may generate error messages such as the following.

2019-07-23 14:33:49,950 INFO  ipc.Server (Server.java:authorizeConnection(2235)) - Connection from 10.1.91.38:57090 for protocol org.apache.hadoop.hdfs.protocol.ClientProtocol is unauthorized for user gitlab-runner (auth:PROXY) via hive/red0@RED (auth:SIMPLE)

2019-07-23 14:33:49,951 INFO  ipc.Server (Server.java:doRead(1006)) - Socket Reader #1 for port 8020: readAndProcess from client 10.1.91.38 threw exception [org.apache.hadoop.security.authorize.AuthorizationException: User: hive/red0@RED is not allowed to impersonate gitlab-runner]

Here an ordinary user gitlab-runner runs Beeline and tries to create an external table from a directory on the HDFS running on gold0. As the error messages indicate, the NameNode on gold0 must allow hive/red0@RED to impersonate gitlab-runner. This requires two changes in core-site.xml on gold0:

  • The configuration key hadoop.proxyuser.hive.users should be set to * or gitlab-runner so that hive can impersonate gitlab-runner on gold0 (where the nonsecure HDFS runs).
  • The configuration key hadoop.security.auth_to_local should be set so that user hive/red0@RED can be mapped to user hive in auth_to_local rules, as shown in the following example.
    RULE:[2:$1@$0](hive@RED)s/.*/hive/
    RULE:[1:$1@$0](hive@RED)s/.*/hive/
    DEFAULT
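
The two changes above can be sketched as the following fragment of core-site.xml on gold0. It is a minimal illustration that reuses only the user names and rules from this page. The commented-out hadoop.proxyuser.hive.hosts entry is an assumption that does not come from the text above: Hadoop's proxy-user check usually also consults the hosts allowed for the impersonating user, so it may need to be set if it is not already configured on gold0.

<property>
  <name>hadoop.proxyuser.hive.users</name>
  <!-- use * to let hive impersonate any user -->
  <value>gitlab-runner</value>
</property>

<!-- Assumption (not stated above): restrict or open the hosts from which hive may impersonate.
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
-->

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[2:$1@$0](hive@RED)s/.*/hive/
    RULE:[1:$1@$0](hive@RED)s/.*/hive/
    DEFAULT
  </value>
</property>

The first rule matches two-component principals such as hive/red0@RED, which it rewrites to hive@RED before mapping to the short name hive. The second rule handles the one-component principal hive@RED in the same way.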
    

Then the user (or the administrator of gold0) should restart the NameNode, after which the impersonation issue should disappear. Note that the impersonation issue arises from accessing the nonsecure HDFS and has nothing to do with the value of the configuration key hive.server2.enable.doAs in kubernetes/conf/hive-site.xml. That is, even if hive.server2.enable.doAs is set to false, the user may still see the impersonation issue.

Now a user with proper permissions can create an external table from the nonsecure HDFS.

0: jdbc:hive2://10.1.91.41:9852/> create external table call_center_gold(
. . . . . . . . . . . . . . . . >       cc_call_center_sk         bigint
...
. . . . . . . . . . . . . . . . > ,     cc_tax_percentage         double
. . . . . . . . . . . . . . . . > )
. . . . . . . . . . . . . . . . > stored as orc
. . . . . . . . . . . . . . . . > location 'hdfs://gold0:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/call_center';
...
INFO  : OK
No rows affected (0.256 seconds)