The user can access Hive on MR3 with a Python client, either impyla or PyHive. We recommend impyla if HiveServer2 runs with SSL enabled, and PyHive if HiveServer2 runs with SSL disabled. Below we demonstrate the use of impyla and PyHive by creating Docker images. We assume that HiveServer2 uses Kerberos-based authentication, but the Docker images work even when Kerberos-based authentication is not used. To use Python clients, the configuration key hive.server2.transport.mode should be set to binary in kubernetes/conf/hive-site.xml.
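For reference, setting the transport mode in kubernetes/conf/hive-site.xml takes the usual Hadoop-style property form:

```xml
<property>
  <name>hive.server2.transport.mode</name>
  <value>binary</value>
</property>
```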

Using impyla when HiveServer2 runs with SSL enabled

Create a directory and copy into it krb5.conf, which contains the Kerberos configuration. Then create a Docker image impyla using the following Dockerfile.

$ vi Dockerfile

FROM python:2

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit krb5-user
RUN pip install pandas 'six==1.12.0' 'bit_array==0.1.0' 'thrift==0.9.3' 'thrift_sasl==0.2.1' 'sasl==0.2.1' 'impyla==0.13.8'
# RUN pip install pandas 'six==1.14.0' 'bit_array==0.1.0' 'thrift==0.9.3' 'thrift_sasl==0.2.1' 'sasl==0.2.1' 'impyla==0.16.2'
COPY krb5.conf /etc/krb5.conf
RUN mkdir -p /opt/impyla

WORKDIR /opt/impyla
CMD ["python", "/opt/impyla/run.py"]

$ docker build -t impyla .

Create a working directory, e.g., /home/hive/impyla, and copy three files:

  • Kerberos keytab file for the user, e.g., hive.keytab
  • Certificate file mr3-ssl.pem created when enabling SSL
  • Python file run.py for communicating with HiveServer2

The user can execute run.py as follows:

$ ls /home/hive/impyla/
hive.keytab  mr3-ssl.pem  run.py
$ docker run -it --rm -v /home/hive/impyla:/opt/impyla impyla

Here is an example of run.py which communicates with HiveServer2 running at the address 192.168.10.1:9852.

  • We obtain a Kerberos ticket for the user principal hive@PL by executing kinit.
  • Since SSL is enabled, we should connect to HiveServer2 using a host name instead of an IP address. The host name must match the Common Name for the KeyStore file created by kubernetes/generate-hivemr3-ssl.sh, which in turn must match the host name in the service principal specified by HIVE_SERVER2_KERBEROS_PRINCIPAL in kubernetes/env.sh. In the following example, we use the host name gold7 which matches the Common Name ("CN=gold7") and the host name in the service principal (root/gold7@PL).
    $ grep CN kubernetes/generate-hivemr3-ssl.sh
        keytool -genkeypair -alias ssl-private-key -keyalg RSA -dname "CN=gold7" -keypass $PASSWORD \
    $ grep HIVE_SERVER2_KERBEROS_PRINCIPAL kubernetes/env.sh
    HIVE_SERVER2_KERBEROS_PRINCIPAL=root/gold7@PL
    
  • After choosing the host name, we add a new entry to /etc/hosts so that it resolves to the address of HiveServer2.
  • The parameter kerberos_service_name should match the user part of the service principal (root in root/gold7@PL).
import os
import pandas
from impala.dbapi import connect
from impala.util import as_pandas

os.system("kinit -kt /opt/impyla/hive.keytab hive@PL")
os.system("echo '192.168.10.1    gold7' >> /etc/hosts")

# for parameters, see https://github.com/cloudera/impyla/blob/master/impala/dbapi.py
conn = connect(host='gold7',
  port=9852,
  database='tpcds_bin_partitioned_orc_1000',
  auth_mechanism='GSSAPI',
  use_ssl=True,
  ca_cert='/opt/impyla/mr3-ssl.pem',
  kerberos_service_name='root')

cursor = conn.cursor()
cursor.execute("show tables")  # execute() returns None, so do not reassign cursor
tables = as_pandas(cursor)
pandas.set_option("display.max_rows", None, "display.max_columns", 32)
print(tables)
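As noted above, kerberos_service_name must match the user part of the service principal. A minimal sketch of extracting the parts of a principal string such as root/gold7@PL (service_name_from_principal is a hypothetical helper, not part of impyla):

```python
def service_name_from_principal(principal):
    """Split a Kerberos service principal of the form user/host@REALM.

    For example, root/gold7@PL yields ('root', 'gold7', 'PL');
    the first component is what kerberos_service_name should be.
    """
    user_host, _, realm = principal.partition('@')
    user, _, host = user_host.partition('/')
    return user, host, realm

# For the principal in HIVE_SERVER2_KERBEROS_PRINCIPAL above:
user, host, realm = service_name_from_principal('root/gold7@PL')
# user == 'root' is the value to pass as kerberos_service_name
```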

Using PyHive when HiveServer2 runs with SSL disabled

Create a directory and copy into it krb5.conf, which contains the Kerberos configuration. Then create a Docker image pyhive using the following Dockerfile.

$ vi Dockerfile

FROM python:2

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit krb5-user
RUN pip install pandas 'thrift==0.10.0' 'thrift_sasl==0.3.0' 'sasl==0.2.1' 'pyhive==0.6.3'
COPY krb5.conf /etc/krb5.conf
RUN mkdir -p /opt/pyhive

WORKDIR /opt/pyhive

CMD ["python", "/opt/pyhive/run.py"]

$ docker build -t pyhive .

Create a working directory, e.g., /home/hive/pyhive, and copy two files:

  • Kerberos keytab file for the user, e.g., hive.keytab
  • Python file run.py for communicating with HiveServer2

The user can execute run.py as follows:

$ ls /home/hive/pyhive/
hive.keytab  run.py
$ docker run -it --rm -v /home/hive/pyhive:/opt/pyhive pyhive

Here is an example of run.py which communicates with HiveServer2 running at the address 192.168.10.1:9852.

  • We obtain a Kerberos ticket for the user principal hive@PL by executing kinit.
  • The host name must match the host name in the service principal specified by HIVE_SERVER2_KERBEROS_PRINCIPAL in kubernetes/env.sh. In the following example, we use the host name gold7 which matches the host name in the service principal (root/gold7@PL).
    $ grep HIVE_SERVER2_KERBEROS_PRINCIPAL kubernetes/env.sh
    HIVE_SERVER2_KERBEROS_PRINCIPAL=root/gold7@PL
    
  • After choosing the host name, we add a new entry to /etc/hosts so that it resolves to the address of HiveServer2.
  • The parameter kerberos_service_name should match the user part of the service principal (root in root/gold7@PL).
import os
import pandas
from pyhive import hive

os.system("kinit -kt /opt/pyhive/hive.keytab hive@PL")
os.system("echo '192.168.10.1    gold7' >> /etc/hosts")

# for parameters, see https://github.com/dropbox/PyHive/blob/master/pyhive/hive.py
conn = hive.connect(host='gold7',
  port=9852,
  database='tpcds_bin_partitioned_orc_1000',
  auth='KERBEROS',
  kerberos_service_name='root')

cursor = conn.cursor()
cursor.execute("show tables")  # execute() returns None, so do not reassign cursor
result = cursor.fetchall()
print(result)
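Unlike impyla with as_pandas, PyHive's fetchall() returns plain tuples. A minimal sketch of pairing each row with its column names, using the standard DB-API cursor.description layout (rows_to_dicts is a hypothetical helper, not part of PyHive):

```python
def rows_to_dicts(description, rows):
    """Pair each row tuple with column names.

    description follows the DB-API convention: one 7-item sequence
    per column, whose first item is the column name.
    """
    names = [col[0] for col in description]
    return [dict(zip(names, row)) for row in rows]

# For example, after cursor.execute("show tables"):
#   result = rows_to_dicts(cursor.description, cursor.fetchall())
```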