The user can access Hive on MR3 using a Python client such as impyla or PyHive.
We recommend impyla if HiveServer2 runs with SSL enabled,
and PyHive if HiveServer2 runs with SSL disabled.
Below we demonstrate the use of impyla and PyHive by creating Docker images.
We assume that HiveServer2 uses Kerberos-based authentication,
but the Docker images also work when it does not.
To use these Python clients, the configuration key hive.server2.transport.mode
should be set to binary in kubernetes/conf/hive-site.xml.
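For reference, the corresponding property in kubernetes/conf/hive-site.xml would look like the following fragment (the rest of the file is unchanged):

```xml
<property>
  <name>hive.server2.transport.mode</name>
  <value>binary</value>
</property>
```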
Using impyla when HiveServer2 runs with SSL enabled
Create a directory and copy into it krb5.conf,
which contains the Kerberos configuration.
Then create a Docker image impyla
using the following Dockerfile.
$ vi Dockerfile
FROM python:2
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit krb5-user
RUN pip install pandas 'six==1.12.0' 'bit_array==0.1.0' 'thrift==0.9.3' 'thrift_sasl==0.2.1' 'sasl==0.2.1' 'impyla==0.13.8'
# RUN pip install pandas 'six==1.14.0' 'bit_array==0.1.0' 'thrift==0.9.3' 'thrift_sasl==0.2.1' 'sasl==0.2.1' 'impyla==0.16.2'
COPY krb5.conf /etc/krb5.conf
RUN mkdir -p /opt/impyla
WORKDIR /opt/impyla
CMD ["python", "/opt/impyla/run.py"]
$ docker build -t impyla .
Create a working directory, e.g., /home/hive/impyla, and copy three files into it:
- Kerberos keytab file for the user, e.g., hive.keytab
- Certificate file mr3-ssl.pem created when enabling SSL
- Python file run.py for communicating with HiveServer2
The user can execute run.py as follows:
$ ls /home/hive/impyla/
hive.keytab mr3-ssl.pem run.py
$ docker run -it --rm -v /home/hive/impyla:/opt/impyla impyla
Here is an example of run.py which communicates with HiveServer2 running at the address 192.168.10.1:9852.
- We obtain a Kerberos ticket for the principal hive@PL for the user by executing kinit.
- Since SSL is enabled, we should connect to HiveServer2 using a host name instead of an IP address.
The host name must match the Common Name for the KeyStore file created by
kubernetes/generate-hivemr3-ssl.sh,
which in turn must match the host name in the service principal specified by HIVE_SERVER2_KERBEROS_PRINCIPAL
in kubernetes/env.sh.
In the following example, we use the host name gold7,
which matches the Common Name ("CN=gold7")
and the host name in the service principal (root/gold7@PL).

$ grep CN kubernetes/generate-hivemr3-ssl.sh
  keytool -genkeypair -alias ssl-private-key -keyalg RSA -dname "CN=gold7" -keypass $PASSWORD \
$ grep HIVE_SERVER2_KERBEROS_PRINCIPAL kubernetes/env.sh
HIVE_SERVER2_KERBEROS_PRINCIPAL=root/gold7@PL

- Once we obtain the host name, we add a new entry in /etc/hosts.
- The parameter kerberos_service_name should match the user part of the service principal name (e.g., root in root/gold7@PL).
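The relationship between the service principal and the connection parameters can be summarized in a small sketch (parse_service_principal is a hypothetical helper for illustration, not part of impyla or PyHive): for a principal of the form user/host@REALM, kerberos_service_name should be the user part and the connection host should be the host part.

```python
# Hypothetical helper for illustration; not part of impyla or PyHive.
# Splits a Kerberos service principal "user/host@REALM" into its parts.
def parse_service_principal(principal):
    user_host, realm = principal.split("@", 1)
    user, host = user_host.split("/", 1)
    return user, host, realm

# For HIVE_SERVER2_KERBEROS_PRINCIPAL=root/gold7@PL:
user, host, realm = parse_service_principal("root/gold7@PL")
print(user)   # the kerberos_service_name parameter for connect()
print(host)   # the host name to connect to (and to add to /etc/hosts)
```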
import os
import pandas
from impala.dbapi import connect
from impala.util import as_pandas
# obtain a Kerberos ticket for the principal hive@PL
os.system("kinit -kt /opt/impyla/hive.keytab hive@PL")
# map the host name gold7 to the address of HiveServer2
os.system("echo '192.168.10.1 gold7' >> /etc/hosts")
# for parameters, see https://github.com/cloudera/impyla/blob/master/impala/dbapi.py
conn = connect(host='gold7',
               port=9852,
               database='tpcds_bin_partitioned_orc_1000',
               auth_mechanism='GSSAPI',
               use_ssl=True,
               ca_cert='/opt/impyla/mr3-ssl.pem',
               kerberos_service_name='root')
cursor = conn.cursor()
# note: execute() returns None, so do not rebind cursor to its return value
cursor.execute("show tables")
tables = as_pandas(cursor)
pandas.set_option("display.max_rows", None, "display.max_columns", 32)
print(tables)
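The call to as_pandas converts the cursor's result set into a pandas DataFrame, taking column names from cursor.description. A rough sketch of the idea, using a stand-in cursor object in place of a live connection (FakeCursor and to_dataframe are illustrative, not library code):

```python
import pandas

# Stand-in for a DB-API cursor; a real impyla cursor exposes the same
# description/fetchall interface after execute() has run.
class FakeCursor:
    # each entry of description is a 7-tuple whose first element is the column name
    description = [("tab_name", "STRING", None, None, None, None, None)]
    def fetchall(self):
        return [("store_sales",), ("store_returns",)]

def to_dataframe(cursor):
    # same idea as impala.util.as_pandas: column names from cursor.description
    names = [col[0] for col in cursor.description]
    return pandas.DataFrame(cursor.fetchall(), columns=names)

df = to_dataframe(FakeCursor())
print(df)
```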
Using PyHive when HiveServer2 runs with SSL disabled
Create a directory and copy into it krb5.conf,
which contains the Kerberos configuration.
Then create a Docker image pyhive
using the following Dockerfile.
$ vi Dockerfile
FROM python:2
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit krb5-user
RUN pip install pandas 'thrift==0.10.0' 'thrift_sasl==0.3.0' 'sasl==0.2.1' 'pyhive==0.6.3'
COPY krb5.conf /etc/krb5.conf
RUN mkdir -p /opt/pyhive
WORKDIR /opt/pyhive
CMD ["python", "/opt/pyhive/run.py"]
$ docker build -t pyhive .
Create a working directory, e.g., /home/hive/pyhive, and copy two files into it:
- Kerberos keytab file for the user, e.g., hive.keytab
- Python file run.py for communicating with HiveServer2
The user can execute run.py as follows:
$ ls /home/hive/pyhive/
hive.keytab run.py
$ docker run -it --rm -v /home/hive/pyhive:/opt/pyhive pyhive
Here is an example of run.py which communicates with HiveServer2 running at the address 192.168.10.1:9852.
- We obtain a Kerberos ticket for the principal hive@PL for the user by executing kinit.
- The host name must match the host name in the service principal specified by HIVE_SERVER2_KERBEROS_PRINCIPAL
in kubernetes/env.sh.
In the following example, we use the host name gold7,
which matches the host name in the service principal (root/gold7@PL).

$ grep HIVE_SERVER2_KERBEROS_PRINCIPAL kubernetes/env.sh
HIVE_SERVER2_KERBEROS_PRINCIPAL=root/gold7@PL

- Once we obtain the host name, we add a new entry in /etc/hosts.
- The parameter kerberos_service_name should match the user part of the service principal name (e.g., root in root/gold7@PL).
import os
from pyhive import hive
# obtain a Kerberos ticket for the principal hive@PL
os.system("kinit -kt /opt/pyhive/hive.keytab hive@PL")
# map the host name gold7 to the address of HiveServer2
os.system("echo '192.168.10.1 gold7' >> /etc/hosts")
# for parameters, see https://github.com/dropbox/PyHive/blob/master/pyhive/hive.py
conn = hive.connect(host='gold7',
                    port=9852,
                    database='tpcds_bin_partitioned_orc_1000',
                    auth='KERBEROS',
                    kerberos_service_name='root')
cursor = conn.cursor()
# note: execute() returns None, so do not rebind cursor to its return value
cursor.execute("show tables")
result = cursor.fetchall()
print(result)
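Both impyla and PyHive follow the Python DB-API interface, so the same connect/cursor/execute/fetchall pattern applies regardless of which client is used. A minimal sketch of the pattern using sqlite3 (from the standard library) as a stand-in for a live HiveServer2 connection:

```python
import sqlite3

# sqlite3.connect() stands in for hive.connect() or impala.dbapi.connect();
# the cursor usage that follows is the common DB-API pattern.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE tables (tab_name TEXT)")
cursor.execute("INSERT INTO tables VALUES ('store_sales'), ('web_sales')")
# sqlite3's execute() happens to return the cursor, but impyla and PyHive
# return None, so portable code never rebinds the cursor to the return value
cursor.execute("SELECT tab_name FROM tables")
result = cursor.fetchall()
print(result)
conn.close()
```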