Python: Load Data from Hive
To load data from Hive in Python, there are several approaches:
- Use PySpark with Hive support enabled to load data from Hive databases directly via Spark SQL (see Read Data from Hive in Spark 1.x and 2.x).
- Use ODBC or JDBC Hive drivers. Cloudera provides ODBC drivers for both Hive and Impala.
- Native Python libraries.
We will focus on the third approach in this article: using native Python libraries. Commonly used native libraries include Cloudera's impyla and Dropbox's PyHive. The latter is used in the following examples. The sample code runs on a UNIX-like system.
Install PyHive
1) Install SASL packages
First, ensure SASL (Simple Authentication and Security Layer) support is enabled on your system. Different systems require different packages to be installed to enable it.
The following is an example to enable it on Ubuntu:
sudo apt-get install libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit
2) Install PyHive package
Run the following command to install the PyHive package:
pip3 install 'pyhive[hive]'
or
pip install 'pyhive[hive]'
You should see output similar to the following:
Successfully built sasl
Installing collected packages: future, six, python-dateutil, sasl, thrift, thrift-sasl, pyhive
Successfully installed future-0.18.2 pyhive-0.6.3 python-dateutil-2.8.1 sasl-0.2.1 six-1.15.0 thrift-0.13.0 thrift-sasl-0.4.2
Read data from Hive via DB-API
The following code snippet reads data from a test table in Hive; the table's contents appear in the sample output below. The connection details are:
HiveServer2 thrift:
- Server: localhost
- Port: 10000
Code snippet
from pyhive import hive

cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')
print(cursor.fetchone())
print(cursor.fetchall())
Output:
$ python python-hive.py
('Category A', 100, 'This is category A')
[('Category B', 120, 'This is category B'), ('Category C', 150, 'This is category C')]
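DB-API cursors also expose column metadata through cursor.description, which can be paired with fetched rows to produce dictionaries instead of bare tuples. The helper below is a hypothetical convenience, not part of PyHive; it is shown here with sample data shaped like the output above rather than a live connection:

```python
def rows_to_dicts(description, rows):
    """Zip DB-API rows with the column names from cursor.description.

    Each entry in description is a tuple whose first element is the
    column name; the remaining elements are ignored here.
    """
    columns = [col[0] for col in description]
    return [dict(zip(columns, row)) for row in rows]


# Sample data shaped like the DB-API results above (the description
# tuples are illustrative, not copied from a real cursor).
description = [("category", None), ("quantity", None), ("description", None)]
rows = [("Category B", 120, "This is category B"),
        ("Category C", 150, "This is category C")]

print(rows_to_dicts(description, rows))
```

With a real cursor you would call rows_to_dicts(cursor.description, cursor.fetchall()) after executing the query.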
Kerberos authentication
To use Kerberos authentication, establish the connection with KERBEROS as the authentication mechanism:
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, auth="KERBEROS",
                       database="test_db", kerberos_service_name="hive")
cursor = conn.cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')
print(cursor.fetchone())
print(cursor.fetchall())
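If your scripts switch between plain and Kerberos-authenticated connections, it can help to assemble the keyword arguments in one place. The function below is a hypothetical helper, not part of the PyHive API; it only builds the arguments for hive.Connection, and for the Kerberos case a valid ticket (e.g. obtained via kinit) must already exist in the credential cache:

```python
def hive_conn_kwargs(host, port=10000, database="default",
                     use_kerberos=False, service_name="hive"):
    """Build keyword arguments for pyhive.hive.Connection.

    When use_kerberos is True, selects the KERBEROS auth mechanism and
    passes the Kerberos service principal name of HiveServer2.
    """
    kwargs = {"host": host, "port": port, "database": database}
    if use_kerberos:
        kwargs["auth"] = "KERBEROS"
        kwargs["kerberos_service_name"] = service_name
    return kwargs


# Matches the Connection call in the snippet above:
print(hive_conn_kwargs("localhost", database="test_db", use_kerberos=True))
```

With PyHive available, the result would be used as hive.Connection(**hive_conn_kwargs(...)).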