Python: Load Data from Hive

Raymond Raymond event 2021-01-06 visibility 5,368
more_vert

To load data from Hive in Python, there are several approaches:

  1. Use PySpark with Hive enabled to directly load data from Hive databases using Spark SQL: Read Data from Hive in Spark 1.x and 2.x.
  2. Use ODBC or JDBC Hive drivers. Cloudera has implemented ODBC drivers for Hive and Impala. 
  3. Native Python libraries. 

We will focus on the third approach in this article - using native Python libraries. The commonly used native libraries include Cloudera impyla and dropbox PyHive. The later will be used in the following examples. The sample code runs in a UNIX-alike system. 

Install PyHive

1) Install SASL packages

First, ensure SASL (Simple Authentication Security Layer) is enabled in your system. Different systems require different packages to be installed to enable SASL support. 

The following is an example to enable it on Ubuntu:

sudo apt-get install libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit

2) Install PyHive package

Run the following command to install PyHive package:

pip3 install 'pyhive[hive]'

or

pip install 'pyhive[hive]'

Ensure you can view the following output:

Successfully built sasl
Installing collected packages: future, six, python-dateutil, sasl, thrift, thrift-sasl, pyhive
Successfully installed future-0.18.2 pyhive-0.6.3 python-dateutil-2.8.1 sasl-0.2.1 six-1.15.0 thrift-0.13.0 thrift-sasl-0.4.2

Read data from Hive via DB-API

The following code snippet will read data from a test table in Hive. The content of the table looks like the following screenshot:

20210108105008-image.png

HiveServer2 thrift: 

  • Server: localhost
  • Port: 10000

Code snippet

from pyhive import hive

cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')
print(cursor.fetchone())
print(cursor.fetchall())

Output:

$ python python-hive.py 
('Category A', 100, 'This is category A')
[('Category B', 120, 'This is category B'), ('Category C', 150, 'This is category C')]

Kerberos authentication

warning Ensure valid ktab or Kerberos authentication ticket is available for the running user.

To use Kerberos authentication, establish the connection using KERBEROS as authentication mechanism. 

from pyhive import hive

#cursor = hive.connect('localhost').cursor()
conn = hive.Connection(host="localhost",
                       port=10000,
                       auth="KERBEROS",
                       database="test_db",
                       kerberos_service_name="hive")
cursor = conn.cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')
print(cursor.fetchone())
print(cursor.fetchall())
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts