access_time 7 months ago languageEnglish
more_vert

Python: Load Data from Hive

visibility 199 comment 0

To load data from Hive in Python, there are several approaches:

  1. Use PySpark with Hive enabled to directly load data from Hive databases using Spark SQL: Read Data from Hive in Spark 1.x and 2.x.
  2. Use ODBC or JDBC Hive drivers. Cloudera has implemented ODBC drivers for Hive and Impala. 
  3. Native Python libraries. 

We will focus on the third approach in this article - using native Python libraries. The commonly used native libraries include Cloudera impyla and dropbox PyHive. The later will be used in the following examples. The sample code runs in a UNIX-alike system. 

Install PyHive

1) Install SASL packages

First, ensure SASL (Simple Authentication Security Layer) is enabled in your system. Different systems require different packages to be installed to enable SASL support. 

The following is an example to enable it on Ubuntu:

sudo apt-get install libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit

2) Install PyHive package

Run the following command to install PyHive package:

pip3 install 'pyhive[hive]'

or

pip install 'pyhive[hive]'

Ensure you can view the following output:

Successfully built sasl
Installing collected packages: future, six, python-dateutil, sasl, thrift, thrift-sasl, pyhive
Successfully installed future-0.18.2 pyhive-0.6.3 python-dateutil-2.8.1 sasl-0.2.1 six-1.15.0 thrift-0.13.0 thrift-sasl-0.4.2

Read data from Hive via DB-API

The following code snippet will read data from a test table in Hive. The content of the table looks like the following screenshot:

HiveServer2 thrift: 

  • Server: localhost
  • Port: 10000

Code snippet

from pyhive import hive

cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')
print(cursor.fetchone())
print(cursor.fetchall())

Output:

$ python python-hive.py 
('Category A', 100, 'This is category A')
[('Category B', 120, 'This is category B'), ('Category C', 150, 'This is category C')]

Kerberos authentication

warning Ensure valid ktab or Kerberos authentication ticket is available for the running user.

To use Kerberos authentication, establish the connection using KERBEROS as authentication mechanism. 

from pyhive import hive

#cursor = hive.connect('localhost').cursor()
conn = hive.Connection(host="localhost",
                       port=10000,
                       auth="KERBEROS",
                       database="test_db",
                       kerberos_service_name="hive")
cursor = conn.cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')
print(cursor.fetchone())
print(cursor.fetchall())
info Last modified by Raymond 7 months ago copyright This page is subject to Site terms.
Like this article?
Share on

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts

Follow Kontext

Get our latest updates on LinkedIn.

Want to contribute on Kontext to help others?

Learn more