Python: Load Data from Hive


To load data from Hive in Python, there are several approaches:

  1. Use PySpark with Hive enabled to directly load data from Hive databases using Spark SQL: Read Data from Hive in Spark 1.x and 2.x.
  2. Use ODBC or JDBC Hive drivers. Cloudera has implemented ODBC drivers for Hive and Impala. 
  3. Native Python libraries. 

We will focus on the third approach in this article - using native Python libraries. Commonly used native libraries include Cloudera's impyla and Dropbox's PyHive. The latter is used in the following examples. The sample code runs on a UNIX-like system.

Install PyHive

1) Install SASL packages

First, ensure SASL (Simple Authentication and Security Layer) is enabled on your system. Different systems require different packages to be installed to enable SASL support.

The following is an example to enable it on Ubuntu:

sudo apt-get install libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit

2) Install PyHive package

Run the following command to install the PyHive package (use pip3 when pip is tied to Python 2):

pip3 install 'pyhive[hive]'

Or, if pip already points to your Python 3 installation:

pip install 'pyhive[hive]'

If the installation succeeds, you should see output similar to the following:

Successfully built sasl
Installing collected packages: future, six, python-dateutil, sasl, thrift, thrift-sasl, pyhive
Successfully installed future-0.18.2 pyhive-0.6.3 python-dateutil-2.8.1 sasl-0.2.1 six-1.15.0 thrift-0.13.0 thrift-sasl-0.4.2
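To quickly confirm the package imports correctly, you can run a short check. This is a minimal sketch; it only verifies that the pyhive module is importable, not that a Hive server is reachable:

```python
def pyhive_available() -> bool:
    """Return True if the pyhive package can be imported."""
    try:
        import pyhive  # noqa: F401
        return True
    except ImportError:
        return False

print('pyhive OK' if pyhive_available() else 'pyhive is not installed')
```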

Read data from Hive via DB-API

The following code snippet reads data from a test table (test_db.test_parquet) in Hive. The table contains three sample rows, each holding a category name, a count, and a description.

HiveServer2 thrift endpoint:

  • Server: localhost
  • Port: 10000
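Before connecting, it can help to sanity-check that the HiveServer2 thrift endpoint is reachable. The sketch below uses only the standard library and simply tests whether a TCP connection to the host and port listed above succeeds:

```python
import socket

def hs2_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the HiveServer2 endpoint from the list above.
print(hs2_reachable('localhost', 10000))
```

A successful TCP connection does not prove HiveServer2 is healthy, only that something is listening on the port.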

Code snippet

from pyhive import hive

# Connect to HiveServer2 via thrift (default port 10000)
cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')

# Fetch the first row, then all remaining rows
print(cursor.fetchone())
print(cursor.fetchall())


Sample output (the first row as a single tuple, the remaining rows as a list of tuples):

$ python 
('Category A', 100, 'This is category A')
[('Category B', 120, 'This is category B'), ('Category C', 150, 'This is category C')]
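Rows come back as plain tuples. Per the DB-API specification, cursor.description exposes column metadata (the first element of each entry is the column name), so results can be mapped to dicts. The helper below is a small sketch, not part of PyHive; the column names in the example are hypothetical:

```python
def rows_to_dicts(description, rows):
    """Map DB-API result tuples to dicts keyed by column name.

    description: cursor.description, a sequence of column tuples whose
    first element is the column name (per the DB-API spec).
    """
    names = [col[0] for col in description]
    return [dict(zip(names, row)) for row in rows]

# Example with the sample table's shape (column names are assumptions):
desc = [('category', 'STRING', None, None, None, None, True),
        ('num', 'INT', None, None, None, None, True),
        ('note', 'STRING', None, None, None, None, True)]
rows = [('Category A', 100, 'This is category A')]
print(rows_to_dicts(desc, rows))
# [{'category': 'Category A', 'num': 100, 'note': 'This is category A'}]
```

With a live cursor you would call rows_to_dicts(cursor.description, cursor.fetchall()).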

Kerberos authentication

Warning: ensure a valid keytab or Kerberos authentication ticket is available for the running user.

To use Kerberos authentication, establish the connection with KERBEROS as the authentication mechanism:

from pyhive import hive

# Connect using Kerberos (GSSAPI); kerberos_service_name must match
# the service principal of HiveServer2 (commonly 'hive')
conn = hive.Connection(host='localhost',
                       port=10000,
                       auth='KERBEROS',
                       kerberos_service_name='hive')
cursor = conn.cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')
print(cursor.fetchall())
Last modified by Raymond 7 days ago.