Python: Load Data from Hive
To load data from Hive in Python, there are several approaches:
- Use PySpark with Hive support enabled to load data from Hive databases directly via Spark SQL (see Read Data from Hive in Spark 1.x and 2.x).
- Use ODBC or JDBC Hive drivers. Cloudera provides ODBC drivers for both Hive and Impala.
- Native Python libraries.
We will focus on the third approach in this article: using native Python libraries. Commonly used native libraries include Cloudera's impyla and Dropbox's PyHive. The latter is used in the following examples. The sample code runs on a UNIX-like system.
Install PyHive
1) Install SASL packages
First, ensure SASL (Simple Authentication and Security Layer) support is enabled on your system. Different systems require different packages to be installed to enable it.
The following is an example to enable it on Ubuntu:
sudo apt-get install libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit
2) Install PyHive package
Run the following command to install the PyHive package:
pip3 install 'pyhive[hive]'
or
pip install 'pyhive[hive]'
You should see output similar to the following:
Successfully built sasl
Installing collected packages: future, six, python-dateutil, sasl, thrift, thrift-sasl, pyhive
Successfully installed future-0.18.2 pyhive-0.6.3 python-dateutil-2.8.1 sasl-0.2.1 six-1.15.0 thrift-0.13.0 thrift-sasl-0.4.2
Read data from Hive via DB-API
The following code snippet reads data from a test table in Hive; the table's contents appear in the sample output below. The connection details are:
HiveServer2 thrift:
- Server: localhost
- Port: 10000
Code snippet
from pyhive import hive

cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')
print(cursor.fetchone())
print(cursor.fetchall())
Output:
$ python python-hive.py
('Category A', 100, 'This is category A')
[('Category B', 120, 'This is category B'), ('Category C', 150, 'This is category C')]
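DB-API cursors also expose column metadata through cursor.description, which can be paired with fetched rows to produce dictionaries instead of bare tuples. The helper below is a hypothetical convenience, not part of PyHive; it is shown here with sample data shaped like the output above rather than a live connection:

```python
def rows_to_dicts(description, rows):
    """Zip DB-API rows with the column names from cursor.description.

    Each entry in description is a tuple whose first element is the
    column name; the remaining elements are ignored here.
    """
    columns = [col[0] for col in description]
    return [dict(zip(columns, row)) for row in rows]


# Sample data shaped like the DB-API results above (the description
# tuples are illustrative, not copied from a real cursor).
description = [("category", None), ("quantity", None), ("description", None)]
rows = [("Category B", 120, "This is category B"),
        ("Category C", 150, "This is category C")]

print(rows_to_dicts(description, rows))
```

With a real cursor you would call rows_to_dicts(cursor.description, cursor.fetchall()) after executing the query.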
Kerberos authentication
To use Kerberos authentication, establish the connection with KERBEROS as the authentication mechanism:
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, auth="KERBEROS",
                       database="test_db", kerberos_service_name="hive")
cursor = conn.cursor()
cursor.execute('SELECT * FROM test_db.test_parquet')
print(cursor.fetchone())
print(cursor.fetchall())
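If your scripts switch between plain and Kerberos-authenticated connections, it can help to assemble the keyword arguments in one place. The function below is a hypothetical helper, not part of the PyHive API; it only builds the arguments for hive.Connection, and for the Kerberos case a valid ticket (e.g. obtained via kinit) must already exist in the credential cache:

```python
def hive_conn_kwargs(host, port=10000, database="default",
                     use_kerberos=False, service_name="hive"):
    """Build keyword arguments for pyhive.hive.Connection.

    When use_kerberos is True, selects the KERBEROS auth mechanism and
    passes the Kerberos service principal name of HiveServer2.
    """
    kwargs = {"host": host, "port": port, "database": database}
    if use_kerberos:
        kwargs["auth"] = "KERBEROS"
        kwargs["kerberos_service_name"] = service_name
    return kwargs


# Matches the Connection call in the snippet above:
print(hive_conn_kwargs("localhost", database="test_db", use_kerberos=True))
```

With PyHive available, the result would be used as hive.Connection(**hive_conn_kwargs(...)).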