Spark provides flexible APIs to read data from various data sources, including Hive databases. The article Spark - Save DataFrame to Hive Table provides guidance on writing Spark DataFrames to Hive tables; this article provides examples of reading data from Hive using PySpark.
Prerequisites
Environment
- Spark - If you don't have a Spark environment yet, you can follow these articles to install Spark on your machine.
- Hive - Similarly, follow the Hive installation articles to install Hive.
Sample table
Create a sample Hive table using the following HQL (the first statement is only needed if the test_db database does not exist yet):
create database if not exists test_db;
create table test_db.test_table(id int, attr string);
insert into test_db.test_table(id, attr) values (1,'a'), (2,'b'), (3,'c');
The statements create a table with three records:
select * from test_db.test_table;
1 a
2 b
3 c
Read data from Hive
Now we can create a PySpark script (read-hive.py) to read from the Hive table.
from pyspark.sql import SparkSession
appName = "PySpark Example - Read Hive"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
# Create DF by reading from Hive
df = spark.sql("select * from test_db.test_table")
print(df.schema)
df.show()
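Note that Spark must be able to locate your Hive metastore. Typically this is done by placing hive-site.xml in Spark's conf directory; alternatively (a minimal sketch, not part of the original script), the metastore URI can be passed to the builder explicitly, using the thrift URI that appears in the connection logs below:
# Alternative session setup: point Spark at the Hive metastore explicitly.
# The hive.metastore.uris value is environment-specific; adjust it to your setup.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .config("hive.metastore.uris", "thrift://127.0.0.1:9083") \
    .enableHiveSupport() \
    .getOrCreate()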
enableHiveSupport forces Spark to use the Hive metastore as its catalog instead of the default in-memory catalog. You will see logs of Spark connecting to the Hive metastore thrift service like the following:
2022-07-08T19:43:23,205 INFO [Thread-5] hive.metastore - Trying to connect to metastore with URI thrift://127.0.0.1:9083
2022-07-08T19:43:23,225 INFO [Thread-5] hive.metastore - Opened a connection to metastore, current connections: 1
2022-07-08T19:43:23,253 INFO [Thread-5] hive.metastore - Connected to metastore.
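To confirm which catalog implementation the session is actually using, you can add a quick check to the script; a minimal sketch:
# Prints 'hive' when Hive support is enabled, 'in-memory' otherwise
print(spark.conf.get("spark.sql.catalogImplementation"))
# List the databases visible through the Hive metastore
for db in spark.catalog.listDatabases():
    print(db.name)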
Run the script using the following command:
spark-submit read-hive.py
Output:
StructType([StructField('id', IntegerType(), True), StructField('attr', StringType(), True)])
+---+----+
| id|attr|
+---+----+
| 1| a|
| 2| b|
| 3| c|
+---+----+
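Besides spark.sql, the same table can also be loaded through the DataFrame API; a minimal sketch using the same test_db.test_table:
# Read the Hive table directly via the catalog and apply DataFrame transformations
df2 = spark.read.table("test_db.test_table")
df2.filter(df2.id > 1).show()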
With Hive support enabled, you can also use Hive built-in SQL functions that are not available as Spark SQL built-in functions.
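For example, Hive's histogram_numeric aggregate was only added to Spark SQL's own built-in functions in later Spark releases, but with Hive support it can be resolved through the Hive session catalog; a minimal sketch (the exact set of Hive-only functions depends on your Spark and Hive versions):
# histogram_numeric(col, nb) returns nb approximate (x, y) histogram bins for the column
spark.sql("select histogram_numeric(id, 2) as hist from test_db.test_table").show(truncate=False)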