PySpark - Read from Hive Tables

Raymond Raymond visibility 2,381 event 2022-07-08 access_time 2 years ago language English
more_vert

Spark provides flexible APIs to read data from various data sources including Hive databases. In article Spark - Save DataFrame to Hive Table, it provides guidance about writing Spark DataFrame to Hive tables; this article will provides you examples of reading data from Hive using PySpark.

Prerequisites

Environment

  1. Spark - If you don't have Spark environment, you can follow these articles to install Spark in your machine. 
  2. Hive - Similarly, follow hive installation articles to install Hive. 

Sample table

Create a sample Hive table using the following HQL:

create table test_db.test_table(id int, attr string);
insert into test_db.test_table(id, attr) values (1,'a'), (2,'b'),(3,'c');

The statements create a table with three records:

select * from test_db.test_table;
1       a
2       b
3       c

Read data from Hive

Now we can create a PySpark script (read-hive.py) to read from Hive table.

from pyspark.sql import SparkSession

appName = "PySpark Example - Read Hive"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# Create DF by reading from Hive
df = spark.sql("select * from test_db.test_table")
print(df.schema)
df.show()

enableHiveSupport will force Spark to use Hive data data catalog instead of in-memory catalog. You will be able to see logs of connecting Hive metastore thrift service like the following:

2022-07-08T19:43:23,205 INFO [Thread-5] hive.metastore - Trying to connect to metastore with URI thrift://127.0.0.1:9083
2022-07-08T19:43:23,225 INFO [Thread-5] hive.metastore - Opened a connection to metastore, current connections: 1
2022-07-08T19:43:23,253 INFO [Thread-5] hive.metastore - Connected to metastore.

Run the script using the following command:

spark-submit read-hive.py

Output:

StructType([StructField('id', IntegerType(), True), StructField('attr', StringType(), True)])
+---+----+
| id|attr|
+---+----+
|  1|   a|
|  2|   b|
|  3|   c|
+---+----+

With Hive support enabled, you can also use Hive built-in SQL functions that don't exist in Spark SQL functions. 

More from Kontext
copyright This page is subject to Site terms.
Like this article?
Share on
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts