PySpark - Save DataFrame into Hive Table using insertInto

2022-08-24

Code description

This code snippet provides an example of inserting data into a Hive table using the PySpark DataFrameWriter.insertInto API.

DataFrameWriter.insertInto(tableName: str, overwrite: Optional[bool] = None)

It takes two parameters: tableName - the existing table to insert data into; overwrite - whether to overwrite existing data. By default, existing data is not overwritten and the new rows are appended.
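
The two modes can be sketched as follows. This is a minimal illustration rather than a definitive implementation: insert_rows is a hypothetical wrapper (not part of PySpark), and it assumes df is a PySpark DataFrame and that the target table already exists.

```python
def insert_rows(df, table_name, overwrite=False):
    """Insert df into an existing table, matching columns by position.

    Hypothetical helper, not part of PySpark.
    overwrite=False appends to the existing data;
    overwrite=True replaces the existing data.
    """
    # DataFrameWriter.insertInto(tableName, overwrite) is the API described above.
    df.write.insertInto(table_name, overwrite=overwrite)
```

For example, insert_rows(df, 'test_db.bucket_table') appends the rows of df, while insert_rows(df, 'test_db.bucket_table', overwrite=True) replaces what is already in the table.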

This function uses position-based resolution for columns instead of column names, so the DataFrame columns must be in the same order as the columns of the target table.
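
Because columns are matched by position, a DataFrame whose columns are in a different order from the table can end up writing values into the wrong columns. One way to guard against this is to reorder the columns before inserting. The sketch below is plain Python and needs no Spark session; align_columns is a hypothetical helper (not part of PySpark), and the table column order shown is an assumption made for illustration.

```python
def align_columns(df_columns, table_columns):
    """Return the DataFrame column names in the table's column order.

    Hypothetical helper, not part of PySpark. Raises ValueError when the
    two sets of column names differ, because a positional insert cannot
    line them up safely.
    """
    missing = [c for c in table_columns if c not in df_columns]
    extra = [c for c in df_columns if c not in table_columns]
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    return list(table_columns)


# DataFrame columns in a different order from the (assumed) table schema.
df_columns = ['country', 'user_id', 'key']
table_columns = ['user_id', 'key', 'country']

ordered = align_columns(df_columns, table_columns)
print(ordered)
```

With a live session, the table's column order can be read from spark.table('test_db.bucket_table').columns, and df.select(*ordered) then yields a DataFrame whose column positions match the table before insertInto is called.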

Code snippet

from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create a Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# Prepare sample data for inserting into the Hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([int(i),  'U'+str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['user_id', 'key', 'country'])

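# Save df to Hive table test_db.bucket_table.
# The table must already exist; columns are matched by position.
df.write.insertInto('test_db.bucket_table')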
