PySpark - Save DataFrame into Hive Table using insertInto

2022-08-24
Kontext Code Snippets & Tips

Code snippets and tips for various programming languages/frameworks. All code examples are under MIT or Apache 2.0 license unless specified otherwise. 

Code description

This code snippet provides an example of inserting data into a Hive table using the PySpark DataFrameWriter.insertInto API.

DataFrameWriter.insertInto(tableName: str, overwrite: Optional[bool] = None)

It takes two parameters: tableName, the table to insert data into, and overwrite, which controls whether existing data is replaced. By default, existing data is not overwritten; passing overwrite=True (or calling .mode('overwrite') on the writer) replaces it.

Unlike saveAsTable, this function resolves columns by position rather than by name, so the DataFrame's column order must match the target table's schema.


Code snippet

from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# prepare sample data for inserting into hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([int(i), 'U' + str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()

# Save df to Hive table test_db.bucket_table

df.write.mode('append').insertInto('test_db.bucket_table') 