from pyspark.sql import SparkSession
appName = "PySpark Hive Bucketing Example"
master = "local"
# Create a Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()
# Prepare sample data to insert into the Hive table.
data = []
countries = ['CN', 'AU']
for i in range(1000):
    data.append([i, 'U' + str(i), countries[i % 2]])
df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()
# Append df to the Hive table test_db.bucket_table.
# Note: insertInto requires the target table to already exist, and it
# resolves columns by position, not by name, so the DataFrame's column
# order must match the table schema.
df.write.mode('append').insertInto('test_db.bucket_table')
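Because insertInto does not create the target table, test_db.bucket_table must exist before the write. A possible Spark SQL DDL for it is sketched below; the database name, storage format, bucketing column, and bucket count are assumptions chosen to match the sample DataFrame, not taken from the original article.

```sql
-- Hypothetical DDL sketch; format and bucket count are assumptions.
CREATE TABLE test_db.bucket_table (
    user_id INT,
    key     STRING,
    country STRING
)
USING PARQUET
CLUSTERED BY (country) INTO 2 BUCKETS;
```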
PySpark - Save DataFrame into Hive Table using insertInto
This code snippet provides an example of inserting data into a Hive table using the PySpark DataFrameWriter.insertInto API.
DataFrameWriter.insertInto(tableName: str, overwrite: Optional[bool] = None)
It takes two parameters:
- tableName: the table to insert data into.
- overwrite: whether to overwrite existing data; by default, existing data is not overwritten.
Unlike saveAsTable, insertInto resolves columns by position instead of by name, so the order of the DataFrame's columns must match the target table's schema.
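To make position-based resolution concrete, here is a plain-Python sketch (not the Spark API itself; all names are illustrative) showing how a DataFrame whose columns are ordered differently from the table schema would silently write values into the wrong columns under positional matching, while name-based matching would place them correctly.

```python
# Table schema, in the order the table was defined.
table_schema = ['user_id', 'key', 'country']

# A DataFrame row whose columns come in a different order.
df_columns = ['country', 'user_id', 'key']
row = ['CN', 42, 'U42']

# insertInto-style resolution: values map to table columns by position,
# ignoring the DataFrame's own column names.
by_position = dict(zip(table_schema, row))

# saveAsTable-style resolution for comparison: values map by column name.
by_name = {col: row[df_columns.index(col)] for col in table_schema}

print(by_position)  # {'user_id': 'CN', 'key': 42, 'country': 'U42'} -- wrong columns
print(by_name)      # {'user_id': 42, 'key': 'U42', 'country': 'CN'} -- correct
```

The mismatch in by_position is why reordering DataFrame columns (for example with df.select) before calling insertInto matters.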
Code snippet
Last modified by Kontext 8 months ago