Code description
This code snippet provides an example of inserting data into a Hive table using the PySpark DataFrameWriter.insertInto API.
DataFrameWriter.insertInto(tableName: str, overwrite: Optional[bool] = None)
It takes two parameters:
- tableName: the table to insert data into.
- overwrite: whether to overwrite existing data. By default, existing data is not overwritten.
**This function uses position-based resolution for columns instead of column names.** The DataFrame's columns must therefore be in the same order as the columns of the target table.
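Because columns are matched by position, a DataFrame whose columns are in a different order than the table's can end up with values written into the wrong columns. One way to guard against this is to reorder the DataFrame's columns to the table's order before inserting. The following is a minimal sketch under stated assumptions, not part of the PySpark API: the helper name columns_in_table_order is hypothetical, and it assumes column names match exactly, including case.

```python
def columns_in_table_order(df_columns, table_columns):
    """Return the DataFrame's column names reordered to the table's order.

    Hypothetical helper (not part of PySpark). It assumes the names match
    exactly, including case, and raises ValueError when the two sets of
    names differ instead of letting a positional insert go wrong.
    """
    missing = [c for c in table_columns if c not in df_columns]
    extra = [c for c in df_columns if c not in table_columns]
    if missing or extra:
        raise ValueError(
            f"column mismatch: missing={missing}, extra={extra}"
        )
    return list(table_columns)


# Possible usage with a DataFrame `df` and an existing Hive table
# (requires a Spark session with Hive support, so it is left as a comment):
#   target = 'test_db.bucket_table'
#   ordered = columns_in_table_order(df.columns, spark.table(target).columns)
#   df.select(*ordered).write.insertInto(target)
```

Selecting the columns in table order first keeps insertInto's positional matching aligned with the column names, and the explicit check turns a mismatch into an error that is raised before any data is written.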
Code snippet
from pyspark.sql import SparkSession
appName = "PySpark Hive Bucketing Example"
master = "local"
# Create a Spark session with Hive support enabled.
spark = (SparkSession.builder
         .appName(appName)
         .master(master)
         .enableHiveSupport()
         .getOrCreate())
# Prepare sample data to insert into the Hive table.
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([i, 'U' + str(i), countries[i % 2]])
df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()
# Append df to the Hive table test_db.bucket_table.
# insertInto requires the table to already exist.
df.write.mode('append').insertInto('test_db.bucket_table')