PySpark: Convert Python Dictionary List to Spark DataFrame
This article shows how to convert a list of Python dictionaries to a Spark DataFrame. The code snippets run in Spark 2.x environments.
Input
The input data (a list of dictionaries) looks like the following:
data = [
    {"Category": "Category A", "ItemID": 1, "Amount": 12.40},
    {"Category": "Category B", "ItemID": 2, "Amount": 30.10},
    {"Category": "Category C", "ItemID": 3, "Amount": 100.01},
    {"Category": "Category A", "ItemID": 4, "Amount": 110.01},
    {"Category": "Category B", "ItemID": 5, "Amount": 70.85}
]
Solution 1 - Infer schema
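In Spark 2.x, a DataFrame can be created directly from a list of Python dictionaries, and the schema is inferred automatically. Column names come from the dictionary keys and, as the output below shows, appear in alphabetical order rather than in the order the keys were written.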
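# Assumes `spark` is an active SparkSession and `data` is the list defined above
def infer_schema():
    # Create data frame; column names and types are inferred from the dictionaries
    df = spark.createDataFrame(data)
    print(df.schema)
    df.show()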
StructType(List(StructField(Amount,DoubleType,true),StructField(Category,StringType,true),StructField(ItemID,LongType,true)))
+------+----------+------+
|Amount|  Category|ItemID|
+------+----------+------+
|  12.4|Category A|     1|
|  30.1|Category B|     2|
|100.01|Category C|     3|
|110.01|Category A|     4|
| 70.85|Category B|     5|
+------+----------+------+
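In the output above, the inferred columns appear in alphabetical order. If a specific column order matters, one option is to flatten each dictionary into a tuple and pass the column names as the second argument of `createDataFrame`, so that only the data types are inferred. The following is a minimal sketch under those assumptions, not part of the original code: the helper names `dicts_to_rows` and `create_ordered_dataframe` are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
def dicts_to_rows(records, columns):
    """Turn each dictionary into a tuple whose values follow the order of `columns`."""
    return [tuple(record[name] for name in columns) for record in records]


def create_ordered_dataframe(spark, records, columns):
    """Create a DataFrame with columns in the given order; types are still inferred.

    `spark` is assumed to be an active SparkSession (hypothetical helper).
    """
    rows = dicts_to_rows(records, columns)
    # Passing a list of column names leaves only the data types to be inferred.
    return spark.createDataFrame(rows, columns)


# Pure-Python part only, no Spark session required:
sample = [{"Category": "Category A", "ItemID": 1, "Amount": 12.40}]
print(dicts_to_rows(sample, ["Category", "ItemID", "Amount"]))  # [('Category A', 1, 12.4)]
```

With a session available, `create_ordered_dataframe(spark, data, ["Category", "ItemID", "Amount"])` would be the corresponding call for the input list shown earlier.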
Solution 2 - Explicit schema
Of course, you can also define the schema directly when creating the data frame:
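from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

# Assumes `spark` is an active SparkSession and `data` is the list defined above
def explicit_schema():
    # Create a schema for the data frame: (name, type, nullable)
    schema = StructType([
        StructField('Category', StringType(), False),
        StructField('ItemID', IntegerType(), False),
        StructField('Amount', FloatType(), True)
    ])
    # Create data frame using the explicit schema
    df = spark.createDataFrame(data, schema)
    print(df.schema)
    df.show()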
This way, you control the column order, data types, and nullability explicitly. The output looks like the following:
StructType(List(StructField(Category,StringType,false),StructField(ItemID,IntegerType,false),StructField(Amount,FloatType,true)))
+----------+------+------+
|  Category|ItemID|Amount|
+----------+------+------+
|Category A|     1|  12.4|
|Category B|     2|  30.1|
|Category C|     3|100.01|
|Category A|     4|110.01|
|Category B|     5| 70.85|
+----------+------+------+
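The schema above declares `Category` and `ItemID` as non-nullable, so a record with a missing or `None` value for either key is expected to be rejected by Spark's schema verification when the data frame is created. A minimal sketch of a pure-Python pre-check that reports such records before Spark sees them follows; `find_invalid_records` and `REQUIRED_FIELDS` are hypothetical names, not part of the original code or of PySpark.

```python
REQUIRED_FIELDS = ("Category", "ItemID")  # fields declared with nullable=False


def find_invalid_records(records, required=REQUIRED_FIELDS):
    """Return (index, missing_fields) pairs for records lacking a required value."""
    problems = []
    for index, record in enumerate(records):
        # dict.get returns None for absent keys, so this covers both
        # a missing key and an explicit None value.
        missing = [name for name in required if record.get(name) is None]
        if missing:
            problems.append((index, missing))
    return problems


records = [
    {"Category": "Category A", "ItemID": 1, "Amount": 12.40},
    {"Category": None, "ItemID": 2, "Amount": 30.10},
    {"Category": "Category C", "Amount": 100.01},
]
print(find_invalid_records(records))  # [(1, ['Category']), (2, ['ItemID'])]
```

An empty result means every record has values for the non-nullable fields, and the list can then be passed to `spark.createDataFrame(data, schema)`.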
Summary
In Spark 2.x, you can convert a list of Python dictionaries to a Spark DataFrame either by letting Spark infer the schema or by supplying an explicit one.
Complete code
The code is available on GitHub:
https://github.com/FahaoTang/spark-examples/tree/master/python-dict-list