This article shows you how to convert a Python dictionary list to a Spark DataFrame. The code snippets run in Spark 2.x environments.
Input
The input data (a dictionary list) looks like the following:
data = [{"Category": 'Category A', 'ItemID': 1, 'Amount': 12.40},
{"Category": 'Category B', 'ItemID': 2, 'Amount': 30.10},
{"Category": 'Category C', 'ItemID': 3, 'Amount': 100.01},
{"Category": 'Category A', 'ItemID': 4, 'Amount': 110.01},
{"Category": 'Category B', 'ItemID': 5, 'Amount': 70.85}
]
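The snippets below assume an active SparkSession named spark. If you are not running in an interactive shell (where spark is created for you), a minimal setup might look like this; the application name is arbitrary:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the snippets below refer to it as `spark`.
spark = SparkSession.builder \
    .appName("dict-list-to-dataframe") \
    .getOrCreate()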
Solution 1 - Infer schema
In Spark 2.x, a DataFrame can be created directly from a Python dictionary list, and the schema will be inferred automatically.
def infer_schema():
    # Create data frame
    df = spark.createDataFrame(data)
    print(df.schema)
    df.show()
The output looks like the following:
StructType(List(StructField(Amount,DoubleType,true),StructField(Category,StringType,true),StructField(ItemID,LongType,true)))
+------+----------+------+
|Amount| Category|ItemID|
+------+----------+------+
| 12.4|Category A| 1|
| 30.1|Category B| 2|
|100.01|Category C| 3|
|110.01|Category A| 4|
| 70.85|Category B| 5|
+------+----------+------+
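Note that in Spark 2.x, inferring the schema from a list of dicts raises a deprecation warning suggesting pyspark.sql.Row. If you want to avoid the warning, one variation (not part of the original code) is to convert each dictionary to a Row first; the schema is still inferred from the values:

from pyspark.sql import Row

def infer_schema_from_rows():
    # Build a Row from each dictionary, then let Spark infer the schema.
    df = spark.createDataFrame([Row(**item) for item in data])
    df.show()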
Solution 2 - Explicit schema
Of course, you can also define the schema directly when creating the data frame:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

def explicit_schema():
    # Create a schema for the dataframe
    schema = StructType([
        StructField('Category', StringType(), False),
        StructField('ItemID', IntegerType(), False),
        StructField('Amount', FloatType(), True)
    ])
    # Create data frame
    df = spark.createDataFrame(data, schema)
    print(df.schema)
    df.show()
In this way, you can control the data types explicitly. The output looks like the following:
StructType(List(StructField(Category,StringType,false),StructField(ItemID,IntegerType,false),StructField(Amount,FloatType,true)))
+----------+------+------+
| Category|ItemID|Amount|
+----------+------+------+
|Category A| 1| 12.4|
|Category B| 2| 30.1|
|Category C| 3|100.01|
|Category A| 4|110.01|
|Category B| 5| 70.85|
+----------+------+------+
You will notice that the column order follows the explicit schema, whereas the inferred schema sorted the columns alphabetically.
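If you prefer to keep schema inference but still want a specific column order, one option (not shown in the original code) is to reorder the columns with select:

# Reorder the columns of the schema-inferred DataFrame to match the explicit schema.
df = spark.createDataFrame(data)
df.select('Category', 'ItemID', 'Amount').show()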
Summary
You can easily convert a Python dictionary list to a Spark DataFrame in Spark 2.x, either by letting Spark infer the schema or by defining the schema explicitly.
Complete code
The complete code is available on GitHub:
https://github.com/FahaoTang/spark-examples/tree/master/python-dict-list