PySpark: Convert Python Dictionary List to Spark DataFrame
This article shows how to convert a list of Python dictionaries to a Spark DataFrame. The code snippets run in Spark 2.x environments.
Input
The input data (a list of dictionaries) looks like the following:
data = [
    {'Category': 'Category A', 'ItemID': 1, 'Amount': 12.40},
    {'Category': 'Category B', 'ItemID': 2, 'Amount': 30.10},
    {'Category': 'Category C', 'ItemID': 3, 'Amount': 100.01},
    {'Category': 'Category A', 'ItemID': 4, 'Amount': 110.01},
    {'Category': 'Category B', 'ItemID': 5, 'Amount': 70.85}
]
Solution 1 - Infer schema
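In Spark 2.x, a DataFrame can be created directly from a list of Python dictionaries, and the schema is inferred automatically.
from pyspark.sql import SparkSession

# Reuse the active Spark session or start a new one
spark = SparkSession.builder.getOrCreate()

def infer_schema():
    # Create the data frame and let Spark infer the schema from the dictionaries
    df = spark.createDataFrame(data)
    print(df.schema)
    df.show()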
The output looks like the following:
StructType(List(StructField(Amount,DoubleType,true),StructField(Category,StringType,true),StructField(ItemID,LongType,true)))
+------+----------+------+
|Amount|  Category|ItemID|
+------+----------+------+
|  12.4|Category A|     1|
|  30.1|Category B|     2|
|100.01|Category C|     3|
|110.01|Category A|     4|
| 70.85|Category B|     5|
+------+----------+------+
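The inferred result above can be read as two rules: the columns come out in alphabetical key order, and Python float, str, and int values become DoubleType, StringType, and LongType. The pure-Python sketch below mimics that behavior for illustration only. It does not call Spark, `PYTHON_TO_SPARK` and `sketch_inferred_schema` are hypothetical names rather than PySpark APIs, and the mapping covers only the three types that appear in this example.

```python
# Illustration only: reproduces the inferred column order and type names
# shown above without using Spark.
# PYTHON_TO_SPARK and sketch_inferred_schema are hypothetical names.
PYTHON_TO_SPARK = {
    float: "DoubleType",
    str: "StringType",
    int: "LongType",
}


def sketch_inferred_schema(record):
    """Return (column, type name) pairs sorted by key."""
    return [
        (key, PYTHON_TO_SPARK[type(value)])
        for key, value in sorted(record.items())
    ]


first_record = {"Category": "Category A", "ItemID": 1, "Amount": 12.40}
inferred = sketch_inferred_schema(first_record)
for column, type_name in inferred:
    print(column, type_name)
```

The printed pairs follow the same order and type names as the StructType printed by Solution 1.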
Solution 2 - Explicit schema
Of course, you can also define the schema directly when creating the data frame:
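from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, FloatType)

# Reuse the active Spark session or start a new one
spark = SparkSession.builder.getOrCreate()

def explicit_schema():
    # Create a schema for the data frame: field name, data type, nullable flag
    schema = StructType([
        StructField('Category', StringType(), False),
        StructField('ItemID', IntegerType(), False),
        StructField('Amount', FloatType(), True)
    ])
    # Create the data frame with the explicit schema
    df = spark.createDataFrame(data, schema)
    print(df.schema)
    df.show()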
In this way, you can control the data types explicitly. The output looks like the following:
StructType(List(StructField(Category,StringType,false),StructField(ItemID,IntegerType,false),StructField(Amount,FloatType,true)))
+----------+------+------+
|  Category|ItemID|Amount|
+----------+------+------+
|Category A|     1|  12.4|
|Category B|     2|  30.1|
|Category C|     3|100.01|
|Category A|     4|110.01|
|Category B|     5| 70.85|
+----------+------+------+
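One thing to keep in mind when choosing FloatType over DoubleType: FloatType is a 4-byte single-precision type, while Python floats are 8-byte doubles, so a value such as 100.01 is stored with a small rounding error. The sketch below shows the effect in plain Python by round-tripping a value through single precision with the standard `struct` module. The helper name `to_float32` is an illustrative assumption, and how Spark displays such values is not covered here.

```python
import struct


def to_float32(value):
    """Round-trip a Python float through 4-byte IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


amount = 100.01
as_float32 = to_float32(amount)

# The single-precision value is close to, but not exactly, the original.
print(as_float32 == amount)
print(abs(as_float32 - amount) < 1e-5)
```

If exact agreement with the Python values matters, declaring the column as DoubleType avoids this narrowing.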
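Notice that the column order differs from the inferred result: the columns follow the order declared in the schema (Category, ItemID, Amount), whereas the inferred schema lists them alphabetically (Amount, Category, ItemID). The types differ as well: ItemID is IntegerType instead of LongType, and Amount is FloatType instead of DoubleType.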
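If the goal is only to fix the column order while still letting Spark infer the types, one option is to turn each dictionary into a tuple in the desired order and pass a list of column names as the schema. The following is a minimal sketch of that preparation step. `column_order` and `rows` are illustrative names, the sample records are a shortened copy of the input, and the Spark call is left as a comment because it needs an active `spark` session.

```python
# Sample records in the same shape as the input data.
data = [
    {"Category": "Category A", "ItemID": 1, "Amount": 12.40},
    {"Category": "Category B", "ItemID": 2, "Amount": 30.10},
]

# Desired column order (illustrative name, not from the original code).
column_order = ["Category", "ItemID", "Amount"]

# Convert each dictionary to a tuple whose values follow column_order.
rows = [tuple(record[name] for name in column_order) for record in data]
print(rows)

# With an active SparkSession named `spark`, the tuples and the column names
# can be passed together so that only the data types are inferred:
# df = spark.createDataFrame(rows, column_order)
```

Compared with a full StructType, this keeps the code short but gives up control over the data types and nullability.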
Summary
In Spark 2.x, you can convert a list of Python dictionaries to a Spark DataFrame either by letting Spark infer the schema or by supplying an explicit schema.
Complete code
The code is available on GitHub:
https://github.com/FahaoTang/spark-examples/tree/master/python-dict-list
Last modified by Administrator 4 years ago