PySpark: Convert Python Dictionary List to Spark DataFrame

visibility 11,729 event 2019-12-31 access_time 3 years ago language English
more_vert

This articles show you how to convert a Python dictionary list to a Spark DataFrame. The code snippets runs on Spark 2.x environments.

Input

The input data (dictionary list looks like the following):

data = [{"Category": 'Category A', 'ItemID': 1, 'Amount': 12.40},
        {"Category": 'Category B', 'ItemID': 2, 'Amount': 30.10},
        {"Category": 'Category C', 'ItemID': 3, 'Amount': 100.01},
        {"Category": 'Category A', 'ItemID': 4, 'Amount': 110.01},
        {"Category": 'Category B', 'ItemID': 5, 'Amount': 70.85}
        ]

Solution 1 - Infer schema

In Spark 2.x, DataFrame can be directly created from Python dictionary list and the schema will be inferred automatically. 

def infer_schema():
    # Create data frame
    df = spark.createDataFrame(data)
    print(df.schema)
    df.show()
The output looks like the following:
StructType(List(StructField(Amount,DoubleType,true),StructField(Category,StringType,true),StructField(ItemID,LongType,true)))
+------+----------+------+
|Amount|  Category|ItemID|
+------+----------+------+
|  12.4|Category A|     1|
|  30.1|Category B|     2|
|100.01|Category C|     3|
|110.01|Category A|     4|
| 70.85|Category B|     5|
+------+----------+------+

Solution 2 - Explicit schema

Of course, you can also define the schema directly when creating the data frame:

def explicit_schema():
    # Create a schema for the dataframe
    schema = StructType([
        StructField('Category', StringType(), False),
        StructField('ItemID', IntegerType(), False),
        StructField('Amount', FloatType(), True)
    ])

    # Create data frame
    df = spark.createDataFrame(data, schema)
    print(df.schema)
    df.show()

In this way, you can control the data types explicitly. The output looks like the following:

StructType(List(StructField(Category,StringType,false),StructField(ItemID,IntegerType,false),StructField(Amount,FloatType,true)))
+----------+------+------+
|  Category|ItemID|Amount|
+----------+------+------+
|Category A|     1|  12.4|
|Category B|     2|  30.1|
|Category C|     3|100.01|
|Category A|     4|110.01|
|Category B|     5| 70.85|
+----------+------+------+
You will notice that the sequence of attributes is slightly different from the inferred one.

Summary

You can easily convert Python list to Spark DataFrame in Spark 2.x. 

Complete code

Code is available in GitHub:

https://github.com/FahaoTang/spark-examples/tree/master/python-dict-list

info Last modified by Administrator 3 years ago copyright This page is subject to Site terms.
Like this article?
Share on

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts