PySpark: Convert Python Dictionary List to Spark DataFrame

This article shows how to convert a Python dictionary list to a Spark DataFrame. The code snippets run in Spark 2.x environments.

Input

The input data (a list of dictionaries) looks like the following:

data = [{"Category": 'Category A', 'ItemID': 1, 'Amount': 12.40},
        {"Category": 'Category B', 'ItemID': 2, 'Amount': 30.10},
        {"Category": 'Category C', 'ItemID': 3, 'Amount': 100.01},
        {"Category": 'Category A', 'ItemID': 4, 'Amount': 110.01},
        {"Category": 'Category B', 'ItemID': 5, 'Amount': 70.85}
        ]

Solution 1 - Infer schema

In Spark 2.x, a DataFrame can be created directly from a Python dictionary list. The schema is inferred automatically: column names come from the dictionary keys and column types from the values.

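from pyspark.sql import SparkSession

# In the pyspark shell `spark` already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

def infer_schema():
    # Create the data frame; the schema is inferred from the dictionaries
    df = spark.createDataFrame(data)
    print(df.schema)
    df.show()

The output looks like the following:
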
StructType(List(StructField(Amount,DoubleType,true),StructField(Category,StringType,true),StructField(ItemID,LongType,true)))
+------+----------+------+
|Amount|  Category|ItemID|
+------+----------+------+
|  12.4|Category A|     1|
|  30.1|Category B|     2|
|100.01|Category C|     3|
|110.01|Category A|     4|
| 70.85|Category B|     5|
+------+----------+------+
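
To make the inferred result above easier to read, here is a minimal pure-Python sketch that reproduces the field names, types and order shown in the printed schema. It is an illustration only and not Spark's actual implementation; the names `PYTHON_TO_SPARK_TYPE` and `sketch_inferred_fields` are hypothetical, and the lookup table covers only the three value types used in this article.

```python
# Illustration only: mimics the observable result of schema inference,
# not Spark's internal code.
data = [
    {"Category": "Category A", "ItemID": 1, "Amount": 12.40},
    {"Category": "Category B", "ItemID": 2, "Amount": 30.10},
]

# Hypothetical lookup table for the value types that appear in this article.
PYTHON_TO_SPARK_TYPE = {str: "StringType", int: "LongType", float: "DoubleType"}


def sketch_inferred_fields(record):
    """Return (column name, Spark type name) pairs, ordered by key name."""
    return [(key, PYTHON_TO_SPARK_TYPE[type(value)])
            for key, value in sorted(record.items())]


print(sketch_inferred_fields(data[0]))
# [('Amount', 'DoubleType'), ('Category', 'StringType'), ('ItemID', 'LongType')]
```

Python integers map to LongType and Python floats to DoubleType here, which matches the schema printed by `infer_schema()`.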

Solution 2 - Explicit schema

Of course, you can also define the schema directly when creating the data frame:

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, FloatType)

def explicit_schema():
    # Create a schema for the data frame: (name, type, nullable)
    schema = StructType([
        StructField('Category', StringType(), False),
        StructField('ItemID', IntegerType(), False),
        StructField('Amount', FloatType(), True)
    ])

    # Create the data frame using the explicit schema
    df = spark.createDataFrame(data, schema)
    print(df.schema)
    df.show()

In this way, you can control the data types explicitly. The output looks like the following:

StructType(List(StructField(Category,StringType,false),StructField(ItemID,IntegerType,false),StructField(Amount,FloatType,true)))
+----------+------+------+
|  Category|ItemID|Amount|
+----------+------+------+
|Category A|     1|  12.4|
|Category B|     2|  30.1|
|Category C|     3|100.01|
|Category A|     4|110.01|
|Category B|     5| 70.85|
+----------+------+------+
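
Note that the column order differs from the inferred one: with an inferred schema the columns are sorted alphabetically by dictionary key, whereas with an explicit schema they follow the order of the fields in the schema.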
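
If you want a specific column order without writing a full schema, one option is to convert the dictionaries to tuples in that order first. The following is a minimal sketch under that assumption; `columns` and `to_ordered_rows` are hypothetical names, and the Spark call is left as a comment because it needs a running SparkSession.

```python
data = [
    {"Category": "Category A", "ItemID": 1, "Amount": 12.40},
    {"Category": "Category B", "ItemID": 2, "Amount": 30.10},
]

# Desired column order; the same order as the explicit schema in Solution 2.
columns = ["Category", "ItemID", "Amount"]


def to_ordered_rows(records, column_names):
    """Convert each dictionary to a tuple whose values follow column_names."""
    return [tuple(record[name] for name in column_names) for record in records]


rows = to_ordered_rows(data, columns)
print(rows[0])  # ('Category A', 1, 12.4)

# With a SparkSession named `spark` available, the rows can be passed as:
# df = spark.createDataFrame(rows, columns)
```

Passing a list of column names as the second argument lets Spark infer the types from the tuple values while keeping the names and order you chose.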

Summary

In Spark 2.x, you can convert a Python dictionary list to a Spark DataFrame either by letting Spark infer the schema or by providing an explicit schema.

Complete code

The code is available on GitHub:

https://github.com/FahaoTang/spark-examples/tree/master/python-dict-list
