PySpark: Convert Python Dictionary List to Spark DataFrame

This article shows you how to convert a Python dictionary list to a Spark DataFrame. The code snippets run on Spark 2.x environments.

Input

The input data (a dictionary list) looks like the following:

data = [{"Category": 'Category A', 'ItemID': 1, 'Amount': 12.40},
        {"Category": 'Category B', 'ItemID': 2, 'Amount': 30.10},
        {"Category": 'Category C', 'ItemID': 3, 'Amount': 100.01},
        {"Category": 'Category A', 'ItemID': 4, 'Amount': 110.01},
        {"Category": 'Category B', 'ItemID': 5, 'Amount': 70.85}
        ]
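
The functions in this article use a `spark` variable that the text does not define. The following is a minimal sketch of one way to obtain it, assuming PySpark is installed locally. The helper name `get_spark`, the application name, and the `local[*]` master are illustrative choices, not taken from the original code:

```python
def get_spark(app_name="dict-list-example"):
    """Return a SparkSession, creating one if none exists (assumed local setup)."""
    # Imported lazily so this helper can be defined without starting Spark.
    from pyspark.sql import SparkSession

    return (SparkSession.builder
            .appName(app_name)      # hypothetical application name
            .master("local[*]")     # assumption: run locally on all cores
            .getOrCreate())
```

Calling `spark = get_spark()` before `infer_schema()` or `explicit_schema()` makes the variable available. In the `pyspark` shell, a `spark` session already exists.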

Solution 1 - Infer schema

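In Spark 2.x, a DataFrame can be created directly from a Python dictionary list, and the schema is inferred automatically. Note that Spark 2.x emits a warning that inferring the schema from a dict is deprecated and suggests using pyspark.sql.Row instead.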

def infer_schema():
    # Create data frame; `spark` is an existing SparkSession and
    # `data` is the dictionary list defined above
    df = spark.createDataFrame(data)
    print(df.schema)
    df.show()

The output looks like the following:

StructType(List(StructField(Amount,DoubleType,true),StructField(Category,StringType,true),StructField(ItemID,LongType,true)))
+------+----------+------+
|Amount|  Category|ItemID|
+------+----------+------+
|  12.4|Category A|     1|
|  30.1|Category B|     2|
|100.01|Category C|     3|
|110.01|Category A|     4|
| 70.85|Category B|     5|
+------+----------+------+
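
The inferred types in this output follow from the Python types of the values: PySpark maps `float` to `DoubleType`, `str` to `StringType`, and `int` to `LongType`, and the columns appear in alphabetical key order. The following is a pure-Python illustration of that mapping, not Spark's actual implementation. The names `PYTHON_TO_SPARK` and `describe_inferred_schema` are hypothetical:

```python
# Simplified mapping for the three value types used in this article.
PYTHON_TO_SPARK = {
    float: "DoubleType",
    str: "StringType",
    int: "LongType",
}


def describe_inferred_schema(record):
    """Return (column, Spark type name) pairs in alphabetical key order."""
    return [(key, PYTHON_TO_SPARK[type(value)])
            for key, value in sorted(record.items())]


sample = {"Category": "Category A", "ItemID": 1, "Amount": 12.40}
for column, spark_type in describe_inferred_schema(sample):
    print(column, spark_type)
```

The printed pairs line up with the fields of the `StructType` shown above.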

Solution 2 - Explicit schema

Of course, you can also define the schema directly when creating the data frame:

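from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, FloatType)

def explicit_schema():
    # Create a schema for the dataframe: field name, data type, nullable flag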
    schema = StructType([
        StructField('Category', StringType(), False),
        StructField('ItemID', IntegerType(), False),
        StructField('Amount', FloatType(), True)
    ])

    # Create data frame
    df = spark.createDataFrame(data, schema)
    print(df.schema)
    df.show()

In this way, you can control the data types explicitly. The output looks like the following:

StructType(List(StructField(Category,StringType,false),StructField(ItemID,IntegerType,false),StructField(Amount,FloatType,true)))
+----------+------+------+
|  Category|ItemID|Amount|
+----------+------+------+
|Category A|     1|  12.4|
|Category B|     2|  30.1|
|Category C|     3|100.01|
|Category A|     4|110.01|
|Category B|     5| 70.85|
+----------+------+------+
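
Note that the column order differs from the inferred version: the inferred schema lists the columns alphabetically (Amount, Category, ItemID), while the explicit schema keeps the order in which the fields were declared (Category, ItemID, Amount).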
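
One consequence of choosing `FloatType` for `Amount` is precision: `FloatType` is a 32-bit single-precision type, whereas a Python `float` (and the inferred `DoubleType`) is 64-bit. The sketch below uses only the standard library to show the rounding that a 32-bit float introduces. `to_float32` is a hypothetical helper for illustration and does not call Spark:

```python
import struct


def to_float32(value):
    """Round a Python float to the nearest 32-bit float and return it as a float."""
    return struct.unpack("f", struct.pack("f", value))[0]


amount = 100.01
as_single = to_float32(amount)

# The single-precision value is close to, but not exactly, 100.01.
print(as_single == amount)
print(abs(as_single - amount) < 1e-5)
```

If exact agreement with the Python values matters, `DoubleType` may be the safer choice for this column.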

Summary

You can easily convert a Python dictionary list to a Spark DataFrame in Spark 2.x, either by letting Spark infer the schema or by defining it explicitly.

Complete code

The code is available on GitHub:

https://github.com/FahaoTang/spark-examples/tree/master/python-dict-list
