PySpark: Convert Python Dictionary List to Spark DataFrame

access_time 10 months ago visibility3103 comment 0

This articles show you how to convert a Python dictionary list to a Spark DataFrame. The code snippets runs on Spark 2.x environments.

Input

The input data (dictionary list looks like the following):

data = [{"Category": 'Category A', 'ItemID': 1, 'Amount': 12.40},
        {"Category": 'Category B', 'ItemID': 2, 'Amount': 30.10},
        {"Category": 'Category C', 'ItemID': 3, 'Amount': 100.01},
        {"Category": 'Category A', 'ItemID': 4, 'Amount': 110.01},
        {"Category": 'Category B', 'ItemID': 5, 'Amount': 70.85}
        ]

Solution 1 - Infer schema

In Spark 2.x, DataFrame can be directly created from Python dictionary list and the schema will be inferred automatically. 

def infer_schema():
    # Create data frame
    df = spark.createDataFrame(data)
    print(df.schema)
    df.show()
The output looks like the following:
StructType(List(StructField(Amount,DoubleType,true),StructField(Category,StringType,true),StructField(ItemID,LongType,true)))
+------+----------+------+
|Amount|  Category|ItemID|
+------+----------+------+
|  12.4|Category A|     1|
|  30.1|Category B|     2|
|100.01|Category C|     3|
|110.01|Category A|     4|
| 70.85|Category B|     5|
+------+----------+------+

Solution 2 - Explicit schema

Of course, you can also define the schema directly when creating the data frame:

def explicit_schema():
    # Create a schema for the dataframe
    schema = StructType([
        StructField('Category', StringType(), False),
        StructField('ItemID', IntegerType(), False),
        StructField('Amount', FloatType(), True)
    ])

    # Create data frame
    df = spark.createDataFrame(data, schema)
    print(df.schema)
    df.show()

In this way, you can control the data types explicitly. The output looks like the following:

StructType(List(StructField(Category,StringType,false),StructField(ItemID,IntegerType,false),StructField(Amount,FloatType,true)))
+----------+------+------+
|  Category|ItemID|Amount|
+----------+------+------+
|Category A|     1|  12.4|
|Category B|     2|  30.1|
|Category C|     3|100.01|
|Category A|     4|110.01|
|Category B|     5| 70.85|
+----------+------+------+
You will notice that the sequence of attributes is slightly different from the inferred one.

Summary

You can easily convert Python list to Spark DataFrame in Spark 2.x. 

Complete code

Code is available in GitHub:

https://github.com/FahaoTang/spark-examples/tree/master/python-dict-list

info Last modified by Administrator at 2 months ago copyright This page is subject to Site terms.
Like this article?
Share on

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts

Kontext Column

Created for everyone to publish data, programming and cloud related articles.
Follow three steps to create your columns.


Learn more arrow_forward

More from Kontext

PySpark Read Multiple Lines Records from CSV

local_offer pyspark local_offer spark-2-x local_offer python local_offer spark-file-operations

visibility 1288
thumb_up 0
access_time 7 months ago

CSV is a common format used when extracting and exchanging data between systems and platforms. Once CSV file is ingested into HDFS, you can easily read them as DataFrame in Spark. However there are a few options you need to pay attention to especially if you source file: Has records across ...

Pandas DataFrame Plot - Pie Chart

local_offer plot local_offer pandas local_offer jupyter-notebook local_offer python local_offer pandas-plot

visibility 3202
thumb_up 0
access_time 6 months ago

This article provides examples about plotting pie chart using  pandas.DataFrame.plot  function. The data I'm going to use is the same as the other article  Pandas DataFrame Plot - Bar Chart . I'm also using Jupyter Notebook to plot them. The DataFrame has 9 records: DATE TYPE ...

local_offer kafka local_offer python

visibility 17
thumb_up 0
access_time 22 days ago

Apache Kafka is written with Scala. Thus, the most natural way is to use Scala (or Java) to call Kafka APIs, for example, Consumer APIs and Producer APIs. For Python developers, there are open source packages available that function similar as official Java clients.  This article shows you ...

About column

Apache Spark installation guides, performance tuning tips, general tutorials, etc.

*Spark logo is a registered trademark of Apache Spark.

rss_feed Subscribe RSS