By using this site, you acknowledge that you have read and understand our Cookie policy, Privacy policy and Terms .

This article shows how to convert a Python dictionary list to a DataFrame in Spark using Python.

Example dictionary list

data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

The above dictionary list will be used as the input.

Solution 1 - Infer schema from dict

 In Spark 2.x, schema can be directly inferred from dictionary. The following code snippets directly create the data frame using SparkSession.createDataFrame function.

Code snippet

from pyspark.sql import SparkSession

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

# Create data frame
df = spark.createDataFrame(data)
print(df.schema)
df.show()

Output

The following is the output from the above PySpark script. 

session.py:340: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
StructType(List(StructField(Category,StringType,true),StructField(ID,LongType,true),StructField(Value,DoubleType,true)))
+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1|  12.4|
|Category B|  2|  30.1|
|Category C|  3|100.01|
+----------+---+------+
The script created a DataFrame with inferred schema as:
StructType(List(StructField(Category,StringType,true),StructField(ID,LongType,true),StructField(Value,DoubleType,true)))

However there is one warning:

Warning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead

Solution 2 - Use pyspark.sql.Row

As the warning message suggests in solution 1, we are going to use pyspark.sql.Row in this solution.

Code snippet

from pyspark.sql import SparkSession, Row

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

# Create data frame
df = spark.createDataFrame([Row(**i) for i in data])
print(df.schema)
df.show()
In this code snippet, we use pyspark.sql.Row to parse dictionary item. It also uses ** to unpack keywords in each dictionary.
The output is the same as solution 1.

Solution 3 - Explicit schema

Of course, we can explicitly define the schema for the DataFrame. 
In the following code snippet, we define the schema based on the data types in the dictionary:
schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(scale=2), True)
])

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType, DecimalType
from decimal import Decimal
appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal(12.40)},
        {"Category": 'Category B', "ID": 2, "Value": Decimal(30.10)},
        {"Category": 'Category C', "ID": 3, "Value": Decimal(100.01)}
        ]

schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(scale=2), True)
])

# Create data frame
df = spark.createDataFrame(data, schema)
print(df.schema)
df.show()

There are many different ways to achieve the same goal. Let me know if you have other options. 

* This page is subject to Site terms.

More from Kontext

local_offer sqlite local_offer python local_offer Java

visibility 4
thumb_up 0
access_time 4 hours ago

To read data from SQLite database in Python, you can use the built-in sqlite3 package . Another approach is to use SQLite JDBC driver via  ...

open_in_new View open_in_new Python Programming

local_offer python local_offer sqlite

visibility 2
thumb_up 0
access_time 5 hours ago

SQLite is one of the most commonly used embedded file databases. All the mainstream programming language/framework provides APIs to interact with SQLite database. In my previous article  ...

open_in_new View open_in_new Python Programming

local_offer Java local_offer python local_offer SQL Server

visibility 4
thumb_up 0
access_time 7 hours ago

In my previous article  Connect to SQL Server via JayDeBeApi in Python , I showed examples of u...

open_in_new View open_in_new Python Programming

Pandas DataFrame Plot - Scatter and Hexbin Chart

local_offer plot local_offer pandas local_offer jupyter-notebook local_offer python

visibility 11
thumb_up 0
access_time 6 days ago

 In this article I'm going to show you some examples about plotting scatter and hexbin chart with Pandas DataFrame. I'm using Jupyter Notebook as IDE/code execution environment.  Hexbin chart &nbs...

open_in_new View open_in_new Code snippets

info About author

Dark theme mode

Dark theme mode is available on Kontext.

Learn more arrow_forward
Kontext Column

Kontext Column

Created for everyone to publish data, programming and cloud related articles. Follow three steps to create your columns.

Learn more arrow_forward
info Follow us on Twitter to get the latest article updates. Follow us