Convert Python Dictionary List to PySpark DataFrame


This article shows how to convert a Python dictionary list to a DataFrame in Spark using Python.

Example dictionary list

data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

The above dictionary list will be used as the input.

Solution 1 - Infer schema from dict

In Spark 2.x, the schema can be inferred directly from a list of dictionaries. The following code snippet creates the DataFrame using the SparkSession.createDataFrame function.

Code snippet

from pyspark.sql import SparkSession

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

# Create data frame
df = spark.createDataFrame(data)
print(df.schema)
df.show()

Output

The following is the output from the above PySpark script. 

session.py:340: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
StructType(List(StructField(Category,StringType,true),StructField(ID,LongType,true),StructField(Value,DoubleType,true)))
+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1|  12.4|
|Category B|  2|  30.1|
|Category C|  3|100.01|
+----------+---+------+

The script created a DataFrame with the following inferred schema:

StructType(List(StructField(Category,StringType,true),StructField(ID,LongType,true),StructField(Value,DoubleType,true)))

However, the script also raises one warning:

Warning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
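
The types in the inferred schema follow from the Python types of the dictionary values: in the output above, the str values became StringType, the int values became LongType, and the float values became DoubleType. The following is a minimal pure-Python sketch of that mapping, for illustration only. The names PYTHON_TO_SPARK and guess_spark_type are hypothetical and are not part of the PySpark API; Spark's own inference covers many more types.

```python
# Illustration only: mirrors the mapping visible in the output above.
# PYTHON_TO_SPARK and guess_spark_type are hypothetical names, not PySpark APIs.
PYTHON_TO_SPARK = {
    str: "StringType",
    int: "LongType",
    float: "DoubleType",
}


def guess_spark_type(value):
    """Return the Spark type name that matches a Python value in this sketch."""
    return PYTHON_TO_SPARK[type(value)]


row = {"Category": "Category A", "ID": 1, "Value": 12.40}
inferred = {key: guess_spark_type(value) for key, value in row.items()}
print(inferred)
```

This explains why ID appears as LongType rather than IntegerType when the schema is inferred.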

Solution 2 - Use pyspark.sql.Row

As the warning message in Solution 1 suggests, this solution uses pyspark.sql.Row.

Code snippet

from pyspark.sql import SparkSession, Row

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

# Create data frame
df = spark.createDataFrame([Row(**i) for i in data])
print(df.schema)
df.show()
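
This code snippet uses pyspark.sql.Row to convert each dictionary item into a row. The ** operator unpacks the key-value pairs of each dictionary as keyword arguments to Row.
The output is the same as in Solution 1, except that the deprecation warning no longer appears.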
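
To see what the ** unpacking does on its own, here is a small sketch that runs without Spark. It assumes a namedtuple called Record as a stand-in for pyspark.sql.Row; Record is a hypothetical name used only for this illustration, and a namedtuple is not a full substitute for Row.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark (hypothetical name).
Record = namedtuple("Record", ["Category", "ID", "Value"])

data = [{"Category": "Category A", "ID": 1, "Value": 12.40},
        {"Category": "Category B", "ID": 2, "Value": 30.10},
        {"Category": "Category C", "ID": 3, "Value": 100.01}]

# Record(**item) is equivalent to
# Record(Category=item["Category"], ID=item["ID"], Value=item["Value"])
records = [Record(**item) for item in data]

print(records[0])
# Record(Category='Category A', ID=1, Value=12.4)
```

Each dictionary key becomes a keyword argument name, so every dictionary in the list needs the same keys for the unpacking to succeed.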

Solution 3 - Explicit schema

We can also define the schema explicitly. The following schema is based on the data types in the dictionary; DecimalType(scale=2) keeps the default precision of 10, so Value is stored as decimal(10,2):
schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(scale=2), True)
])

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DecimalType
from decimal import Decimal

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List (Decimal values are built from strings so the digits are exact)
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal("12.40")},
        {"Category": 'Category B', "ID": 2, "Value": Decimal("30.10")},
        {"Category": 'Category C', "ID": 3, "Value": Decimal("100.01")}
        ]

schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(scale=2), True)
])

# Create data frame
df = spark.createDataFrame(data, schema)
print(df.schema)
df.show()
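
One detail worth knowing when preparing values for a DecimalType column: a Decimal built from a float literal carries the float's binary rounding error, while a Decimal built from a string keeps exactly the digits that were written. The sketch below uses only Python's standard decimal module and does not involve Spark; the variable names are chosen for illustration.

```python
from decimal import Decimal

# Built from a float: inherits the float's binary approximation of 12.40.
from_float = Decimal(12.40)

# Built from a string: holds exactly the digits that were written.
from_text = Decimal("12.40")

print(from_float == from_text)  # False
print(from_text)                # 12.40

# Rounding to two decimal places recovers the intended value.
rounded = from_float.quantize(Decimal("0.01"))
print(rounded == from_text)     # True
```

Building the values from strings keeps the input digits exact before the column's scale is applied.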

There are many different ways to achieve the same goal. Let me know if you have other options. 
