Convert Python Dictionary List to PySpark DataFrame

By Raymond · 2019-12-25 · English

This article shows how to convert a Python dictionary list to a DataFrame in Spark using Python.

Example dictionary list

data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

The above dictionary list will be used as the input.

Solution 1 - Infer schema from dict

In Spark 2.x, the schema can be inferred directly from a dictionary. The following code snippet creates the DataFrame using the SparkSession.createDataFrame function.

Code snippet

from pyspark.sql import SparkSession

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

# Create data frame
df = spark.createDataFrame(data)
print(df.schema)
df.show()

Output

The following is the output from the above PySpark script. 

session.py:340: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
StructType(List(StructField(Category,StringType,true),StructField(ID,LongType,true),StructField(Value,DoubleType,true)))
+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1|  12.4|
|Category B|  2|  30.1|
|Category C|  3|100.01|
+----------+---+------+
The script creates a DataFrame with the following inferred schema:
StructType(List(StructField(Category,StringType,true),StructField(ID,LongType,true),StructField(Value,DoubleType,true)))

However, there is one warning:

Warning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead

Solution 2 - Use pyspark.sql.Row

As the warning message suggests in solution 1, we are going to use pyspark.sql.Row in this solution.

Code snippet

from pyspark.sql import SparkSession, Row

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

# Create data frame
df = spark.createDataFrame([Row(**i) for i in data])
print(df.schema)
df.show()
In this code snippet, we use pyspark.sql.Row to construct a row from each dictionary item, using the ** operator to unpack each dictionary into keyword arguments.

The output is the same as Solution 1.
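The ** unpacking itself is plain Python and has nothing Spark-specific about it. Below is a minimal sketch using a hypothetical make_row function in place of pyspark.sql.Row, just to show how a dictionary's keys become keyword arguments:

```python
# make_row stands in for pyspark.sql.Row here: any callable that accepts
# keyword arguments can be fed a dictionary via ** unpacking.
def make_row(Category, ID, Value):
    return (Category, ID, Value)

item = {"Category": "Category A", "ID": 1, "Value": 12.40}

# The two calls below are equivalent: ** expands the dictionary's
# key-value pairs into keyword arguments.
unpacked = make_row(**item)
explicit = make_row(Category="Category A", ID=1, Value=12.40)

print(unpacked == explicit)  # True
```

This is why [Row(**i) for i in data] works: each dictionary is expanded into the key=val form that Row expects.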

Solution 3 - Explicit schema

Of course, we can also define the schema for the DataFrame explicitly.

In the following code snippet, we define the schema based on the data types in the dictionary:
schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(scale=2), True)
])

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DecimalType
from decimal import Decimal

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List; Decimal values are constructed from strings to avoid
# binary floating-point representation errors
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal('12.40')},
        {"Category": 'Category B', "ID": 2, "Value": Decimal('30.10')},
        {"Category": 'Category C', "ID": 3, "Value": Decimal('100.01')}
        ]

schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(scale=2), True)
])

# Create data frame
df = spark.createDataFrame(data, schema)
print(df.schema)
df.show()
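One detail worth noting when working with DecimalType: constructing a Decimal from a float literal captures the float's binary representation error, so building values from strings is the safer pattern. A quick stdlib-only check illustrates the difference:

```python
from decimal import Decimal

# A float literal like 12.40 cannot be represented exactly in binary,
# so Decimal(12.40) inherits that representation error, while
# Decimal('12.40') is exact.
from_float = Decimal(12.40)
from_string = Decimal('12.40')

print(from_float == from_string)   # False
print(from_string)                 # 12.40
```

This is why the data list above builds its values with Decimal('12.40') rather than Decimal(12.40).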

There are many different ways to achieve the same goal. Let me know if you have other options. 

Last modified by Raymond 3 years ago.

Comments
Raymond · 3 years ago · #339

Correct, that is more about Python syntax than something special about Spark.

I feel like explicitly specifying the attributes for each Row can make the code easier to read sometimes.

Swapnil · 3 years ago · #338

Like here: I am reading a list where each list item is a CSV line:

rdd_f_n_cnt=['/usr/lcoal/app/,100,s3-xyz,emp.txt','/usr/lcoal/app/,100,s3-xyz,emp.txt']

and putting it into key=val format:

rdd_f_n_cnt_2 = rdd_f_n_cnt.map(lambda l: Row(path=l.split(",")[0], file_count=l.split(",")[1], folder_name=l.split(",")[2], file_name=l.split(",")[3]))

Indirectly you are doing the same with **.

Swapnil · 3 years ago · #337

Ohh, got it. I thought it needed only this key=val format:

Row(Category='Category A', ID=1, Value=1)

Swapnil · 3 years ago · #336

Hi Raymond,

Wonderful article. I was just confused by the line below:

df = spark.createDataFrame([Row(**i) for i in data])

I assumed the Row class needs input like:

row = Row(Category='Category A', ID=1, Value=1)

So how is this getting translated here? Or is it that when we give input as key, val, it understands and creates the schema correctly?

Raymond · 3 years ago · #335

Hi Swapnil,

Is this a question or a comment?

If I understand your question correctly, you were asking about the following:

**i

** (double asterisk) denotes dictionary unpacking. It unpacks the dictionary contents as keyword parameters for Row class construction.
