Convert Python Dictionary List to PySpark DataFrame
This article shows how to convert a list of Python dictionaries to a Spark DataFrame using PySpark.
Example dictionary list
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}]
The above dictionary list will be used as the input.
Solution 1 - Infer schema from dict
In Spark 2.x, the schema can be inferred directly from the dictionaries. The following code snippet creates the DataFrame from the dictionary list using the SparkSession.createDataFrame function.
Code snippet
from pyspark.sql import SparkSession

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}]

# Create data frame
df = spark.createDataFrame(data)
print(df.schema)
df.show()
Output
The following is the output from the above PySpark script.
session.py:340: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
warnings.warn("inferring schema from dict is deprecated,"
StructType(List(StructField(Category,StringType,true),StructField(ID,LongType,true),StructField(Value,DoubleType,true)))
+----------+---+------+
| Category| ID| Value|
+----------+---+------+
|Category A| 1| 12.4|
|Category B| 2| 30.1|
|Category C| 3|100.01|
+----------+---+------+
However, there is one warning:
Warning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
Solution 2 - Use pyspark.sql.Row
As the warning message in Solution 1 suggests, this solution uses pyspark.sql.Row instead.
Code snippet
from pyspark.sql import SparkSession, Row

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}]

# Create data frame
df = spark.createDataFrame([Row(**i) for i in data])
print(df.schema)
df.show()
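The key piece is Row(**i): the double asterisk is Python's dictionary unpacking, which turns each dictionary's key/value pairs into keyword arguments for the Row constructor. A minimal sketch of the equivalence:

from pyspark.sql import Row

d = {"Category": 'Category A', "ID": 1, "Value": 12.40}

# Row(**d) unpacks the dictionary's key/value pairs into keyword arguments,
# so the two constructions below are equivalent:
row1 = Row(**d)
row2 = Row(Category='Category A', ID=1, Value=12.40)
print(row1 == row2)  # True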
Solution 3 - Explicit schema
Solutions 1 and 2 let Spark infer the column types. We can also explicitly specify the schema to control the data types and nullability:

schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(scale=2), True)
])
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DecimalType
from decimal import Decimal

appName = "Python Example - PySpark Parsing Dictionary as DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
# Construct Decimal values from strings so they stay exact;
# Decimal(12.40) would carry binary floating-point error.
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal('12.40')},
        {"Category": 'Category B', "ID": 2, "Value": Decimal('30.10')},
        {"Category": 'Category C', "ID": 3, "Value": Decimal('100.01')}]

schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(scale=2), True)
])

# Create data frame
df = spark.createDataFrame(data, schema)
print(df.schema)
df.show()
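As a side note, DecimalType(scale=2) defaults to precision 10, i.e. decimal(10,2). Since Spark 2.3, createDataFrame also accepts a schema expressed as a DDL-formatted string, which is more concise; a sketch reusing the same data list (columns declared this way are all nullable):

# Equivalent schema expressed as a DDL string (all columns nullable)
df = spark.createDataFrame(data, "Category string, ID int, Value decimal(10,2)")
df.show()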
There are many different ways to achieve the same goal. Let me know if you have other options.
Swapnil (4 years ago)
Like here: I am reading a list where each item is a CSV line
rdd_f_n_cnt = ['/usr/lcoal/app/,100,s3-xyz,emp.txt', '/usr/lcoal/app/,100,s3-xyz,emp.txt']
and putting it into key=val format:
rdd_f_n_cnt_2 = rdd_f_n_cnt.map(lambda l: Row(path=l.split(",")[0],
                                              file_count=l.split(",")[1],
                                              folder_name=l.split(",")[2],
                                              file_name=l.split(",")[3]))
Indirectly you are doing the same with **.
Raymond (4 years ago)
Hi Swapnil,
Is this a question or a comment?
If I understand correctly, you were asking about the following:
**i
** (double asterisk) denotes dictionary unpacking: it unpacks the dictionary's key/value pairs as keyword arguments for the Row constructor.
Swapnil (4 years ago)
Ohh, got it.
I thought it needed only this key=val format:
Row(Category='Category A', ID=1, Value=1)
Swapnil (4 years ago)
Hi Raymond,
Wonderful article! I was just confused by the line below:
df = spark.createDataFrame([Row(**i) for i in data])
I assumed the Row class needs input like
row = Row(Category='Category A', ID=1, Value=1)
so how is that getting translated here? Or is it that when we pass input as key=val pairs, it understands and creates the schema correctly?
Raymond (4 years ago)
Correct, that is more about Python syntax than anything special to Spark.
I feel that explicitly specifying the attributes for each Row can make the code easier to read sometimes.
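For instance, here is a minimal runnable sketch of the pattern Swapnil describes (names and paths are illustrative); note that the Python list must be parallelized into an RDD before map can be called on it:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("RowFromCsvLines").master("local").getOrCreate()

lines = ['/usr/local/app/,100,s3-xyz,emp.txt',
         '/usr/local/app/,100,s3-xyz,emp.txt']

# Parallelize the list first; a plain Python list has no map method in Spark.
rdd = spark.sparkContext.parallelize(lines)

# Split each CSV line once, then pass the fields to an explicit Row.
rows = rdd.map(lambda l: l.split(",")).map(
    lambda p: Row(path=p[0], file_count=int(p[1]), folder_name=p[2], file_name=p[3]))

df = spark.createDataFrame(rows)
df.show()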