In Spark, SparkContext.parallelize function can be used to convert Python list to RDD and then RDD can be converted to DataFrame object. The following sample code is based on Spark 2.x.

In this page, I am going to show you how to convert the following list to a data frame:

data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

Import types

First, let’s import the data types we need for the data frame.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

We imported StringType and IntegerType because the sample data have three attributes, two are strings and one is integer.

Create Spark session

Create Spark session using the following code:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

Define the schema

Let’s now define a schema for the data frame based on the structure of the Python list.

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])

Convert the list to data frame

The list can be converted to RDD through parallelize function:

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()

Complete script

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()

Sample output

StructType(List(StructField(Category,StringType,true),StructField(Count,IntegerType,true),StructField(Description,StringType,true)))
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+

Summary

For Python objects, we can convert them to RDD first and then use SparkSession.createDataFrame function to create the data frame based on the RDD.

The following data types are supported for defining the schema:

  • NullType
  • StringType
  • BinaryType
  • BooleanType
  • DateType
  • TimestampType
  • DecimalType
  • DoubleType
  • FloatType
  • ByteType
  • IntegerType
  • LongType
  • ShortType
  • ArrayType
  • MapType

For more information, please refer to the official API documentation pyspark.sql module.

info Last modified by Raymond at 3 months ago * This page is subject to Site terms.

More from Kontext

local_offer teradata local_offer python

visibility 198
thumb_up 0
access_time 1 month ago

Pandas is commonly used by Python users to perform data operations. In many scenarios, the results need to be saved to a storage like Teradata. This article shows you how to do that easily using JayDeBeApi or  ...

open_in_new View open_in_new Spark + PySpark

local_offer python

visibility 59
thumb_up 0
access_time 2 months ago

CSV is a common data format used in many applications. It's also a common task for data workers to read and parse CSV and then save it into another storage such as RDBMS (Teradata, SQL Server, MySQL). In my previous article  ...

open_in_new View open_in_new Python Programming

local_offer teradata local_offer python local_offer Java

visibility 126
thumb_up 0
access_time 2 months ago

Python JayDeBeApi module allows you to connect from Python to Teradata databases using Java JDBC drivers. In article Connect to Teradata database through Python , I showed ho...

open_in_new View open_in_new Python Programming

local_offer sqlite local_offer python local_offer Java

visibility 46
thumb_up 0
access_time 2 months ago

To read data from SQLite database in Python, you can use the built-in sqlite3 package . Another approach is to use SQLite JDBC driver via  ...

open_in_new View open_in_new Python Programming

info About author

Dark theme mode

Dark theme mode is available on Kontext.

Learn more arrow_forward

Kontext Column

Created for everyone to publish data, programming and cloud related articles. Follow three steps to create your columns.


Learn more arrow_forward