Convert List to Spark Data Frame in Python / Spark

In Spark, the SparkContext.parallelize function converts a list of objects to an RDD, and the RDD can then be converted to a DataFrame through SparkSession.createDataFrame.

The PySpark example below converts a Python list of tuples to a DataFrame with an explicit schema. Calling df.show() on the result prints:

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, DecimalType
from decimal import Decimal

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [('Category A', Decimal(100), "This is category A"),
        ('Category B', Decimal(120), "This is category B"),
        ('Category C', Decimal(150), "This is category C")]

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', DecimalType(), True),
    StructField('Description', StringType(), True)
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create the DataFrame from the RDD using the explicit schema
df = spark.createDataFrame(rdd, schema)

# Print the schema, then the rows
print(df.schema)
df.show()
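
For a small in-memory list, the intermediate RDD is not strictly required, because createDataFrame also accepts a Python list directly. The following is a minimal sketch that reuses the same rows and schema. The helper name list_to_dataframe is hypothetical, and PySpark is imported inside the function only so that the sample data can be inspected without Spark installed.

```python
from decimal import Decimal

# Same sample rows as the snippet above: (Category, Count, Description).
data = [
    ("Category A", Decimal(100), "This is category A"),
    ("Category B", Decimal(120), "This is category B"),
    ("Category C", Decimal(150), "This is category C"),
]


def list_to_dataframe(spark, rows):
    """Build a DataFrame straight from a Python list, skipping the RDD step.

    `list_to_dataframe` is a hypothetical helper used for illustration only.
    """
    # Imported here so this module loads even where PySpark is not installed.
    from pyspark.sql.types import StructType, StructField, StringType, DecimalType

    schema = StructType([
        StructField("Category", StringType(), True),
        StructField("Count", DecimalType(), True),
        StructField("Description", StringType(), True),
    ])
    # createDataFrame accepts a list of tuples as well as an RDD.
    return spark.createDataFrame(rows, schema)
```

With the Spark session created in the snippet above, list_to_dataframe(spark, data).show() is expected to print the same three-row table.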

Last modified by Raymond 2y ago.