Convert List to Spark Data Frame in Python / Spark

In Spark, the SparkContext.parallelize function can be used to convert a list of objects to an RDD, and the RDD can then be converted to a DataFrame object through SparkSession.createDataFrame.

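The PySpark example in this article converts a Python list of tuples to an RDD and then to a data frame with an explicit schema. Displayed with df.show(), the resulting data frame looks like this:

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+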

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, DecimalType
from decimal import Decimal

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [('Category A', Decimal(100), "This is category A"),
        ('Category B', Decimal(120), "This is category B"),
        ('Category C', Decimal(150), "This is category C")]

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', DecimalType(), True),
    StructField('Description', StringType(), True)
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)

# Print the schema and display the rows
print(df.schema)
df.show()

Last modified by Raymond 2 years ago.