Raymond Raymond | Spark & PySpark

Scala: Convert List to Spark Data Frame

2020-12-13

In Spark 2.0 and later, SparkSession can create a Spark data frame directly through its createDataFrame function.

On this page, I am going to show you how to convert the following Scala list (an Array of List values) to a Spark data frame:

val data =
  Array(List("Category A", 100, "This is category A"),
        List("Category B", 120, "This is category B"),
        List("Category C", 150, "This is category C"))

Info: This page is a structured write-up of the article Convert List to Spark Data Frame in Scala / Spark.

Import types

First, let’s import the data types we need for the data frame.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

Define the schema

Define a schema for the data frame based on the structure of the Scala list.

// Create a schema for the dataframe
val schema =
  StructType(StructField("Category", StringType, true) ::
    StructField("Count", IntegerType, true) ::
    StructField("Description", StringType, true) :: Nil)
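
The same schema can also be assembled field by field. The following is a minimal sketch using the add method of StructType; the name schemaAlt is hypothetical and only serves to avoid clashing with the schema value above.

```scala
import org.apache.spark.sql.types._

// Build the same three-column schema incrementally with StructType.add.
// schemaAlt is an illustrative name, not part of the original article.
val schemaAlt: StructType = new StructType()
  .add("Category", StringType, nullable = true)
  .add("Count", IntegerType, nullable = true)
  .add("Description", StringType, nullable = true)

// Print the schema as an indented tree for a quick visual check
schemaAlt.printTreeString()
```

Either form can be passed to createDataFrame; choosing between them is mostly a matter of readability.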

Convert the list to data frame

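The list can be converted to an RDD through the parallelize function. Each inner list is first mapped to a Row, so that the resulting RDD of rows can be combined with the schema in createDataFrame: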

// Convert list to List of Row
val rows = data.map(t=>Row(t(0),t(1),t(2))).toList

// Create RDD
val rdd = spark.sparkContext.parallelize(rows)

// Create data frame
val df = spark.createDataFrame(rdd,schema)
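
As an alternative to building Row objects and an explicit schema, a list of tuples can be converted with toDF. The sketch below is not the approach used in this article; it assumes a local SparkSession (in spark-shell, getOrCreate reuses the existing one), and the names typedData and dfTyped as well as the application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .master("local[*]")            // assumption: local mode for a standalone run
  .appName("list-to-df-sketch")  // hypothetical application name
  .getOrCreate()

// Brings toDF into scope for local Scala collections
import spark.implicits._

// Tuples keep each column's static type (String, Int, String)
val typedData = Seq(
  ("Category A", 100, "This is category A"),
  ("Category B", 120, "This is category B"),
  ("Category C", 150, "This is category C")
)

// Column names are passed to toDF; column types are inferred from the tuple
val dfTyped = typedData.toDF("Category", "Count", "Description")

dfTyped.printSchema()
dfTyped.show()
```

With this approach the column types come from the tuple's element types rather than from a StructType, so the Count column is inferred from Scala's Int. The explicit-schema route shown above remains the better fit when nullability or types need to be controlled precisely.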

Sample output

scala> print(df.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true))
scala> df.show()
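+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+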


Refer to the Scala API documentation for more information about the SparkSession class:

Spark 3.0.1 ScalaDoc - org.apache.spark.sql.SparkSession
