Scala: Convert List to Spark Data Frame

Raymond Tang, 12/13/2020

In Spark 2.0 and later, SparkSession can create a Spark data frame directly with the createDataFrame function.

On this page, I am going to show you how to convert the following Scala array of lists to a Spark data frame:

val data = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", 150, "This is category C"))
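Because each inner list mixes String and Int values, Scala infers the element type as List[Any]. The following is a minimal sketch of what that means in practice; the names typedData and counts are hypothetical and used only for illustration.

```scala
// A copy of the data above with the inferred type written out explicitly.
// `typedData` is a hypothetical name used only in this sketch.
val typedData: Array[List[Any]] = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", 150, "This is category C"))

// Every field is statically typed as Any, so the concrete type
// has to be recovered at runtime, for example by pattern matching.
val counts: Array[Int] = typedData.map(_(1)).collect { case n: Int => n }

println(counts.mkString(", "))  // prints: 100, 120, 150
```

Since the column types are not visible in the static type of the collection, it seems reasonable to wrap each list in a Row and describe the columns with an explicit schema, as the following sections do.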

Info: This is a structured version of the article Convert List to Spark Data Frame in Scala / Spark.

Import types

First, let’s import the data types we need for the data frame.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

Define the schema

Define a schema for the data frame based on the structure of the Scala list: a string category, an integer count, and a string description.

// Create a schema for the dataframe
val schema = StructType(
    StructField("Category", StringType, true) ::
    StructField("Count", IntegerType, true) ::
    StructField("Description", StringType, true) :: Nil)
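As an alternative sketch, the same three columns can be declared incrementally with the add method of StructType instead of a list of StructField values. The name schemaViaAdd is hypothetical; the column names, types, and nullability mirror the schema above.

```scala
import org.apache.spark.sql.types._

// Build the schema one column at a time.
// `schemaViaAdd` is a hypothetical name used only in this sketch.
val schemaViaAdd = new StructType()
  .add("Category", StringType, nullable = true)
  .add("Count", IntegerType, nullable = true)
  .add("Description", StringType, nullable = true)

// Print the schema as an indented tree to check the column definitions.
schemaViaAdd.printTreeString()
```

Which form to use is mostly a matter of taste; the chained form can be easier to read when a schema has many columns.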

Convert the list to data frame

The list is first converted to a list of Row objects, then to an RDD with the parallelize function. The RDD and the schema are then passed to createDataFrame:

// Convert list to List of Row
val rows = data.map(t => Row(t(0), t(1), t(2))).toList

// Create RDD
val rdd = spark.sparkContext.parallelize(rows)

// Create data frame
val df = spark.createDataFrame(rdd, schema)
print(df.schema)
df.show()
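For small in-memory data, the intermediate RDD is not strictly required: SparkSession also has a createDataFrame overload that accepts a java.util.List of Row objects together with a schema. The following is a self-contained sketch under some assumptions: it runs in local mode, and the names session, listData, listSchema, and dfFromList, as well as the application name, are hypothetical. It also uses Row.fromSeq in place of indexing each element by position.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

// Assumption: a local session for experimentation. In spark-shell,
// getOrCreate() returns the session that already exists.
val session = SparkSession.builder()
  .appName("list-to-dataframe")  // hypothetical application name
  .master("local[*]")
  .getOrCreate()

val listData = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", 150, "This is category C"))

val listSchema = StructType(Seq(
  StructField("Category", StringType, true),
  StructField("Count", IntegerType, true),
  StructField("Description", StringType, true)))

// Row.fromSeq builds a Row from all elements of each inner list;
// asJava exposes the Scala List[Row] as a java.util.List[Row].
val javaRows = listData.map(t => Row.fromSeq(t)).toList.asJava

// Create the data frame directly from the list of rows, without an RDD.
val dfFromList = session.createDataFrame(javaRows, listSchema)
dfFromList.show()
```

The RDD route shown above remains the better fit when the rows already live in an RDD or are too large to hold on the driver.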

Sample output

scala> print(df.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true))
scala> df.show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+

Reference

Refer to the Scala API documentation for more information about the SparkSession class:

Spark 3.0.1 ScalaDoc - org.apache.spark.sql.SparkSession

