In Spark 2.0 and later, SparkSession can create a Spark data frame directly using the createDataFrame function.
On this page, I am going to show you how to convert the following Scala list to a Spark data frame:
val data = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", 150, "This is category C"))
Import types
First, let’s import the data types we need for the data frame.
import org.apache.spark.sql._
import org.apache.spark.sql.types._
Define the schema
Define a schema for the data frame based on the structure of the Scala list.
// Create a schema for the data frame
val schema = StructType(
  StructField("Category", StringType, true) ::
  StructField("Count", IntegerType, true) ::
  StructField("Description", StringType, true) :: Nil)
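As a side note, the same schema can be built with StructType's add method instead of chaining StructField cons cells. This is a sketch equivalent to the schema above, not a different schema:

```scala
import org.apache.spark.sql.types._

// Equivalent schema built incrementally with add;
// each call appends one nullable field.
val schema2 = new StructType()
  .add("Category", StringType, nullable = true)
  .add("Count", IntegerType, nullable = true)
  .add("Description", StringType, nullable = true)
```

Either form can be passed to createDataFrame; pick whichever reads better to you.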
Convert the list to data frame
Each element of the array is first mapped to a Row object; the resulting list is then converted to an RDD through the parallelize function:
// Convert the array to a list of Rows
val rows = data.map(t => Row(t(0), t(1), t(2))).toList
// Create RDD
val rdd = spark.sparkContext.parallelize(rows)
// Create data frame
val df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()
Sample output
scala> print(df.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true))

scala> df.show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+
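If the column types are fixed and known up front, there is a shorter route that skips both the RDD and the explicit schema: represent the rows as a Seq of tuples and call toDF after importing spark.implicits._. The column types are then inferred from the tuple. A minimal sketch (in spark-shell the spark session already exists, so the builder lines can be dropped):

```scala
import org.apache.spark.sql.SparkSession

// Outside spark-shell, create a local session first
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ListToDataFrame")
  .getOrCreate()
import spark.implicits._

// Build the data frame directly from a Seq of tuples;
// column types are inferred, so no schema is needed
val df2 = Seq(
  ("Category A", 100, "This is category A"),
  ("Category B", 120, "This is category B"),
  ("Category C", 150, "This is category C")
).toDF("Category", "Count", "Description")

df2.show()
```

Note that inferred columns may differ slightly from the hand-written schema (for example, the Int column is inferred as non-nullable), so use the explicit-schema approach when nullability matters.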
Reference
Refer to the Scala API documentation for more information about the SparkSession class.