In Spark 2.0 and later, SparkSession can create a Spark data frame directly with the createDataFrame function.
On this page, I show how to convert the following Scala collection (an Array of Lists) to a Spark data frame:
val data = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", 150, "This is category C"))
Import types
First, let’s import the data types we need for the data frame.
import org.apache.spark.sql._
import org.apache.spark.sql.types._
Define the schema
Define a schema for the data frame based on the structure of the Scala list.
// Create a schema for the dataframe
val schema = StructType(
  StructField("Category", StringType, true) ::
  StructField("Count", IntegerType, true) ::
  StructField("Description", StringType, true) :: Nil)
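The same three nullable columns can also be declared with StructType's add method instead of the :: list syntax. The following is a minimal sketch, assuming it runs in spark-shell or anywhere the Spark SQL library is on the classpath. The name schemaViaAdd is only illustrative and is not used elsewhere on this page.

```scala
import org.apache.spark.sql.types._

// Illustrative alternative: build the schema one column at a time.
// Each add call returns a new StructType with the extra field appended.
val schemaViaAdd = new StructType()
  .add("Category", StringType, nullable = true)
  .add("Count", IntegerType, nullable = true)
  .add("Description", StringType, nullable = true)

// Print an indented tree view to check names, types and nullability.
schemaViaAdd.printTreeString()
```

Either form yields a StructType that can be passed to createDataFrame, so the choice is mostly a matter of readability.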
Convert the list to data frame
Convert each inner list to a Row, turn the rows into an RDD with the parallelize function, and then pass the RDD and the schema to createDataFrame:
// Convert list to List of Row
val rows = data.map(t => Row(t(0), t(1), t(2))).toList

// Create RDD
val rdd = spark.sparkContext.parallelize(rows)

// Create data frame
val df = spark.createDataFrame(rdd, schema)

print(df.schema)
df.show()
Sample output
scala> print(df.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true))

scala> df.show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+
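When the values are known at compile time, a shorter route may be to hold them as typed tuples and call toDF, which infers the schema from the tuple types. This is a sketch of a different technique from the Row and schema approach above, not a replacement for it. It assumes a local SparkSession (in spark-shell, getOrCreate simply returns the existing session), and the names typedData and typedDf are only illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable; spark-shell already provides one.
val spark = SparkSession.builder()
  .appName("list-to-dataframe")
  .master("local[*]")
  .getOrCreate()

// Brings the toDF conversion for local Scala collections into scope.
import spark.implicits._

// The same three records, held as typed tuples rather than List[Any].
val typedData = Seq(
  ("Category A", 100, "This is category A"),
  ("Category B", 120, "This is category B"),
  ("Category C", 150, "This is category C"))

// Column names are supplied explicitly; types are inferred from the tuples.
val typedDf = typedData.toDF("Category", "Count", "Description")

typedDf.printSchema()
typedDf.show()
```

One difference to keep in mind: because Count is inferred from a Scala Int, that column is marked as not nullable, whereas the explicit schema above declares every column as nullable.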
Reference
Refer to the Scala API documentation for more information about the SparkSession class.