Convert List to Spark Data Frame in Scala / Spark


In Spark, the SparkContext.parallelize function can be used to convert a list of objects to an RDD, and the RDD can then be converted to a DataFrame through SparkSession.createDataFrame.

Similar to PySpark, we can use the SparkContext.parallelize function to create the RDD; alternatively, the SparkContext.makeRDD function also converts a list to an RDD.
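The makeRDD alternative mentioned above can be sketched as follows. This is a minimal example rather than a definitive implementation: the object name MakeRddExample, the helper listToDataFrame and the application name are hypothetical, and the data and schema simply mirror the ones used in this article. It is expected to build the same three-row DataFrame as the parallelize version.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical object name, used only for this illustration.
object MakeRddExample {

  // Builds a DataFrame from a local list, using makeRDD instead of parallelize.
  def listToDataFrame(spark: SparkSession): DataFrame = {
    // Local list of rows (same sample data as in this article).
    val data = List(
      Row("Category A", 100, "This is category A"),
      Row("Category B", 120, "This is category B"),
      Row("Category C", 150, "This is category C")
    )

    // Schema describing the three columns of each Row.
    val schema = StructType(List(
      StructField("Category", StringType, true),
      StructField("Count", IntegerType, true),
      StructField("Description", StringType, true)
    ))

    // makeRDD distributes the local collection as an RDD[Row].
    val rdd = spark.sparkContext.makeRDD(data)

    // Attach the schema to the RDD to obtain a DataFrame.
    spark.createDataFrame(rdd, schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session; the app name is arbitrary.
    val spark = SparkSession.builder
      .appName("Scala Example - makeRDD")
      .master("local")
      .getOrCreate()

    listToDataFrame(spark).show()

    spark.stop()
  }
}
```

For a plain list like this one, swapping makeRDD for parallelize should not change the resulting DataFrame, so the choice between them is largely a matter of preference.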

Calling df.show() in the code snippet below produces the following output:

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+

Code snippet

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val appName = "Scala Example - List to Spark Data Frame"
val master = "local"

/* Create Spark session. Hive support is not enabled here; add .enableHiveSupport() to the builder if it is needed. */
val spark = SparkSession.builder.appName(appName).master(master).getOrCreate()

/* List */
val data = List(
  Row("Category A", 100, "This is category A"),
  Row("Category B", 120, "This is category B"),
  Row("Category C", 150, "This is category C")
)

val schema = StructType(List(
  StructField("Category", StringType, true),
  StructField("Count", IntegerType, true),
  StructField("Description", StringType, true)
))

/* Convert list to RDD */
val rdd = spark.sparkContext.parallelize(data)

/* Create data frame */
val df = spark.createDataFrame(rdd, schema)
/* println ends the line so the table printed by show() starts on a new line */
println(df.schema)
df.show()

Last modified by Administrator 2 years ago.
