Spark Dataset and DataFrame
Spark Dataset
Spark Dataset was introduced in Spark 1.6 and brings the benefits of Spark SQL to RDDs. It is a distributed collection of data with a strongly typed API.
The Dataset API is available in Scala and Java. It is not supported in Python or R due to the dynamic nature of those languages. However, because of that same dynamic nature, many of the Dataset API's benefits are already available in those languages; for example, you can access a row's fields by name through the DataFrame API in Python or R.
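As a quick illustration, a typed Dataset can be created directly from a local collection. The following is a minimal sketch, assuming a spark-shell session where a SparkSession named spark is already available (the Point case class is introduced here purely for illustration):

import spark.implicits._

// A strongly typed record; the Dataset carries this type at compile time.
case class Point(x: Int, y: Int)

// toDS() converts a local Seq into a distributed Dataset[Point].
val points = Seq(Point(1, 2), Point(3, 4)).toDS()

// Typed transformation: the lambda receives Point objects, not generic Rows.
points.map(p => p.x + p.y).show()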
Spark DataFrame
Spark DataFrame is a Dataset of Rows with named columns. It is like a table in a relational database.
In Java, a Spark DataFrame is represented as a Dataset of Row type (i.e. Dataset&lt;Row&gt;). In Scala, DataFrame is simply a type alias for Dataset[Row]. In Python and R, the DataFrame API provides similar functionality.
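A small sketch of this relationship in Scala (again assuming an active SparkSession named spark): because DataFrame is only a type alias, a DataFrame value can be assigned to a Dataset[Row] without any conversion.

import org.apache.spark.sql.{DataFrame, Dataset, Row}

val df: DataFrame = spark.range(3).toDF("id")

// Compiles as-is: DataFrame and Dataset[Row] are the same type in Scala.
val rows: Dataset[Row] = df
rows.show()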
Spark Dataset example in Scala
The following code snippet provides an example of creating a Dataset using Scala. The code first creates a case class named Person; it then creates a Dataset object from a CSV file. Note that import spark.implicits._ must come first, as it brings in the encoders and the $ column syntax used below.

import spark.implicits._

case class Person(FirstName: String, LastName: String)

val ds = spark.read.format("csv")
  .option("header", "true")
  .load("file:///F:/big-data/person.csv")
  .as[Person]

ds.select($"FirstName").show()
ds.where($"FirstName" === "Raymond").show()
ds.filter($"FirstName" === "Raymond").show()
The file content looks like the following:
FirstName,LastName
"Raymond","Tang"
"Kontext","Admin"
Output:
scala> ds.filter($"FirstName" === "Raymond").show()
+---------+--------+
|FirstName|LastName|
+---------+--------+
| Raymond| Tang|
+---------+--------+
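Because ds is a typed Dataset[Person], column expressions are not the only option: typed transformations take plain Scala lambdas and are checked at compile time. A short sketch continuing the session above:

// Typed equivalent of the column-expression filter above:
ds.filter(p => p.FirstName == "Raymond").show()

// Field access is checked at compile time (a typo in FirstName would not compile):
ds.map(p => p.FirstName + " " + p.LastName).show()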
Spark DataFrame examples
Kontext provides many examples of Spark DataFrame operations and transformations.
Refer to series: Spark DataFrame Transformation Tutorials