Spark Dataset and DataFrame
Spark Dataset
Spark Dataset was introduced in Spark 1.6 and brings the benefits of Spark SQL to RDDs. It is a distributed collection of data with a strongly typed API.
The Dataset API is available in Scala and Java. It is not supported in Python or R due to the dynamic nature of those languages. However, because of that same dynamic nature, many of the Dataset API's benefits are already available in those languages; for example, you can access a row's fields by name through the DataFrame API in Python or R.
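As a quick illustration, a typed Dataset can be created directly from a local collection. The following is a minimal sketch, assuming a spark-shell session where a SparkSession named spark is already available (the Point case class is introduced here purely for illustration):

import spark.implicits._

// A strongly typed record; the Dataset carries this type at compile time.
case class Point(x: Int, y: Int)

// toDS() converts a local Seq into a distributed Dataset[Point].
val points = Seq(Point(1, 2), Point(3, 4)).toDS()

// Typed transformation: the lambda receives Point objects, not generic Rows.
points.map(p => p.x + p.y).show()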
Spark DataFrame
Spark DataFrame is a Dataset of Rows with named columns. It is like a table in a relational database.
In Java, a Spark DataFrame is represented as a Dataset of Row type (i.e. Dataset&lt;Row&gt;). In Scala, DataFrame is simply a type alias for Dataset[Row]. In Python and R, the DataFrame API provides similar functionality.
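A small sketch of this relationship in Scala (again assuming an active SparkSession named spark): because DataFrame is only a type alias, a DataFrame value can be assigned to a Dataset[Row] without any conversion.

import org.apache.spark.sql.{DataFrame, Dataset, Row}

val df: DataFrame = spark.range(3).toDF("id")

// Compiles as-is: DataFrame and Dataset[Row] are the same type in Scala.
val rows: Dataset[Row] = df
rows.show()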
Spark Dataset example in Scala
The following code snippet provides an example of creating a Dataset using Scala. The code first creates a case class named Person; it then creates a Dataset object from a CSV file. Note that import spark.implicits._ must come first, as it brings in the encoders and the $ column syntax used below.

import spark.implicits._

case class Person(FirstName: String, LastName: String)

val ds = spark.read.format("csv")
  .option("header", "true")
  .load("file:///F:/big-data/person.csv")
  .as[Person]

ds.select($"FirstName").show()
ds.where($"FirstName" === "Raymond").show()
ds.filter($"FirstName" === "Raymond").show()
The file content looks like the following:
FirstName,LastName
"Raymond","Tang"
"Kontext","Admin"
Output:
scala> ds.filter($"FirstName" === "Raymond").show()
+---------+--------+
|FirstName|LastName|
+---------+--------+
| Raymond| Tang|
+---------+--------+
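Because ds is a typed Dataset[Person], column expressions are not the only option: typed transformations take plain Scala lambdas and are checked at compile time. A short sketch continuing the session above:

// Typed equivalent of the column-expression filter above:
ds.filter(p => p.FirstName == "Raymond").show()

// Field access is checked at compile time (a typo in FirstName would not compile):
ds.map(p => p.FirstName + " " + p.LastName).show()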
Spark DataFrame examples
Kontext provides many examples of Spark DataFrame operations and transformations.
Refer to series: Spark DataFrame Transformation Tutorials