Spark Dataset and DataFrame

Raymond Raymond, 2021-10-13

Spark Dataset

Spark Dataset was introduced in Spark 1.6. It is a distributed collection of data that combines the benefits of RDDs (strong typing and the ability to use lambda functions) with the optimized execution engine of Spark SQL.

The Dataset API is available in Scala and Java. It is not supported in Python or R because those languages are dynamically typed. However, that same dynamic nature means you can still access columns by name through the DataFrame object in Python or R.
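To make the typed versus untyped distinction concrete, here is a minimal Scala sketch. It assumes a local SparkSession (the app name and master setting are illustrative) and builds a tiny in-memory Dataset instead of reading a file.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Same shape as the Person class used later in this article.
case class Person(FirstName: String, LastName: String)

// Local session for experimentation; the app name and master are assumptions.
val spark = SparkSession.builder().appName("typed-vs-untyped").master("local[*]").getOrCreate()
import spark.implicits._

val ds: Dataset[Person] = Seq(Person("Raymond", "Tang"), Person("Kontext", "Admin")).toDS()

// Typed API: field names are checked by the Scala compiler.
val lastNames: Array[String] =
  ds.filter(p => p.FirstName == "Raymond").map(p => p.LastName).collect()

// Untyped API: column names are plain strings, resolved when Spark analyzes the query.
val firstNames: Array[String] = ds.toDF().select("FirstName").as[String].collect()
```

A misspelled field such as p.FirstNam fails at compile time in the typed version, while a misspelled column string in the untyped version only fails when Spark analyzes the query.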

Spark DataFrame

Spark DataFrame is a Dataset of Rows with named columns. It is like a table in a relational database.

In Java, a Spark DataFrame is a Dataset of Row type (i.e. Dataset<Row>). In Scala, the DataFrame type is an alias for Dataset[Row]. In Python and R, the DataFrame type provides similar functions.
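The alias relationship in Scala can be sketched as follows. The example assumes a local SparkSession and in-memory sample data; the variable names are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

case class Person(FirstName: String, LastName: String)

// Local session for experimentation; the app name and master are assumptions.
val spark = SparkSession.builder().appName("dataframe-alias").master("local[*]").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("Raymond", "Tang"), ("Kontext", "Admin")).toDF("FirstName", "LastName")

// Compiles without any conversion because DataFrame is a type alias for Dataset[Row].
val rows: Dataset[Row] = df

// Row fields are looked up by name at runtime.
val firstRowName: String = rows.head().getAs[String]("FirstName")

// as[Person] turns the untyped rows into a typed Dataset; toDF() converts back.
val people: Dataset[Person] = df.as[Person]
val again: DataFrame = people.toDF()
```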

Spark Dataset example via Scala

The following code snippet provides an example of creating a Dataset using Scala.

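import spark.implicits._

// Field names must match the CSV header so that .as[Person] can map the columns.
case class Person(FirstName: String, LastName: String)

// Read the CSV file (columns are read as strings) and convert it to a typed Dataset.
val ds = spark.read.format("csv").option("header","true").load("file:///F:/big-data/person.csv").as[Person]
ds.select($"FirstName").show()
// where is an alias for filter, so the next two lines return the same rows.
ds.where($"FirstName" === "Raymond").show()
ds.filter($"FirstName" === "Raymond").show()

The code first imports spark.implicits._, which .as[Person] and the $"..." column syntax rely on. It then creates a case class named Person and a Dataset object from a CSV file.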

The file content looks like the following:

FirstName,LastName
"Raymond","Tang"
"Kontext","Admin"

Output:

scala> ds.filter($"FirstName" === "Raymond").show()
+---------+--------+
|FirstName|LastName|
+---------+--------+
|  Raymond|    Tang|
+---------+--------+
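
To try the example without the file path used above, the following sketch writes the same two sample rows to a temporary CSV file and reads them back. The temporary file, the local SparkSession settings and the variable names are assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(FirstName: String, LastName: String)

// Local session for experimentation; the app name and master are assumptions.
val spark = SparkSession.builder().appName("csv-dataset").master("local[*]").getOrCreate()
import spark.implicits._

// Write the sample rows to a temporary file (hypothetical location).
val csvPath = Files.createTempFile("person", ".csv")
val content = "FirstName,LastName\n\"Raymond\",\"Tang\"\n\"Kontext\",\"Admin\"\n"
Files.write(csvPath, content.getBytes(StandardCharsets.UTF_8))

// Read the file with a header row and map each row to a Person.
val ds: Dataset[Person] =
  spark.read.option("header", "true").csv(csvPath.toUri.toString).as[Person]

// Same column-based filter as in the article, collected to the driver.
val matches: Array[Person] = ds.filter($"FirstName" === "Raymond").collect()
```

Collecting to the driver is only reasonable for small results like this one; for larger data, show() or a write operation is the safer choice.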

Spark DataFrame examples

Kontext provides many examples about Spark DataFrame and transformations.

Refer to series: Spark DataFrame Transformation Tutorials

References

Scala: Read CSV File as Spark DataFrame
