Scala: Filter Spark DataFrame Columns with None or Null Values


This article shows you how to filter NULL/None values from a Spark data frame using Scala. Either DataFrame.filter or DataFrame.where can be used to filter out null values; filter is an alias for where.

Code snippet

Let's first construct a data frame with a null value in one of its columns.

val data = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", null, "This is category C"))

import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Create a schema for the dataframe
val schema =
  StructType(
    StructField("Category", StringType, true) ::
    StructField("Count", IntegerType, true) ::
    StructField("Description", StringType, true) :: Nil)

// Convert the array to a List of Row objects
val rows = data.map(t => Row(t(0), t(1), t(2))).toList

// Create RDD
val rdd = spark.sparkContext.parallelize(rows)

// Create data frame
val df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()

The content of the data frame looks like this:

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C| null|This is category C|
+----------+-----+------------------+

Filter using SQL expression

The following code filters rows using SQL expressions:

df.filter("Count is not null").show()
df.where("Count is null").show()

The standard ANSI SQL predicates IS NOT NULL and IS NULL are used.

Output:

scala> df.filter("Count is not null").show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
+----------+-----+------------------+

scala> df.where("Count is null").show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category C| null|This is category C|
+----------+-----+------------------+
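The same predicates can also be run as full SQL queries by first registering the data frame as a temporary view. This is a sketch continuing from the df built above; the view name categories is just an illustrative choice:

```scala
// Register the data frame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("categories")

// Same IS NOT NULL / IS NULL predicates, expressed as SQL queries.
spark.sql("SELECT * FROM categories WHERE Count IS NOT NULL").show()
spark.sql("SELECT * FROM categories WHERE Count IS NULL").show()
```

This route is handy when the rest of your pipeline is already SQL-based.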

Filter using column

import org.apache.spark.sql.functions.isnull
import spark.implicits._
df.filter(isnull($"Count")).show()
df.filter(!isnull($"Count")).show()

The above code snippet passes a Column object of BooleanType to the filter or where function. Note that isnull comes from org.apache.spark.sql.functions, and the $ column syntax requires import spark.implicits._. If a boolean column already exists in the data frame, you can pass it in directly as the condition.

Output:

scala> df.filter(isnull($"Count")).show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category C| null|This is category C|
+----------+-----+------------------+

scala> df.filter(!isnull($"Count")).show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
+----------+-----+------------------+
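If you prefer column methods over the isnull function, the Column class itself provides isNull and isNotNull, which avoids the extra functions import; the data frame API also offers na.drop as a shortcut for removing rows with nulls in selected columns. A sketch, assuming the df built above:

```scala
// Column methods: equivalent to isnull($"Count") / !isnull($"Count").
df.filter($"Count".isNull).show()
df.filter($"Count".isNotNull).show()

// DataFrameNaFunctions: drop rows whose Count column is null.
df.na.drop(Seq("Count")).show()
```

na.drop is the most concise option when you only want to discard the null rows rather than select them.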

Run Spark code

You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). If you don't have a Spark environment yet, follow the setup articles on Kontext to set one up.

Last modified by Raymond 8 months ago. This page is subject to Site terms.