Scala: Filter Spark DataFrame Columns with None or Null Values


This article shows you how to filter NULL/None values from a Spark data frame using Scala. The DataFrame.filter or DataFrame.where function can be used to filter out null values; filter is an alias of where.

Code snippet

Let's first construct a data frame with a null value in one of its columns.

val data = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", null, "This is category C"))

import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Create a schema for the dataframe
val schema =
  StructType(
    StructField("Category", StringType, true) ::
    StructField("Count", IntegerType, true) ::
    StructField("Description", StringType, true) :: Nil)

// Convert the array of lists to a list of Rows
val rows = data.map(t => Row(t(0), t(1), t(2))).toList

// Create RDD
val rdd = spark.sparkContext.parallelize(rows)

// Create data frame
val df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()

The content of the data frame looks like this:

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C| null|This is category C|
+----------+-----+------------------+
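
Note that the sample data uses Scala's null directly. If you would rather work with Option/None, Spark's encoders map None to null automatically. The snippet below is a minimal sketch that builds the same data frame from a Seq of tuples; it assumes a SparkSession named spark, as in spark-shell, and the variable name df2 is only for illustration.

import spark.implicits._

// Option[Int] in the tuple becomes a nullable IntegerType column; None turns into null
val df2 = Seq(
  ("Category A", Some(100), "This is category A"),
  ("Category B", Some(120), "This is category B"),
  ("Category C", None, "This is category C")
).toDF("Category", "Count", "Description")

df2.show()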

Filter using SQL expression

The following code filters rows using SQL expressions:

df.filter("Count is not null").show()
df.where("Count is null").show()

Standard ANSI-SQL expressions IS NOT NULL and IS NULL are used.

Output:

scala> df.filter("Count is not null").show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
+----------+-----+------------------+

scala> df.where("Count is null").show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category C| null|This is category C|
+----------+-----+------------------+
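
You can also register the data frame as a temporary view and run the same predicates as full SQL statements through spark.sql. A small sketch, assuming the view name categories is free to use:

// Register a temporary view so the data frame can be queried with SQL
df.createOrReplaceTempView("categories")

spark.sql("SELECT * FROM categories WHERE Count IS NOT NULL").show()
spark.sql("SELECT * FROM categories WHERE Count IS NULL").show()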

Filter using Column objects

import org.apache.spark.sql.functions.isnull
import spark.implicits._ // enables the $"Count" column syntax
df.filter(isnull($"Count")).show()
df.filter(!isnull($"Count")).show()

The above code snippet passes a Column of BooleanType to the filter or where function. If a boolean column already exists in the data frame, you can pass it in directly as the condition.

Output:

scala> df.filter(isnull($"Count")).show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category C| null|This is category C|
+----------+-----+------------------+

scala> df.filter(!isnull($"Count")).show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
+----------+-----+------------------+
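
If you prefer not to use the isnull function, the Column methods isNull and isNotNull express the same conditions, and DataFrameNaFunctions can drop rows with nulls in selected columns. A brief sketch:

import org.apache.spark.sql.functions.col

// Equivalent filters built from Column methods
df.filter(col("Count").isNotNull).show()
df.filter(col("Count").isNull).show()

// Drop any row whose Count column is null
df.na.drop(Seq("Count")).show()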

Run Spark code

You can easily run Spark code on Windows or UNIX-like (Linux, macOS) systems. Follow the environment setup articles on this site if you don't have a Spark environment yet.
