Scala: Filter Spark DataFrame Columns with None or Null Values

event 2020-12-14 visibility 6,530 comment 0 insights
more_vert
insights Stats
Raymond Raymond Spark & PySpark

Apache Spark installation guides, performance tuning tips, general tutorials, etc.

*Spark logo is a registered trademark of Apache Spark.


This article shows you how to filter NULL/None values from a Spark data frame using Scala. Function DataFrame.filter or DataFrame.where can be used to filter out null values. Function filter is alias name for where function. 

Code snippet

Let's first construct a data frame with None values in some column.

val data = 
Array(List("Category A", 100, "This is category A"),
List("Category B", 120, "This is category B"),
List("Category C", null, "This is category C"))

import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Create a schema for the dataframe
val schema =
  StructType(
    StructField("Category", StringType, true) ::
    StructField("Count", IntegerType, true) ::
    StructField("Description", StringType, true) :: Nil)

// Convert list to List of Row
val rows = data.map(t=>Row(t(0),t(1),t(2))).toList

// Create RDD
val rdd = spark.sparkContext.parallelize(rows)

// Create data frame
val df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()

The content of the data frame looks like this:

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C| null|This is category C|
+----------+-----+------------------+

Filter using SQL expression

The following code filter columns using SQL:

df.filter("Count is not null").show()
df.where("Count is null").show()

Standard ANSI-SQL expressions IS NOT NULL and IS NULL are used.

Output:

scala> df.filter("Count is not null").show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
+----------+-----+------------------+

scala> df.where("Count is null").show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category C| null|This is category C|
+----------+-----+------------------+

Filter using column

df.filter(isnull($"Count")).show()
df.filter(!isnull($"Count")).show()

The above code snippet pass in a type.BooleanType Column object to the filter or where function. If there is a boolean column existing in the data frame, you can directly pass it in as condition.

Output:

scala> df.filter(isnull($"Count")).show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category C| null|This is category C|
+----------+-----+------------------+

scala> df.filter(!isnull($"Count")).show()
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
+----------+-----+------------------+

Run Spark code

You can easily run Spark code on your Windows or UNIX-alike (Linux, MacOS) systems. Follow these articles to setup your Spark environment if you don't have one yet:

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts