This article shows you how to filter NULL/None values from a Spark data frame using Scala. The DataFrame.filter or DataFrame.where function can be used to filter out null values; filter is simply an alias for where.
Code snippet
Let's first construct a data frame with a null value in one of its columns.
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val data = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", null, "This is category C"))

// Create a schema for the data frame
val schema = StructType(
  StructField("Category", StringType, true) ::
  StructField("Count", IntegerType, true) ::
  StructField("Description", StringType, true) :: Nil)

// Convert each record into a Row
val rows = data.map(t => Row(t(0), t(1), t(2))).toList

// Create an RDD of Rows
val rdd = spark.sparkContext.parallelize(rows)

// Create the data frame
val df = spark.createDataFrame(rdd, schema)
println(df.schema)
df.show()
The content of the data frame looks like this:
+----------+-----+------------------+
| Category|Count| Description|
+----------+-----+------------------+
|Category A| 100|This is category A|
|Category B| 120|This is category B|
|Category C| null|This is category C|
+----------+-----+------------------+
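As a side note, the same data frame can be built without an explicit schema by using Scala tuples with an Option[Int] column; None then becomes null in the resulting column. This is a minimal sketch (the variable name df2 is our own choice), assuming the same spark session as above:

```scala
// Alternative construction sketch: Option[Int] models the missing Count;
// None is rendered as null in the resulting data frame.
import spark.implicits._

val df2 = Seq(
  ("Category A", Some(100), "This is category A"),
  ("Category B", Some(120), "This is category B"),
  ("Category C", None,      "This is category C")
).toDF("Category", "Count", "Description")

df2.show()
```

With this approach Spark infers the schema (including nullability) from the tuple types instead of requiring a hand-written StructType.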
Filter using SQL expression
The following code filters rows using SQL expressions:
df.filter("Count is not null").show()
df.where("Count is null").show()
The standard ANSI SQL expressions IS NOT NULL and IS NULL are used.
Output:
scala> df.filter("Count is not null").show()
+----------+-----+------------------+
| Category|Count| Description|
+----------+-----+------------------+
|Category A| 100|This is category A|
|Category B| 120|This is category B|
+----------+-----+------------------+
scala> df.where("Count is null").show()
+----------+-----+------------------+
| Category|Count| Description|
+----------+-----+------------------+
|Category C| null|This is category C|
+----------+-----+------------------+
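The same conditions can also be written as full SQL queries against a temporary view. A brief sketch (the view name categories is our own choice, not from the snippets above):

```scala
// Register the data frame as a temporary view, then filter with plain SQL.
// The view name "categories" is arbitrary.
df.createOrReplaceTempView("categories")
spark.sql("SELECT * FROM categories WHERE Count IS NOT NULL").show()
spark.sql("SELECT * FROM categories WHERE Count IS NULL").show()
```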
Filter using column
import org.apache.spark.sql.functions.isnull
import spark.implicits._ // enables the $"columnName" syntax

df.filter(isnull($"Count")).show()
df.filter(!isnull($"Count")).show()
The above code snippet passes a Column object of BooleanType to the filter or where function. If a boolean column already exists in the data frame, you can pass it in directly as the condition.
Output:
scala> df.filter(isnull($"Count")).show()
+----------+-----+------------------+
| Category|Count| Description|
+----------+-----+------------------+
|Category C| null|This is category C|
+----------+-----+------------------+
scala> df.filter(!isnull($"Count")).show()
+----------+-----+------------------+
| Category|Count| Description|
+----------+-----+------------------+
|Category A| 100|This is category A|
|Category B| 120|This is category B|
+----------+-----+------------------+
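The Column class itself also exposes isNull and isNotNull methods, and the data frame's na functions provide a shortcut for dropping rows with nulls. A short sketch of both, equivalent to the filters above:

```scala
import spark.implicits._

// Column methods, equivalent to the isnull(...) calls above
df.filter($"Count".isNotNull).show()
df.where($"Count".isNull).show()

// na.drop removes rows that have a null in any of the listed columns
df.na.drop(Seq("Count")).show()
```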
Run Spark code
You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). Follow these articles to set up your Spark environment if you don't have one yet: