Scala: Filter Spark DataFrame Columns with None or Null Values
This article shows you how to filter NULL/None values from a Spark data frame using Scala. Function DataFrame.filter or DataFrame.where can be used to filter out null values. Function filter is alias name for where function.
Code snippet
Let's first construct a data frame with None values in some column.
val data = Array(List("Category A", 100, "This is category A"), List("Category B", 120, "This is category B"), List("Category C", null, "This is category C")) import org.apache.spark.sql._ import org.apache.spark.sql.types._ // Create a schema for the dataframe val schema = StructType( StructField("Category", StringType, true) :: StructField("Count", IntegerType, true) :: StructField("Description", StringType, true) :: Nil) // Convert list to List of Row val rows = data.map(t=>Row(t(0),t(1),t(2))).toList // Create RDD val rdd = spark.sparkContext.parallelize(rows) // Create data frame val df = spark.createDataFrame(rdd,schema) print(df.schema) df.show()
The content of the data frame looks like this:
+----------+-----+------------------+ | Category|Count| Description| +----------+-----+------------------+ |Category A| 100|This is category A| |Category B| 120|This is category B| |Category C| null|This is category C| +----------+-----+------------------+
Filter using SQL expression
The following code filter columns using SQL:
df.filter("Count is not null").show() df.where("Count is null").show()
Standard ANSI-SQL expressions IS NOT NULL and IS NULL are used.
Output:
scala> df.filter("Count is not null").show() +----------+-----+------------------+ | Category|Count| Description| +----------+-----+------------------+ |Category A| 100|This is category A| |Category B| 120|This is category B| +----------+-----+------------------+ scala> df.where("Count is null").show() +----------+-----+------------------+ | Category|Count| Description| +----------+-----+------------------+ |Category C| null|This is category C| +----------+-----+------------------+
Filter using column
df.filter(isnull($"Count")).show() df.filter(!isnull($"Count")).show()
The above code snippet pass in a type.BooleanType Column object to the filter or where function. If there is a boolean column existing in the data frame, you can directly pass it in as condition.
Output:
scala> df.filter(isnull($"Count")).show() +----------+-----+------------------+ | Category|Count| Description| +----------+-----+------------------+ |Category C| null|This is category C| +----------+-----+------------------+ scala> df.filter(!isnull($"Count")).show() +----------+-----+------------------+ | Category|Count| Description| +----------+-----+------------------+ |Category A| 100|This is category A| |Category B| 120|This is category B| +----------+-----+------------------+
Run Spark code
You can easily run Spark code on your Windows or UNIX-alike (Linux, MacOS) systems. Follow these articles to setup your Spark environment if you don't have one yet: