Raymond Raymond

Filter Spark DataFrame Columns with None or Null Values

event 2020-08-10 visibility 41,489 comment 0 insights toc
insights Stats

This article shows you how to filter NULL/None values from a Spark data frame using Python. Function DataFrame.filter or DataFrame.where can be used to filter out null values. Function filter is alias name for where function. 

Code snippet

Let's first construct a data frame with None values in some column.

from pyspark.sql import SparkSession
from decimal import Decimal

appName = "Spark - Filter rows with null values"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \


# List
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal(12.40)},
        {"Category": 'Category B', "ID": 2, "Value": Decimal(30.10)},
        {"Category": 'Category C', "ID": 3, "Value": None},
        {"Category": 'Category D', "ID": 4, "Value": Decimal(1.0)},

# Create data frame
df = spark.createDataFrame(data)

The content of the data frame looks like this:

|  Category| ID|               Value|
|Category A|  1|12.40000000000000...|
|Category B|  2|30.10000000000000...|
|Category C|  3|                null|
|Category D|  4|1.000000000000000000|

Filter using SQL expression

The following code filter columns using SQL:

df.filter("Value is not null").show()
df.where("Value is null").show()

Standard ANSI-SQL expressions IS NOT NULL and IS NULL are used.



Filter using column


The above code snippet pass in a type.BooleanType Column object to the filter or where function. If there is a boolean column existing in the data frame, you can directly pass it in as condition.



Run Spark code

You can easily run Spark code on your Windows or UNIX-alike (Linux, MacOS) systems. Follow these articles to setup your Spark environment if you don't have one yet:

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts