Filter Spark DataFrame Columns with None or Null Values


This article shows you how to filter NULL/None values from a Spark data frame using Python. The DataFrame.filter or DataFrame.where function can be used to filter out null values; filter is simply an alias for where.

Code snippet

Let's first construct a data frame with None values in one of its columns.

from pyspark.sql import SparkSession
from decimal import Decimal

appName = "Spark - Filter rows with null values"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# List
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal(12.40)},
        {"Category": 'Category B', "ID": 2, "Value": Decimal(30.10)},
        {"Category": 'Category C', "ID": 3, "Value": None},
        {"Category": 'Category D', "ID": 4, "Value": Decimal(1.0)},
        ]

# Create data frame
df = spark.createDataFrame(data)
df.show()

The content of the data frame looks like this:

+----------+---+--------------------+
|  Category| ID|               Value|
+----------+---+--------------------+
|Category A|  1|12.40000000000000...|
|Category B|  2|30.10000000000000...|
|Category C|  3|                null|
|Category D|  4|1.000000000000000000|
+----------+---+--------------------+
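
To confirm that the Value column is nullable, and to see the decimal type Spark inferred from the Python Decimal objects, you can print the schema. This is an optional step that is not part of the original snippet:

# Inspect the inferred schema; Value is a nullable decimal column,
# which is why the missing entry is shown as null above.
df.printSchema()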

Filter using SQL expression

The following code filters rows using SQL expressions:

df.filter("Value is not null").show()
df.where("Value is null").show()

Standard ANSI-SQL expressions IS NOT NULL and IS NULL are used.

Output: the first statement returns the rows for Categories A, B and D (non-null Value); the second returns only the Category C row, where Value is null.
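
If you prefer writing full SQL statements, the same IS NULL / IS NOT NULL predicates can be used in a query against a temporary view created from the data frame. The sketch below reuses the df object from above; the view name df_values is just an illustrative choice.

# Register the data frame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("df_values")

# Same filters expressed as full SQL statements.
spark.sql("SELECT * FROM df_values WHERE Value IS NOT NULL").show()
spark.sql("SELECT * FROM df_values WHERE Value IS NULL").show()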

Filter using column

df.filter(df['Value'].isNull()).show()
df.where(df.Value.isNotNull()).show()

The above code snippet passes a Column object of BooleanType to the filter or where function. If the data frame already contains a boolean column, you can pass it in directly as the condition.

Output: the first statement returns only the Category C row (null Value); the second returns the rows for Categories A, B and D.
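
You can also build the condition with the col and isnull functions from pyspark.sql.functions instead of referencing data frame attributes, and an existing boolean column can be passed in directly. The sketch below reuses the df object from above; the ValueMissing column is added purely for illustration.

from pyspark.sql.functions import col, isnull

# Same filters expressed with column functions.
df.filter(col("Value").isNotNull()).show()
df.where(isnull(col("Value"))).show()

# A boolean column can be passed in directly as the condition.
df2 = df.withColumn("ValueMissing", isnull(col("Value")))
df2.filter(df2["ValueMissing"]).show()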


Run Spark code

You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). Follow the Spark installation articles on this site to set up your Spark environment if you don't have one yet.
