Filter Spark DataFrame Columns with None or Null Values

This article shows you how to filter NULL/None values from a Spark data frame using Python. Function DataFrame.filter or DataFrame.where can be used to filter out null values; filter is an alias for the where function.

Code snippet

Let's first construct a data frame with None values in some column.

from pyspark.sql import SparkSession
from decimal import Decimal

appName = "Spark - Filter rows with null values"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# List
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal(12.40)},
        {"Category": 'Category B', "ID": 2, "Value": Decimal(30.10)},
        {"Category": 'Category C', "ID": 3, "Value": None},
        {"Category": 'Category D', "ID": 4, "Value": Decimal(1.0)},
        ]

# Create data frame
df = spark.createDataFrame(data)
df.show()

The content of the data frame looks like this:

+----------+---+--------------------+
|  Category| ID|               Value|
+----------+---+--------------------+
|Category A|  1|12.40000000000000...|
|Category B|  2|30.10000000000000...|
|Category C|  3|                null|
|Category D|  4|1.000000000000000000|
+----------+---+--------------------+
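If you want to check how Spark inferred the schema, call printSchema. Because the sample data uses Python Decimal objects, the Value column is inferred as a nullable decimal type (the exact precision may vary with your Spark version):

df.printSchema()

The schema looks similar to this:

root
 |-- Category: string (nullable = true)
 |-- ID: long (nullable = true)
 |-- Value: decimal(38,18) (nullable = true)

The nullable flag on Value is what allows the missing value for Category C to be stored as null.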

Filter using SQL expression

The following code filters rows using SQL expressions:

df.filter("Value is not null").show()
df.where("Value is null").show()

Standard ANSI-SQL expressions IS NOT NULL and IS NULL are used.

Output:
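For the sample data frame above, the first statement keeps the rows whose Value is not null, and the second keeps only the row with a null Value, so the output should look like this:

+----------+---+--------------------+
|  Category| ID|               Value|
+----------+---+--------------------+
|Category A|  1|12.40000000000000...|
|Category B|  2|30.10000000000000...|
|Category D|  4|1.000000000000000000|
+----------+---+--------------------+

+----------+---+-----+
|  Category| ID|Value|
+----------+---+-----+
|Category C|  3| null|
+----------+---+-----+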

Filter using column

df.filter(df['Value'].isNull()).show()
df.where(df.Value.isNotNull()).show()

The above code snippets pass a Column object of BooleanType to the filter or where function. If a boolean column already exists in the data frame, you can pass it directly as the condition.

Output:
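Based on the same sample data, the isNull filter returns only Category C and the isNotNull filter returns the other three rows:

+----------+---+-----+
|  Category| ID|Value|
+----------+---+-----+
|Category C|  3| null|
+----------+---+-----+

+----------+---+--------------------+
|  Category| ID|               Value|
+----------+---+--------------------+
|Category A|  1|12.40000000000000...|
|Category B|  2|30.10000000000000...|
|Category D|  4|1.000000000000000000|
+----------+---+--------------------+

You can also build the same conditions with the col function from pyspark.sql.functions and combine them with & or |. The snippet below is a minimal sketch against the sample data frame above (the ID > 2 condition is only an illustration):

from pyspark.sql.functions import col

# Same null check expressed with the col function
df.filter(col("Value").isNotNull()).show()

# Boolean conditions can be combined; wrap each condition in parentheses
df.where(col("Value").isNotNull() & (col("ID") > 2)).show()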


Run Spark code

You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). Follow the Spark installation articles on this site to set up your environment if you don't have one yet.
