"Delete" Rows (Data) from PySpark DataFrame
This article shows how to 'delete' rows/data from a Spark data frame using Python. I put the word "delete" in quotes because we are not really deleting any data: Spark DataFrames are immutable and transformations are evaluated lazily, so this is very different from creating a data frame in memory and then physically deleting some rows from it.
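To see what lazy evaluation means in practice, here is a minimal sketch (the session and data below are illustrative only, not the example used in the rest of this article). A transformation such as where only records a plan; nothing runs until an action is called, and the original data frame is never changed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("LazyDemo").getOrCreate()

# A tiny illustrative data frame
df = spark.createDataFrame([("a", 1), ("b", 2)], ["Category", "ID"])

filtered = df.where("ID > 1")  # returns immediately; no data is scanned yet
print(filtered.count())        # 1 - the action triggers the actual computation
print(df.count())              # 2 - the original data frame still has all rows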
Construct a DataFrame
Follow the article Convert Python Dictionary List to PySpark DataFrame to construct a DataFrame like the following:
+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1| 12.40|
|Category B|  2| 30.10|
|Category C|  3|100.01|
+----------+---+------+
'Delete' or 'Remove' rows
The words 'delete' and 'remove' can be misleading, as Spark evaluates transformations lazily and never modifies an existing DataFrame in place.
We can use the where or filter function to 'remove' or 'delete' rows from a DataFrame.
from pyspark.sql import SparkSession

appName = "Python Example - 'Delete' Data from DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

# Create data frame
df = spark.createDataFrame(data)
print(df.schema)

# Delete/Remove data from the dataframe
df2 = df.where("Category <> 'Category B'")
df2.show()
Output:
+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1|  12.4|
|Category C|  3|100.01|
+----------+---+------+
The Category B row is removed from the new DataFrame df2.
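The where clause above takes a SQL expression string. Equivalently, the condition can be written with column expressions (filter is an alias of where in PySpark). A short sketch, reusing the df created in the snippet above:

from pyspark.sql.functions import col

# Equivalent ways to express the same 'delete':
df.filter(df["Category"] != "Category B").show()
df.where(col("Category") != "Category B").show()

# 'Delete' rows matching any of several values, using isin with negation (~):
df.where(~col("Category").isin("Category B", "Category C")).show()

# Transformations return new DataFrames; the original is unchanged.
print(df.count())  # still 3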
Execution plans
Use the following code to print the execution plans of the data frame:
df2.explain(True)
The plan tells us that it is not a 'remove' or 'delete' operation; the more accurate term would be 'filter'.
== Parsed Logical Plan ==
'Filter NOT ('Category = Category B)
+- LogicalRDD [Category#0, ID#1L, Value#2], false

== Analyzed Logical Plan ==
Category: string, ID: bigint, Value: double
Filter NOT (Category#0 = Category B)
+- LogicalRDD [Category#0, ID#1L, Value#2], false

== Optimized Logical Plan ==
Filter (isnotnull(Category#0) AND NOT (Category#0 = Category B))
+- LogicalRDD [Category#0, ID#1L, Value#2], false

== Physical Plan ==
*(1) Filter (isnotnull(Category#0) AND NOT (Category#0 = Category B))
+- *(1) Scan ExistingRDD[Category#0,ID#1L,Value#2]
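If you are on Spark 3.0 or above, explain also accepts a mode argument for different levels of detail (earlier versions only support the boolean extended flag used above):

df2.explain(mode="simple")     # physical plan only
df2.explain(mode="extended")   # same output as explain(True)
df2.explain(mode="formatted")  # plan outline followed by per-node details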
Run Spark code
You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). Follow these articles to set up your Spark environment if you don't have one yet: