"Delete" Rows (Data) from PySpark DataFrame
This article shows how to 'delete' rows/data from a Spark data frame using Python. I put the word "delete" in quotes because we are not really deleting any data: Spark DataFrames are immutable and transformations are evaluated lazily, so this is very different from creating a data frame in memory and then physically deleting some rows from it.
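To see what lazy evaluation means in practice, here is a minimal sketch (the session and data below are illustrative only, not the example used in the rest of this article). A transformation such as where only records a plan; nothing runs until an action is called, and the original data frame is never changed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("LazyDemo").getOrCreate()

# A tiny illustrative data frame
df = spark.createDataFrame([("a", 1), ("b", 2)], ["Category", "ID"])

filtered = df.where("ID > 1")  # returns immediately; no data is scanned yet
print(filtered.count())        # 1 - the action triggers the actual computation
print(df.count())              # 2 - the original data frame still has all rows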
Construct a DataFrame
Follow the article Convert Python Dictionary List to PySpark DataFrame to construct a DataFrame like the following:
+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1| 12.40|
|Category B|  2| 30.10|
|Category C|  3|100.01|
+----------+---+------+
'Delete' or 'Remove' rows
The words 'delete' and 'remove' can be misleading, as Spark evaluates transformations lazily and never modifies an existing DataFrame in place.
We can use the where or filter function to 'remove' or 'delete' rows from a DataFrame.
from pyspark.sql import SparkSession

appName = "Python Example - 'Delete' Data from DataFrame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
        ]

# Create data frame
df = spark.createDataFrame(data)
print(df.schema)

# Delete/Remove data from the dataframe
df2 = df.where("Category <> 'Category B'")
df2.show()
Output:
+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1|  12.4|
|Category C|  3|100.01|
+----------+---+------+
The Category B row is removed from the new DataFrame df2.
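The where clause above takes a SQL expression string. Equivalently, the condition can be written with column expressions (filter is an alias of where in PySpark). A short sketch, reusing the df created in the snippet above:

from pyspark.sql.functions import col

# Equivalent ways to express the same 'delete':
df.filter(df["Category"] != "Category B").show()
df.where(col("Category") != "Category B").show()

# 'Delete' rows matching any of several values, using isin with negation (~):
df.where(~col("Category").isin("Category B", "Category C")).show()

# Transformations return new DataFrames; the original is unchanged.
print(df.count())  # still 3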
Execution plans
Use the following code to print the execution plans of the data frame:
df2.explain(True)
The plan tells us that it is not a 'remove' or 'delete' operation; the more accurate term would be 'filter'.
== Parsed Logical Plan ==
'Filter NOT ('Category = Category B)
+- LogicalRDD [Category#0, ID#1L, Value#2], false

== Analyzed Logical Plan ==
Category: string, ID: bigint, Value: double
Filter NOT (Category#0 = Category B)
+- LogicalRDD [Category#0, ID#1L, Value#2], false

== Optimized Logical Plan ==
Filter (isnotnull(Category#0) AND NOT (Category#0 = Category B))
+- LogicalRDD [Category#0, ID#1L, Value#2], false

== Physical Plan ==
*(1) Filter (isnotnull(Category#0) AND NOT (Category#0 = Category B))
+- *(1) Scan ExistingRDD[Category#0,ID#1L,Value#2]
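If you are on Spark 3.0 or above, explain also accepts a mode argument for different levels of detail (earlier versions only support the boolean extended flag used above):

df2.explain(mode="simple")     # physical plan only
df2.explain(mode="extended")   # same output as explain(True)
df2.explain(mode="formatted")  # plan outline followed by per-node details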
Run Spark code
You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). Follow these articles to set up your Spark environment if you don't have one yet: