Deduplicate Spark DataFrame via dropDuplicates() and distinct()

Date: 2021-03-08

Two functions can be used to remove duplicates from a Spark DataFrame: distinct and dropDuplicates.

Sample DataFrame

The following code snippet creates a sample DataFrame with duplicates.

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

appName = "PySpark Example"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, 'A'), (1, 'A'), (3, 'C')]

# schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Value", StringType(), True)])

# Create Spark DataFrame from the list of tuples using the explicit schema
df = spark.createDataFrame(data, schema)
df.show()

# Stop the session only after running the snippets below, since they reuse df
spark.stop()

Output:

+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  1|    A|
|  3|    C|
+---+-----+

Function distinct

This function returns a new DataFrame with duplicate rows removed. Two rows count as duplicates only when every column matches.

Code snippet

df.distinct().show()

Output:

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+
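
The whole-row comparison can be pictured without Spark. The following is only a plain-Python analogy, not Spark's implementation: distinct_rows is a hypothetical helper, and it keeps first-seen order purely for readability, whereas the output above shows that Spark may return rows in a different order.

```python
# Plain-Python analogy for DataFrame.distinct(): each row is compared as a whole tuple.
# distinct_rows is a hypothetical helper for illustration, not part of PySpark.

def distinct_rows(rows):
    """Return unique rows; first-seen order is kept only for readability."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(rows))


sample = [(1, 'A'), (1, 'A'), (3, 'C')]
print(distinct_rows(sample))

# Rows that share an ID but differ in Value are not duplicates
print(distinct_rows([(1, 'A'), (1, 'B')]))
```

In the second call both rows survive, because only rows that are identical in every column are collapsed.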

Function dropDuplicates

This function also returns a new DataFrame with duplicate rows removed. It accepts one optional argument, a list of column names, that restricts the comparison to a subset of columns. It also has an alias, drop_duplicates.

Code snippet

df.dropDuplicates().show()
df.drop_duplicates().show()
df.drop_duplicates(["ID"]).show()
df.dropDuplicates(["Value"]).show()

Output:

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  3|    C|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+
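
With the sample data every call returns the same two rows, so the effect of the subset argument is easy to miss. The sketch below models it in plain Python on slightly different data, where two rows share an ID but differ in Value. The helper drop_duplicates_model and its data are hypothetical, and keeping the first row per key is an assumption made for this sketch; Spark does not promise which of the matching rows it keeps.

```python
# Plain-Python model of dropDuplicates(subset); an illustration, not the PySpark API.
# Assumption: the first row seen for each key is kept. Spark gives no such guarantee.

def drop_duplicates_model(rows, columns, subset=None):
    """Keep one row per distinct combination of the subset columns."""
    # With no subset, every column takes part in the comparison
    key_positions = [columns.index(name) for name in (subset or columns)]
    kept = {}
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        kept.setdefault(key, row)  # later rows with the same key are ignored
    return list(kept.values())


columns = ["ID", "Value"]
rows = [(1, 'A'), (1, 'B'), (3, 'C')]

print(drop_duplicates_model(rows, columns))             # all three rows differ
print(drop_duplicates_model(rows, columns, ["ID"]))     # one row per ID
print(drop_duplicates_model(rows, columns, ["Value"]))  # one row per Value
```

Deduplicating on ID drops one of the two rows with ID 1, while deduplicating on Value or on all columns keeps all three rows. When it matters which row survives, sorting or ranking the data explicitly before deduplicating is the safer choice.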

Last modified by Raymond 10 months ago.