Deduplicate Spark DataFrame via dropDuplicates() and distinct()

Raymond Raymond, 2021-03-08

Two functions can be used to remove duplicates from a Spark DataFrame: distinct and dropDuplicates.

Sample DataFrame

The following code snippet creates a sample DataFrame with duplicates.

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

appName = "PySpark Example"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, 'A'), (1, 'A'), (3, 'C')]

# schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Value", StringType(), True)])

# Create Spark DataFrame from the list of tuples and the schema
df = spark.createDataFrame(data, schema)
df.show()

# Call spark.stop() only after running the snippets below, since they reuse df

Output:

+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  1|    A|
|  3|    C|
+---+-----+

Function distinct

This function returns a new DataFrame with duplicate rows removed. The order of the returned rows is not guaranteed.

Code snippet

df.distinct().show()

Output:

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

Function dropDuplicates

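Like distinct, this function removes duplicate rows. It also accepts an optional argument that specifies a subset of columns to consider when identifying duplicates. When a subset is given, one row is kept for each distinct combination of values in those columns. It also has an alias, drop_duplicates.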

Code snippet

df.dropDuplicates().show()
df.drop_duplicates().show()
df.drop_duplicates(["ID"]).show()
df.dropDuplicates(["Value"]).show()

Output:

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  3|    C|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+