
Remove Duplicated Rows from Spark DataFrame


There are two functions that can be used to remove duplicate rows from a Spark DataFrame: distinct and dropDuplicates.

Sample DataFrame

The following code snippet creates a sample DataFrame with duplicates.

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

appName = "PySpark Example"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, 'A'), (1, 'A'), (3, 'C')]

# schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Value", StringType(), True)])

# Create Spark DataFrame from the sample data
df = spark.createDataFrame(data, schema)
df.show()

# Keep the Spark session alive; the snippets below reuse df.
# Call spark.stop() once you are done.

Output:

+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  1|    A|
|  3|    C|
+---+-----+

Function distinct

This function returns a new DataFrame with duplicate rows removed.

Code snippet

df.distinct().show()

Output:

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

Function dropDuplicates

This function takes an optional argument that specifies a subset of columns to consider when dropping duplicates; by default all columns are used. It also has an alias drop_duplicates.

Code snippet

df.dropDuplicates().show()
df.drop_duplicates().show()
df.drop_duplicates(["ID"]).show()
df.dropDuplicates(["Value"]).show()

Output:

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  3|    C|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+
Last modified by Raymond 5 months ago.