There are two functions that can be used to remove duplicate rows from a Spark DataFrame: distinct and dropDuplicates.
Sample DataFrame
The following code snippet creates a sample DataFrame with duplicates.
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
appName = "PySpark Example"
master = "local"
# Create Spark session
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.getOrCreate()
# Sample data
data = [(1, 'A'), (1, 'A'), (3, 'C')]
# schema
schema = StructType([StructField("ID", IntegerType(), True),
StructField("Value", StringType(), True)])
# Create Spark DataFrame from the list of tuples and the schema
df = spark.createDataFrame(data, schema)
df.show()
# Note: keep the Spark session alive; the snippets below reuse df.
# Call spark.stop() only after running all the examples.
Output:
+---+-----+
| ID|Value|
+---+-----+
| 1| A|
| 1| A|
| 3| C|
+---+-----+
Function distinct
This function returns a new DataFrame with duplicate rows removed. All columns are considered when identifying duplicates.
Code snippet
df.distinct().show()
Output:
+---+-----+
| ID|Value|
+---+-----+
| 3| C|
| 1| A|
+---+-----+
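Conceptually, distinct keeps one copy of each fully identical row. A minimal pure-Python sketch of that idea (no Spark required; note that Spark itself does not guarantee any particular output order):

```python
# Plain-Python sketch of distinct semantics: dict.fromkeys removes duplicate
# rows while preserving the first occurrence of each.
rows = [(1, 'A'), (1, 'A'), (3, 'C')]
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)  # [(1, 'A'), (3, 'C')]
```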
Function dropDuplicates
This function behaves like distinct, but it also accepts an optional argument: a list of column names. When provided, only those columns are considered when identifying duplicates. It also has an alias, drop_duplicates.
Code snippet
df.dropDuplicates().show()
df.drop_duplicates().show()
df.drop_duplicates(["ID"]).show()
df.dropDuplicates(["Value"]).show()
Output:
+---+-----+
| ID|Value|
+---+-----+
| 3| C|
| 1| A|
+---+-----+
+---+-----+
| ID|Value|
+---+-----+
| 3| C|
| 1| A|
+---+-----+
+---+-----+
| ID|Value|
+---+-----+
| 1| A|
| 3| C|
+---+-----+
+---+-----+
| ID|Value|
+---+-----+
| 3| C|
| 1| A|
+---+-----+
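Subset-based deduplication keeps one row per distinct combination of values in the key columns. A minimal pure-Python sketch of that idea (not Spark's implementation): here the first row seen per key is kept, whereas Spark's dropDuplicates keeps an arbitrary row per key unless you impose an ordering first.

```python
# Plain-Python sketch of dropDuplicates(subset) semantics: keep one row per
# distinct key, where the key is built from the given column positions.
def dedup_by(rows, key_indexes):
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        if key not in seen:        # first row seen for this key wins
            seen.add(key)
            result.append(row)
    return result

data = [(1, 'A'), (1, 'A'), (3, 'C')]
print(dedup_by(data, [0]))  # deduplicate on ID -> [(1, 'A'), (3, 'C')]
```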