Deduplicate Spark DataFrame via dropDuplicates() and distinct()
Two functions can be used to remove duplicate rows from a Spark DataFrame: distinct and dropDuplicates.
Sample DataFrame
The following code snippet creates a sample DataFrame containing duplicate rows.
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

appName = "PySpark Example"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data with a duplicate row
data = [(1, 'A'), (1, 'A'), (3, 'C')]

# Schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Value", StringType(), True)])

# Create Spark DataFrame
df = spark.createDataFrame(data, schema)
df.show()
Output:
+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  1|    A|
|  3|    C|
+---+-----+
Function distinct
This function returns a new DataFrame with duplicate rows removed.
Code snippet
df.distinct().show()
Output:
+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+
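A quick way to confirm the deduplication is to compare row counts before and after. A minimal sketch, using the same sample data as above (the appName "DedupCheck" is an illustrative choice):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DedupCheck") \
    .master("local") \
    .getOrCreate()

df = spark.createDataFrame([(1, 'A'), (1, 'A'), (3, 'C')], ["ID", "Value"])

total = df.count()              # 3 rows, including the duplicate
unique = df.distinct().count()  # 2 rows after deduplication
print(total, unique)

spark.stop()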
Function dropDuplicates
Unlike distinct, this function accepts an optional argument that specifies a subset of columns to deduplicate on. It also has an alias, drop_duplicates.
Code snippet
df.dropDuplicates().show()
df.drop_duplicates().show()
df.drop_duplicates(["ID"]).show()
df.dropDuplicates(["Value"]).show()
Output:
+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  3|    C|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+
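Note that when deduplicating on a subset of columns, dropDuplicates keeps an arbitrary row for each key. If you need a deterministic choice, one option is a window function with row_number. The sketch below assumes you want the row with the smallest Value per ID; the sample data here differs from the article's (it uses distinct Values per ID) so the effect is visible:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("DeterministicDedup") \
    .master("local") \
    .getOrCreate()

df = spark.createDataFrame([(1, 'A'), (1, 'B'), (3, 'C')], ["ID", "Value"])

# Rank rows within each ID by Value and keep only the first.
w = Window.partitionBy("ID").orderBy("Value")
deduped = (df.withColumn("rn", F.row_number().over(w))
             .filter(F.col("rn") == 1)
             .drop("rn"))

rows = sorted((r.ID, r.Value) for r in deduped.collect())
print(rows)

spark.stop()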
Last modified by Raymond 10 months ago