Deduplicate Spark DataFrame via dropDuplicates() and distinct()
There are two functions that can be used to remove duplicate rows from a Spark DataFrame: distinct and dropDuplicates.
Sample DataFrame
The following code snippet creates a sample DataFrame with duplicates.
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

appName = "PySpark Example"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, 'A'), (1, 'A'), (3, 'C')]

# Schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Value", StringType(), True)])

# Create Spark DataFrame from the sample data (a list of tuples)
df = spark.createDataFrame(data, schema)
df.show()

# spark.stop()  # stop the session only after running all of the snippets below
Output:
+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  1|    A|
|  3|    C|
+---+-----+
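Before deduplicating, it can be useful to see which rows are actually duplicated. The following is a minimal sketch (not part of the original example) that groups by all columns of the df created above and keeps only the groups with more than one row:

# Show rows that appear more than once
from pyspark.sql import functions as F

df.groupBy(*df.columns) \
    .count() \
    .filter(F.col("count") > 1) \
    .show()

For the sample data, this reports the row (1, A) with a count of 2.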
Function distinct
This function returns a new DataFrame with duplicate rows removed.
Code snippet
df.distinct().show()
Output:
+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+
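A quick way to confirm the deduplication, assuming the same df, is to compare the row counts before and after:

# Compare row counts before and after deduplication
print(df.count())             # 3 rows, including the duplicate
print(df.distinct().count())  # 2 rows after duplicates are removed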
Function dropDuplicates
Unlike distinct, this function accepts an optional argument that specifies a subset of columns to consider when identifying duplicates; when called without arguments, it behaves the same as distinct. It also has an alias drop_duplicates.
Code snippet
df.dropDuplicates().show()
df.drop_duplicates().show()
df.drop_duplicates(["ID"]).show()
df.dropDuplicates(["Value"]).show()
Output:
+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  3|    C|
+---+-----+

+---+-----+
| ID|Value|
+---+-----+
|  3|    C|
|  1|    A|
+---+-----+
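Note that when a subset is specified, dropDuplicates keeps the first row Spark encounters for each key, which is not deterministic for unordered data. If you need control over which row survives, one common alternative (a sketch, not part of the original example) is a window function with row_number:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# For each ID, keep the row with the largest Value
# (descending order here is an arbitrary choice for this sketch)
w = Window.partitionBy("ID").orderBy(F.col("Value").desc())
deduped = df.withColumn("rn", F.row_number().over(w)) \
    .filter(F.col("rn") == 1) \
    .drop("rn")
deduped.show()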