Code description
The PySpark DataFrame API provides two drop-related methods: drop and dropDuplicates (or its alias drop_duplicates). The former removes the specified column(s) from a DataFrame, while the latter removes duplicated rows. This code snippet uses both functions.
Outputs:
+----+------+
|ACCT|AMT   |
+----+------+
|101 |10.01 |
|101 |10.01 |
|101 |102.01|
+----+------+
+----+----------+------+
|ACCT|TXN_DT    |AMT   |
+----+----------+------+
|101 |2021-01-01|102.01|
|101 |2021-01-01|10.01 |
+----+----------+------+
+----+----------+------+
|ACCT|TXN_DT    |AMT   |
+----+----------+------+
|101 |2021-01-01|102.01|
|101 |2021-01-01|10.01 |
+----+----------+------+
Code snippet
from pyspark.sql import SparkSession
appName = "PySpark drop and dropDuplicates"
master = "local"
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
# Create a dataframe
df = spark.sql("""SELECT ACCT, TXN_DT, AMT FROM VALUES
(101,10.01, DATE'2021-01-01'),
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01')
AS TXN(ACCT,AMT,TXN_DT)""")
print(df.schema)
# Use drop function
df.drop('TXN_DT').show(truncate=False)
# Use dropDuplicates function; drop_duplicates is the alias of dropDuplicates
df.drop_duplicates().show(truncate=False)
df.dropDuplicates().show(truncate=False)