PySpark DataFrame - drop and dropDuplicates
Code description
PySpark DataFrame APIs provide two drop-related methods: drop and dropDuplicates (or its alias drop_duplicates). The former drops specified column(s) from a DataFrame, while the latter drops duplicated rows. This code snippet uses both functions.
Outputs:
+----+------+
|ACCT|AMT   |
+----+------+
|101 |10.01 |
|101 |10.01 |
|101 |102.01|
+----+------+

+----+----------+------+
|ACCT|TXN_DT    |AMT   |
+----+----------+------+
|101 |2021-01-01|102.01|
|101 |2021-01-01|10.01 |
+----+----------+------+

+----+----------+------+
|ACCT|TXN_DT    |AMT   |
+----+----------+------+
|101 |2021-01-01|102.01|
|101 |2021-01-01|10.01 |
+----+----------+------+
Code snippet
from pyspark.sql import SparkSession

appName = "PySpark drop and dropDuplicates"
master = "local"

spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Create a dataframe
df = spark.sql("""SELECT ACCT, TXN_DT, AMT FROM VALUES
(101,10.01, DATE'2021-01-01'),
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01')
AS TXN(ACCT,AMT,TXN_DT)""")
print(df.schema)

# Use drop function
df.drop('TXN_DT').show(truncate=False)

# Use dropDuplicates function; drop_duplicates is the alias of dropDuplicates
df.drop_duplicates().show(truncate=False)
df.dropDuplicates().show(truncate=False)
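Both methods also take arguments: drop accepts multiple column names, and dropDuplicates accepts an optional list of columns to use as the deduplication key. The lines below are a minimal sketch extending the snippet above (they are not part of the original example) and reuse the same df.

# Extension of the snippet above (assumed usage, not in the original example).

# Drop several columns at once; column names that don't exist are silently ignored.
df.drop('TXN_DT', 'AMT').show(truncate=False)

# Keep one row per ACCT/TXN_DT combination; which of the duplicate rows
# is retained is not guaranteed.
df.dropDuplicates(['ACCT', 'TXN_DT']).show(truncate=False)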