The PySpark DataFrame API provides two drop-related methods: drop() and dropDuplicates() (or its alias drop_duplicates()). The former drops the specified column(s) from a DataFrame, while the latter drops duplicated rows.
This code snippet uses these two functions.
Outputs:
+----+------+
|ACCT|AMT   |
+----+------+
|101 |10.01 |
|101 |10.01 |
|101 |102.01|
+----+------+

+----+----------+------+
|ACCT|TXN_DT    |AMT   |
+----+----------+------+
|101 |2021-01-01|102.01|
|101 |2021-01-01|10.01 |
+----+----------+------+

+----+----------+------+
|ACCT|TXN_DT    |AMT   |
+----+----------+------+
|101 |2021-01-01|102.01|
|101 |2021-01-01|10.01 |
+----+----------+------+