PySpark DataFrame - drop and dropDuplicates

Code description

PySpark DataFrame APIs provide two drop-related methods: `drop` and `dropDuplicates` (or its alias `drop_duplicates`). The former drops specified column(s) from a DataFrame, while the latter drops duplicated rows.

This code snippet demonstrates both functions.

Outputs:

    +----+------+
    |ACCT|AMT   |
    +----+------+
    |101 |10.01 |
    |101 |10.01 |
    |101 |102.01|
    +----+------+
    
    +----+----------+------+
    |ACCT|TXN_DT    |AMT   |
    +----+----------+------+
    |101 |2021-01-01|102.01|
    |101 |2021-01-01|10.01 |
    +----+----------+------+
    
    +----+----------+------+
    |ACCT|TXN_DT    |AMT   |
    +----+----------+------+
    |101 |2021-01-01|102.01|
    |101 |2021-01-01|10.01 |
    +----+----------+------+  
    

Code snippet

    from pyspark.sql import SparkSession
    
    appName = "PySpark drop and dropDuplicates"
    master = "local"
    
    spark = SparkSession.builder \
        .appName(appName) \
        .master(master) \
        .getOrCreate()
    
    spark.sparkContext.setLogLevel("ERROR")
    
    # Create a DataFrame from an inline VALUES clause; the column aliases follow
    # the VALUES order (ACCT, AMT, TXN_DT) and the SELECT reorders them.
    df = spark.sql("""SELECT ACCT, TXN_DT, AMT FROM VALUES 
    (101,10.01, DATE'2021-01-01'),
    (101,10.01, DATE'2021-01-01'),
    (101,102.01, DATE'2021-01-01')
    AS TXN(ACCT,AMT,TXN_DT)""")
    
    print(df.schema)
    
    # Use drop function
    df.drop('TXN_DT').show(truncate=False)
    
    # Use dropDuplicates; drop_duplicates is an alias of dropDuplicates
    df.drop_duplicates().show(truncate=False)
    df.dropDuplicates().show(truncate=False)
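
`dropDuplicates` also accepts an optional list of column names, so duplicates can be identified by a subset of columns rather than the whole row, and `drop` accepts multiple column names at once. The following sketch reuses the `df` created above; note that when deduplicating by a subset, Spark keeps an arbitrary row from each group, so the surviving AMT value is not deterministic.

    # Deduplicate by ACCT and TXN_DT only: the two rows sharing the same
    # ACCT/TXN_DT collapse into one; which AMT survives is arbitrary.
    df.dropDuplicates(["ACCT", "TXN_DT"]).show(truncate=False)
    
    # drop can remove several columns in a single call.
    df.drop("TXN_DT", "AMT").show(truncate=False)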