PySpark DataFrame - Calculate Distinct Count of Column(s)



Code description

This code snippet shows how to calculate the distinct count of values in a PySpark DataFrame using the `countDistinct` PySpark SQL function.

Output:

    +---+-----+
    | ID|Value|
    +---+-----+
    |101|   56|
    |101|   67|
    |102|   70|
    |103|   93|
    |104|   70|
    +---+-----+
    
    +-----------------+------------------+
    |DistinctCountOfID|DistinctCountOfRow|
    +-----------------+------------------+
    |                4|                 5|
    +-----------------+------------------+  
    

Code snippet

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    
    app_name = "PySpark countDistinct Example"
    master = "local"
    
    spark = SparkSession.builder \
        .appName(app_name) \
        .master(master) \
        .getOrCreate()
    
    spark.sparkContext.setLogLevel("WARN")
    
    data = [
        [101, 56],
        [101, 67],
        [102, 70],
        [103, 93],
        [104, 70]
    ]
    
    df = spark.createDataFrame(data, ['ID', 'Value'])
    
    df.show()
    
    # groupBy() with no columns aggregates over the whole DataFrame;
    # countDistinct accepts one or more columns
    df_agg = df.groupBy() \
        .agg(F.countDistinct('ID').alias('DistinctCountOfID'),
             F.countDistinct('ID', 'Value').alias('DistinctCountOfRow'))
    df_agg.show()
    
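The same aggregation can also be expressed directly in Spark SQL with `COUNT(DISTINCT ...)`. A minimal sketch, assuming the DataFrame `df` from the snippet above is registered under the hypothetical view name `records`:

    df.createOrReplaceTempView('records')

    # COUNT(DISTINCT ...) is the SQL equivalent of F.countDistinct;
    # Spark SQL also accepts multiple columns inside COUNT(DISTINCT ...)
    spark.sql("""
        SELECT COUNT(DISTINCT ID) AS DistinctCountOfID,
               COUNT(DISTINCT ID, Value) AS DistinctCountOfRow
        FROM records
    """).show()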
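For large DataFrames, an exact distinct count can be expensive because the distinct values must be shuffled. Spark also provides `approx_count_distinct` (backed by the HyperLogLog++ algorithm), which trades a small relative error for speed. A minimal sketch against the same `df`; the `rsd` value here is just an illustrative choice:

    # approximate distinct count of ID; rsd sets the maximum
    # relative standard deviation of the estimate (default 0.05)
    df.agg(F.approx_count_distinct('ID', rsd=0.01)
            .alias('ApproxDistinctCountOfID')).show()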
