PySpark DataFrame - Calculate Distinct Count of Column(s)

2022-08-19 · Kontext Code Snippets & Tips

All code examples are under MIT or Apache 2.0 license unless specified otherwise.

Code description

This code snippet provides an example of calculating the distinct count of values in a PySpark DataFrame using the countDistinct PySpark SQL function.

Output:

+---+-----+
| ID|Value|
+---+-----+
|101|   56|
|101|   67|
|102|   70|
|103|   93|
|104|   70|
+---+-----+

+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
|                4|                 5|
+-----------------+------------------+

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

app_name = "PySpark countDistinct Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [
    [101, 56],
    [101, 67],
    [102, 70],
    [103, 93],
    [104, 70]
]

df = spark.createDataFrame(data, ['ID', 'Value'])

df.show()

# A global aggregation (groupBy with no columns) collapses the DataFrame to a single row.
# countDistinct over several columns counts distinct combinations of their values,
# which for ('ID', 'Value') here equals the number of distinct rows.
df_agg = df.groupBy() \
    .agg(F.countDistinct('ID').alias('DistinctCountOfID'),
         F.countDistinct('ID', 'Value').alias('DistinctCountOfRow'))
df_agg.show()