PySpark DataFrame - Calculate Distinct Count of Column(s)

2022-08-19

Code description

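This code snippet shows how to calculate the distinct count of values in a PySpark DataFrame using the countDistinct PySpark SQL function. Passing one column counts the distinct values in that column; passing several columns counts the distinct combinations of those columns.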

Output:

+---+-----+
| ID|Value|
+---+-----+
|101|   56|
|101|   67|
|102|   70|
|103|   93|
|104|   70|
+---+-----+

+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
|                4|                 5|
+-----------------+------------------+
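To see where the 4 and the 5 in the output above come from, the same counting can be sketched in plain Python without Spark. The helper name count_distinct and its index-based arguments are hypothetical and used only for illustration; the sketch also assumes the usual SQL COUNT(DISTINCT ...) behaviour of skipping rows where any counted column is null.

```python
# Same rows as the DataFrame in this example: (ID, Value)
rows = [(101, 56), (101, 67), (102, 70), (103, 93), (104, 70)]


def count_distinct(rows, *col_indexes):
    """Count distinct combinations of the given column positions.

    Hypothetical helper for illustration only. Rows with a None in any
    of the counted columns are skipped (assumed SQL-style null handling).
    """
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in col_indexes)
        if any(value is None for value in key):
            continue
        seen.add(key)
    return len(seen)


# ID 101 appears twice, so only 4 distinct IDs remain.
distinct_ids = count_distinct(rows, 0)

# Every (ID, Value) pair is unique, so all 5 rows are counted.
distinct_rows = count_distinct(rows, 0, 1)

print(distinct_ids, distinct_rows)
```

The two 101 rows collapse into one when only ID is counted, but remain separate once Value is part of the key.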

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

app_name = "PySpark countDistinct Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [
    [101, 56],
    [101, 67],
    [102, 70],
    [103, 93],
    [104, 70]
]

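# Create a DataFrame with two columns: ID and Value
df = spark.createDataFrame(data, ['ID', 'Value'])

df.show()

# groupBy() without columns aggregates over the whole DataFrame.
# countDistinct with one column counts distinct values of that column;
# with several columns it counts distinct combinations of them.
df_agg = df.groupBy() \
    .agg(F.countDistinct('ID').alias('DistinctCountOfID'),
         F.countDistinct('ID', 'Value').alias('DistinctCountOfRow'))
df_agg.show()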
