Code description
This code snippet shows how to count distinct values in a PySpark DataFrame using the countDistinct
PySpark SQL function.
Output:
+---+-----+
| ID|Value|
+---+-----+
|101| 56|
|101| 67|
|102| 70|
|103| 93|
|104| 70|
+---+-----+
+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
| 4| 5|
+-----------------+------------------+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
app_name = "PySpark countDistinct Example"
master = "local"
spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
data = [
    [101, 56],
    [101, 67],
    [102, 70],
    [103, 93],
    [104, 70]
]
df = spark.createDataFrame(data, ['ID', 'Value'])
df.show()
# countDistinct('ID') counts unique IDs; countDistinct('ID', 'Value')
# counts unique (ID, Value) pairs, i.e. distinct rows here.
df_agg = df.groupBy().agg(
    F.countDistinct('ID').alias('DistinctCountOfID'),
    F.countDistinct('ID', 'Value').alias('DistinctCountOfRow'))
df_agg.show()
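The two aggregates behave like set sizes: countDistinct('ID') is the number of unique IDs, while countDistinct('ID', 'Value') is the number of unique (ID, Value) pairs. As a sanity check outside Spark, the same numbers can be reproduced with plain Python sets (a sketch for illustration only, not Spark code):

```python
# Same sample data as the DataFrame above
data = [
    (101, 56),
    (101, 67),
    (102, 70),
    (103, 93),
    (104, 70),
]

# Equivalent of F.countDistinct('ID'): size of the set of IDs
distinct_ids = len({row[0] for row in data})

# Equivalent of F.countDistinct('ID', 'Value'): size of the set of (ID, Value) pairs
distinct_rows = len({(row[0], row[1]) for row in data})

print(distinct_ids, distinct_rows)  # 4 5
```

This matches the Spark output: ID 101 appears twice, so there are 4 distinct IDs, while all 5 (ID, Value) pairs are unique.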