PySpark DataFrame - Calculate Distinct Count of Column(s)

2022-08-19
Kontext, Code Snippets & Tips

Code snippets and tips for various programming languages/frameworks. All code examples are under MIT or Apache 2.0 license unless specified otherwise. 

Code description

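This code snippet shows how to calculate distinct counts in a PySpark DataFrame using the countDistinct function from pyspark.sql.functions. Called with one column, it counts the distinct values of that column; called with several columns, it counts the distinct combinations of their values.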

Output:

+---+-----+
| ID|Value|
+---+-----+
|101|   56|
|101|   67|
|102|   70|
|103|   93|
|104|   70|
+---+-----+

+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
|                4|                 5|
+-----------------+------------------+
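As a sanity check on these numbers, the same two counts can be reproduced in plain Python with sets, without Spark. This is only an illustrative sketch and not part of the original snippet: the variable names are made up here, and the rows are copied from the first table above.

```python
# Rows copied from the sample DataFrame: (ID, Value)
rows = [(101, 56), (101, 67), (102, 70), (103, 93), (104, 70)]

# Distinct IDs: 101 appears twice, so only four unique values remain.
distinct_ids = len({row[0] for row in rows})

# Distinct (ID, Value) pairs: every row is unique, so all five remain.
distinct_rows = len(set(rows))

print(distinct_ids, distinct_rows)  # 4 5
```

ID 101 occurs in two rows, which is why the distinct count of ID is 4 while the distinct count of whole rows is 5.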

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

app_name = "PySpark countDistinct Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [
    [101, 56],
    [101, 67],
    [102, 70],
    [103, 93],
    [104, 70]
]

df = spark.createDataFrame(data, ['ID', 'Value'])

df.show()

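# groupBy() with no columns treats the whole DataFrame as a single group.
# countDistinct('ID') counts distinct IDs; countDistinct('ID', 'Value')
# counts distinct (ID, Value) combinations.
df_agg = df.groupBy() \
    .agg(F.countDistinct('ID').alias('DistinctCountOfID'),
         F.countDistinct('ID', 'Value').alias('DistinctCountOfRow'))
df_agg.show()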