PySpark DataFrame - Calculate Distinct Count of Column(s)
Code description
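This code snippet shows how to calculate the distinct count of values in a PySpark DataFrame using the countDistinct PySpark SQL function.
Passing a single column counts its distinct values; passing several columns counts the distinct combinations of those columns.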
Output:
+---+-----+
| ID|Value|
+---+-----+
|101|   56|
|101|   67|
|102|   70|
|103|   93|
|104|   70|
+---+-----+

+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
|                4|                 5|
+-----------------+------------------+
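The two counts in the output can be sanity-checked without a Spark session. The following is a minimal plain-Python sketch, not part of the original snippet, that reproduces them with sets; the names distinct_ids and distinct_rows are illustrative only.

```python
# Same sample rows as the DataFrame shown in the output above.
data = [
    [101, 56],
    [101, 67],
    [102, 70],
    [103, 93],
    [104, 70],
]

# Distinct values of the first column (ID): 101 appears twice but counts once.
distinct_ids = len({row[0] for row in data})

# Distinct (ID, Value) combinations; rows become tuples because lists are not hashable.
distinct_rows = len({tuple(row) for row in data})

print(distinct_ids, distinct_rows)  # prints: 4 5
```

ID 101 appears in two rows, so there are only 4 distinct IDs, while all 5 (ID, Value) pairs differ.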
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

app_name = "PySpark countDistinct Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [
    [101, 56],
    [101, 67],
    [102, 70],
    [103, 93],
    [104, 70]
]

df = spark.createDataFrame(data, ['ID', 'Value'])
df.show()

df_agg = df.groupBy() \
    .agg(F.countDistinct('ID').alias('DistinctCountOfID'),
         F.countDistinct('ID', 'Value').alias('DistinctCountOfRow'))
df_agg.show()