PySpark DataFrame - Calculate Distinct Count of Column(s)
Code description
This code snippet provides an example of calculating the distinct count of values in a PySpark DataFrame using the countDistinct PySpark SQL function.
Output:
+---+-----+
| ID|Value|
+---+-----+
|101|   56|
|101|   67|
|102|   70|
|103|   93|
|104|   70|
+---+-----+

+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
|                4|                 5|
+-----------------+------------------+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

app_name = "PySpark countDistinct Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# Sample data with duplicate IDs (101) and duplicate values (70)
data = [
    [101, 56],
    [101, 67],
    [102, 70],
    [103, 93],
    [104, 70]
]

df = spark.createDataFrame(data, ['ID', 'Value'])
df.show()

# countDistinct with one column counts distinct IDs; with multiple
# columns it counts distinct (ID, Value) combinations.
df_agg = df.groupBy() \
    .agg(F.countDistinct('ID').alias('DistinctCountOfID'),
         F.countDistinct('ID', 'Value').alias('DistinctCountOfRow'))
df_agg.show()
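
For large datasets where an exact distinct count is expensive, PySpark also offers the approx_count_distinct SQL function, which estimates the count using the HyperLogLog++ algorithm. Below is a minimal sketch reusing the df DataFrame from the snippet above; the optional rsd parameter (maximum relative standard deviation, default 0.05) trades accuracy for speed.

# Approximate distinct count of ID; rsd controls the allowed estimation error.
df_approx = df.groupBy() \
    .agg(F.approx_count_distinct('ID', rsd=0.05).alias('ApproxDistinctCountOfID'))
df_approx.show()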