PySpark DataFrame - Calculate sum and avg with groupBy


Kontext, 8/19/2022

Code description

This code snippet shows how to calculate aggregated values after grouping data in a PySpark DataFrame. To group data, use DataFrame.groupBy or its alias DataFrame.groupby; then call GroupedData.agg to compute aggregates for each group. Built-in aggregation functions such as sum, avg, max and min can be used, and custom aggregation functions are also supported. In this example groupBy() is called without any columns, so the whole DataFrame is treated as a single group and the result contains one row.

Output:

    +----------+--------+
    |TotalScore|AvgScore|
    +----------+--------+
    |       392|    78.4|
    +----------+--------+  
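
As a quick sanity check, the two values in the output can be reproduced in plain Python from the five scores used in the code snippet. This is only an illustrative sketch of the arithmetic, not part of the original example; the variable names are chosen here for readability.

```python
# Scores copied from the data list used in the code snippet.
scores = [56, 78, 70, 93, 95]

total_score = sum(scores)              # corresponds to F.sum('Score')
avg_score = total_score / len(scores)  # corresponds to F.avg('Score')

print(total_score, avg_score)
```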
    

Code snippet

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    
    app_name = "PySpark sum and avg Examples"
    master = "local"
    
    spark = (
        SparkSession.builder
        .appName(app_name)
        .master(master)
        .getOrCreate()
    )
    
    spark.sparkContext.setLogLevel("WARN")
    
    data = [
        [101, 56],
        [102, 78],
        [103, 70],
        [104, 93],
        [105, 95]
    ]
    
    df = spark.createDataFrame(data, ['Student', 'Score'])
    
    # groupBy() with no columns treats the whole DataFrame as one group
    df_agg = df.groupBy().agg(
        F.sum('Score').alias('TotalScore'),
        F.avg('Score').alias('AvgScore')
    )
    
    df_agg.show()
    
pyspark spark-sql
