PySpark DataFrame - Calculate sum and avg with groupBy

Kontext, 2022-08-19

Code description

This code snippet shows how to calculate aggregated values after grouping data in a PySpark DataFrame. Use DataFrame.groupBy (or its alias DataFrame.groupby) to group the data, then call GroupedData.agg to aggregate each group. Built-in aggregation functions such as sum, avg, max and min can be used, as can custom aggregation functions. In this snippet, groupBy() is called without any columns, so the whole DataFrame is treated as a single group and the result contains one row.

Output:

+----------+--------+
|TotalScore|AvgScore|
+----------+--------+
|       392|    78.4|
+----------+--------+

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

app_name = "PySpark sum and avg Examples"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [
    [101, 56],
    [102, 78],
    [103, 70],
    [104, 93],
    [105, 95]
]

df = spark.createDataFrame(data, ['Student', 'Score'])

# No grouping columns: aggregate over the entire DataFrame
df_agg = df.groupBy().agg(
    F.sum('Score').alias('TotalScore'),
    F.avg('Score').alias('AvgScore')
)

df_agg.show()