PySpark DataFrame - Calculate sum and avg with groupBy
Code description
This code snippet provides an example of calculating aggregated values after grouping data in a PySpark DataFrame. To group data, DataFrame.groupBy (or its alias DataFrame.groupby) can be used; the GroupedData.agg method then aggregates the data within each group. Built-in aggregation functions such as sum, avg, max, and min can be used, and customized aggregation functions can also be applied.
Output:
+----------+--------+
|TotalScore|AvgScore|
+----------+--------+
|       392|    78.4|
+----------+--------+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

app_name = "PySpark sum and avg Examples"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

data = [[101, 56],
        [102, 78],
        [103, 70],
        [104, 93],
        [105, 95]]

df = spark.createDataFrame(data, ['Student', 'Score'])

# groupBy() with no columns places all rows in a single group
df_agg = df.groupBy().agg(
    F.sum('Score').alias('TotalScore'),
    F.avg('Score').alias('AvgScore'))
df_agg.show()