Code description
This code snippet provides an example of calculating aggregated values after grouping data in a PySpark DataFrame. To group data, DataFrame.groupby or its alias DataFrame.groupBy can be used; the GroupedData.agg method then computes aggregates for each group. Built-in aggregation functions such as sum, avg, max, and min are available, and customized aggregation functions can also be used. Note that calling groupBy() with no columns, as in the snippet below, treats the whole DataFrame as a single group.
Output:
+----------+--------+
|TotalScore|AvgScore|
+----------+--------+
| 392| 78.4|
+----------+--------+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
app_name = "PySpark sum and avg Examples"
master = "local"
spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
data = [
[101, 56],
[102, 78],
[103, 70],
[104, 93],
[105, 95]
]
df = spark.createDataFrame(data, ['Student', 'Score'])
df_agg = df.groupBy().agg(
    F.sum('Score').alias('TotalScore'),
    F.avg('Score').alias('AvgScore'))
df_agg.show()