Spark SQL - Group By

event 2021-09-24 visibility 1,108 comment 0 insights
more_vert
insights Stats
Raymond Raymond Code Snippets & Tips

Code snippets and tips for various programming languages/frameworks. All code examples are under MIT or Apache 2.0 license unless specified otherwise. 

Spark provides flexible APIs to perform group by operations against a data set. You can either use Spark SQL or fluent APIs to implement it.

Spark SQL - group by

The follow code snippet shows you how to use GROUP BY directly via Spark SQL. You can run the query against Hive databases or directly in a Spark-SQL shell.

from pyspark.sql import SparkSession

appName = "PySpark GroupBy Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# GroupBY 
df = spark.sql("""SELECT ACCT, TXN_DT, SUM(AMT) AS TOTAL_AMOUNT FROM VALUES 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT)
GROUP BY ACCT, TXN_DT""")

df.show()

Result:

+----+----------+------------+
|ACCT| TXN_DT|TOTAL_AMOUNT|
+----+----------+------------+
| 102|2021-01-02| 913.10|
| 103|2021-01-02| 913.10|
| 102|2021-01-01| 93.00|
| 101|2021-01-03| 900.56|
| 101|2021-01-01| 112.02|
+----+----------+------------+

Use groupBy API

The above example can also be changed to use groupBy API directly. This is useful is you already have an dataframe and if you don't want to use Spark SQL:

# GroupBY 
df = spark.sql("""SELECT ACCT, TXN_DT, AMT FROM VALUES 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT)""")

df.groupBy("ACCT", "TXN_DT").agg(sum("AMT").alias("TOTAL_AMT")).show()

The result will be the same.

Remember to import sum function:

from pyspark.sql.functions import sum

Otherwise you may encounter the following error:

TypeError: unsupported operand type(s) for +: 'int' and 'str'
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts