Hive SQL - Analytics with GROUP BY and GROUPING SETS, Cubes, Rollups

Kontext Kontext event 2022-07-23 visibility 3,172
more_vert
Hive SQL - Analytics with GROUP BY and GROUPING SETS, Cubes, Rollups

Hive SQL provides convenient and powerful features with GROUP BY clause to do aggregations easily with GROUPING SETS, CUBE and ROLLUP. Cubes and grouping sets can be used to aggregate on different possible combination of dimensions (specified in GROUP BY) while ROLLUP can be used to aggregate at hierarchy levels of a dimension. This article shows details about how to use those features.

Sample table

In the following examples, we will query a sample table I've created in my local Hive environment. If you have no Hive environment to work with, try follow our tutorials to configure one: Hive installation.

CREATE TABLE `transactions`(
  `acct` int,
  `amount` decimal(18,2),
  `txn_date` date)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions'
TBLPROPERTIES (
  'bucketing_version'='2',
  'transient_lastDdlTime'='1657437896')

This transactions table only has a few records:

hive> select * from transactions;
OK
101     10.01   2021-01-01
101     102.01  2021-01-01
102     93.00   2021-01-01
103     913.10  2021-01-02
102     913.10  2021-01-02
101     900.56  2021-01-03

About GROUPING SETS

GROUPING SETS clause in GROUP BY can be used to specify more than one GROUP BY in the same record set. The result set will be the same as using UNION to union all records for each GROUP BY option.

The following is one sample HQL to query the transactions table using GROUPING SETS.

SELECT acct, txn_date, SUM(amount) FROM transactions GROUP BY acct, txn_date GROUPING SETS ( (acct, txn_date), acct, txn_date, ( ) );

The result set looks like the following:

NULL    NULL            2931.78
NULL    2021-01-01      205.02
NULL    2021-01-02      1826.20
NULL    2021-01-03      900.56
101     NULL            1012.58
101     2021-01-01      112.02
101     2021-01-03      900.56
102     NULL            1006.10
102     2021-01-01      93.00
102     2021-01-02      913.10
103     NULL            913.10
103     2021-01-02      913.10

The above query is the same as the following one using UNION:

SELECT acct, txn_date, SUM(amount) FROM transactions GROUP BY acct, txn_date 
UNION
SELECT acct, NULL, SUM(amount) FROM transactions GROUP BY acct
UNION
SELECT NULL, txn_date, SUM(amount) FROM transactions GROUP BY txn_date
UNION
SELECT NULL, NULL, SUM(amount) FROM transactions;

The main difference from exaction perspective is that the UNION statements will launch 7 jobs instead of one and can potentially take longer to execute.

WITH CUBE

Cubes can be used to calculate subtotal of all possible combinations of the set of column in GROUP BY.  For instance, GROUP BY acct, txn_date WITH CUBE is equivalent to GROUP BY acct, txn_date GROUPING SETS ( (acct, txn_date), acct, txn_date, ( )).

The following is one example of using CUBE with GROUP BY.

SELECT acct, txn_date, SUM(amount) FROM transactions GROUP BY acct, txn_date WITH CUBE;

The result is the same as the previous query with GROUPING SETS.

WITH ROLLUP

ROLLUP clause is used with GROUP BY to compute the aggregations at the hierarchy levels of a dimension. For example, GROUP BY a, b, c with ROLLUP assumes that the hierarchy is "a" drilling down to "b" drilling down to "c". GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )). 

The following provides one sample HQL that uses ROLLUP:

SELECT acct, txn_date, SUM(amount) FROM transactions GROUP BY acct, txn_date WITH ROLLUP;

The result looks like the following:

NULL    NULL            2931.78
101     NULL            1012.58
101     2021-01-01      112.02
101     2021-01-03      900.56
102     NULL            1006.10
102     2021-01-01      93.00
102     2021-01-02      913.10
103     NULL            913.10
103     2021-01-02      913.10

Use GROUPING__ID

As mentioned in Hive SQL - Virtual Columns, GROUPING__ID is one of the available Hive virtual columns. It can be used to identify the group of each record in the result set.

SELECT acct, txn_date, GROUPING__ID, SUM(amount) FROM transactions GROUP BY acct, txn_date WITH ROLLUP;

The above query returns on extra column. 

NULL    NULL            3       2931.78
101     NULL            1       1012.58
101     2021-01-01      0       112.02
101     2021-01-03      0       900.56
102     NULL            1       1006.10
102     2021-01-01      0       93.00
102     2021-01-02      0       913.10
103     NULL            1       913.10
103     2021-01-02      0       913.10

The grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row. The value 0 represents a column that is part of the grouping set, while the value 1 represents a column that is not part of the grouping set. 

Summary

HQL provides GROUPING SETS, CUBE, ROLLUP with GROUP BY to improve the efficiency of writing analytical queries with aggregations. 

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts