Hive SQL - Analytics with GROUP BY and GROUPING SETS, Cubes, Rollups
Hive SQL provides convenient and powerful features with GROUP BY clause to do aggregations easily with GROUPING SETS, CUBE and ROLLUP. Cubes and grouping sets can be used to aggregate on different possible combination of dimensions (specified in GROUP BY) while ROLLUP can be used to aggregate at hierarchy levels of a dimension. This article shows details about how to use those features.
Sample table
In the following examples, we will query a sample table I've created in my local Hive environment. If you have no Hive environment to work with, try follow our tutorials to configure one: Hive installation.
CREATE TABLE `transactions`( `acct` int, `amount` decimal(18,2), `txn_date` date) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions' TBLPROPERTIES ( 'bucketing_version'='2', 'transient_lastDdlTime'='1657437896')
This transactions
table only has a few records:
hive> select * from transactions; OK 101 10.01 2021-01-01 101 102.01 2021-01-01 102 93.00 2021-01-01 103 913.10 2021-01-02 102 913.10 2021-01-02 101 900.56 2021-01-03
About GROUPING SETS
GROUPING SETS clause in GROUP BY can be used to specify more than one GROUP BY in the same record set. The result set will be the same as using UNION to union all records for each GROUP BY option.
The following is one sample HQL to query the transactions table using GROUPING SETS.
SELECT acct, txn_date, SUM(amount) FROM transactions GROUP BY acct, txn_date GROUPING SETS ( (acct, txn_date), acct, txn_date, ( ) );
The result set looks like the following:
NULL NULL 2931.78 NULL 2021-01-01 205.02 NULL 2021-01-02 1826.20 NULL 2021-01-03 900.56 101 NULL 1012.58 101 2021-01-01 112.02 101 2021-01-03 900.56 102 NULL 1006.10 102 2021-01-01 93.00 102 2021-01-02 913.10 103 NULL 913.10 103 2021-01-02 913.10
The above query is the same as the following one using UNION:
SELECT acct, txn_date, SUM(amount) FROM transactions GROUP BY acct, txn_date UNION SELECT acct, NULL, SUM(amount) FROM transactions GROUP BY acct UNION SELECT NULL, txn_date, SUM(amount) FROM transactions GROUP BY txn_date UNION SELECT NULL, NULL, SUM(amount) FROM transactions;
The main difference from exaction perspective is that the UNION statements will launch 7 jobs instead of one and can potentially take longer to execute.
WITH CUBE
Cubes can be used to calculate subtotal of all possible combinations of the set of column in GROUP BY. For instance, GROUP BY acct, txn_date WITH CUBE
is equivalent to GROUP BY acct, txn_date GROUPING SETS ( (acct, txn_date), acct, txn_date, ( ))
.
The following is one example of using CUBE with GROUP BY.
SELECT acct, txn_date, SUM(amount) FROM transactions GROUP BY acct, txn_date WITH CUBE;
The result is the same as the previous query with GROUPING SETS.
WITH ROLLUP
ROLLUP clause is used with GROUP BY to compute the aggregations at the hierarchy levels of a dimension. For example, GROUP BY a, b, c with ROLLUP
assumes that the hierarchy is "a" drilling down to "b" drilling down to "c". GROUP BY a, b, c, WITH ROLLUP
is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( ))
.
The following provides one sample HQL that uses ROLLUP:
SELECT acct, txn_date, SUM(amount) FROM transactions GROUP BY acct, txn_date WITH ROLLUP;
The result looks like the following:
NULL NULL 2931.78 101 NULL 1012.58 101 2021-01-01 112.02 101 2021-01-03 900.56 102 NULL 1006.10 102 2021-01-01 93.00 102 2021-01-02 913.10 103 NULL 913.10 103 2021-01-02 913.10
Use GROUPING__ID
As mentioned in Hive SQL - Virtual Columns, GROUPING__ID is one of the available Hive virtual columns. It can be used to identify the group of each record in the result set.
SELECT acct, txn_date, GROUPING__ID, SUM(amount) FROM transactions GROUP BY acct, txn_date WITH ROLLUP;
The above query returns on extra column.
NULL NULL 3 2931.78 101 NULL 1 1012.58 101 2021-01-01 0 112.02 101 2021-01-03 0 900.56 102 NULL 1 1006.10 102 2021-01-01 0 93.00 102 2021-01-02 0 913.10 103 NULL 1 913.10 103 2021-01-02 0 913.10
The grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row. The value 0 represents a column that is part of the grouping set, while the value 1 represents a column that is not part of the grouping set.
Summary
HQL provides GROUPING SETS, CUBE, ROLLUP with GROUP BY to improve the efficiency of writing analytical queries with aggregations.