Hive SQL - Aggregate Functions Overview with Examples

Apache Hive is a data warehouse product based on Hadoop. Similar as other database engines, Hive provides a number of built-in aggregation functions for data analysis, including LEAD, LAG, FIRST_VALUE, LAST_VALUE, COUNT (w/ or wo/ DISTINCT), SUM, MIN, MAX, AVG, RANK, ROW_NUMBER, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE, and so on.

This articles show you how to use these aggregation functions in Hive.

Prerequisites

To run the SQL statements in this article, you need a Hive environment. Follow one of the following articles to configure a Hive instance on your computer if you don't have one to work with yet.

Create a sample table

Once Hive is installed, let's create a sample table using beeline. This table will be used in the following sections. Beeline is the successor of hive CLI.

Start beeline using the following command:

$HIVE_HOME/bin/beeline -u jdbc:hive2://

Create a database named hivesql:

0: jdbc:hive2://> create database hivesql;
OK
No rows affected (1.183 seconds)

Create table named transactions in this database:

0: jdbc:hive2://> create table hivesql.transactions(acct int, amount decimal(18,2), txn_date date);
OK
No rows affected (2.096 seconds)

Seed the table with some data using the following INSERT statements:

insert into hivesql.transactions(acct,amount,txn_date) 
values 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03');

Query the table and it prints out the following data:

0: jdbc:hive2://> select * from hivesql.transactions;
OK
+--------------------+----------------------+------------------------+
| transactions.acct  | transactions.amount  | transactions.txn_date  |
+--------------------+----------------------+------------------------+
| 101                | 10.01                | 2021-01-01             |
| 101                | 102.01               | 2021-01-01             |
| 102                | 93.00                | 2021-01-01             |
| 103                | 913.10               | 2021-01-02             |
| 102                | 913.10               | 2021-01-02             |
| 101                | 900.56               | 2021-01-03             |
+--------------------+----------------------+------------------------+
6 rows selected (0.38 seconds)

Now let's use this sample data to practice Hive aggregation functions. For most of the functions, we will show sample queries with and without PARTITION BY clause.

LEAD function

LEAD function provides access to a row at a given offset that follows the current row in a window. This analytic function can be used in a SELECT statement to compare values in the current row with values in a following row.

Examples

SELECT TXN.*, LEAD(amount,1) OVER (PARTITION BY acct ORDER BY txn_date) AS amt_sub 
FROM hivesql.transactions as TXN;

Result:

+-----------+-------------+---------------+----------+
| txn.acct  | txn.amount  | txn.txn_date  | amt_sub  |
+-----------+-------------+---------------+----------+
| 101       | 102.01      | 2021-01-01    | 10.01    |
| 101       | 10.01       | 2021-01-01    | 900.56   |
| 101       | 900.56      | 2021-01-03    | NULL     |
| 102       | 93.00       | 2021-01-01    | 913.10   |
| 102       | 913.10      | 2021-01-02    | NULL     |
| 103       | 913.10      | 2021-01-02    | NULL     |
+-----------+-------------+---------------+----------+

LAG function

LAG function provides access to a row at a given offset that comes before the current row in the windows. This function can be used in a SELECT statement to compare values in the current row with values in a previous row.

Examples

SELECT TXN.*, LAG(amount,1) OVER (PARTITION BY acct ORDER BY txn_date) AS amt_pre 
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+----------+
| txn.acct  | txn.amount  | txn.txn_date  | amt_pre  |
+-----------+-------------+---------------+----------+
| 101       | 102.01      | 2021-01-01    | NULL     |
| 101       | 10.01       | 2021-01-01    | 102.01   |
| 101       | 900.56      | 2021-01-03    | 10.01    |
| 102       | 93.00       | 2021-01-01    | NULL     |
| 102       | 913.10      | 2021-01-02    | 93.00    |
| 103       | 913.10      | 2021-01-02    | NULL     |
+-----------+-------------+---------------+----------+

FIRST\_VALUE and LAST\_VALUE functions

In Hive QL, function FIRST_VALUE and LAST_VALUE can be used to to find the first or the last value of given column or expression for a group of rows.

Examples

SELECT TXN.*, 
FIRST_VALUE(amount) OVER (ORDER BY txn_date), 
LAST_VALUE(amount) OVER (ORDER BY txn_date)
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+-----------------------+----------------------+
| txn.acct  | txn.amount  | txn.txn_date  | FIRST_VALUE_window_0  | LAST_VALUE_window_1  |
+-----------+-------------+---------------+-----------------------+----------------------+
| 102       | 93.00       | 2021-01-01    | 93.00                 | 10.01                |
| 101       | 102.01      | 2021-01-01    | 93.00                 | 10.01                |
| 101       | 10.01       | 2021-01-01    | 93.00                 | 10.01                |
| 102       | 913.10      | 2021-01-02    | 93.00                 | 913.10               |
| 103       | 913.10      | 2021-01-02    | 93.00                 | 913.10               |
| 101       | 900.56      | 2021-01-03    | 93.00                 | 900.56               |
+-----------+-------------+---------------+-----------------------+----------------------+

COUNT and COUNT DISTINCT function

Examples

SELECT COUNT(*) as total_records,
COUNT(distinct acct) as total_accounts
FROM hivesql.transactions as TXN;

Results:

+----------------+-----------------+
| total_records  | total_accounts  |
+----------------+-----------------+
| 6              | 3               |
+----------------+-----------------+

SUM function

SUM is used to calculate the summary of numeric values.

Examples

SELECT acct, SUM(amount) as total_amount
FROM hivesql.transactions as TXN
GROUP BY acct;

Results:

+-------+---------------+
| acct  | total_amount  |
+-------+---------------+
| 101   | 1012.58       |
| 102   | 1006.10       |
| 103   | 913.10        |
+-------+---------------+

MAX and MIN functions

Examples

SELECT acct, MAX(amount) as max_amount,
MIN(amount) as min_amount
FROM hivesql.transactions as TXN
GROUP BY acct;

Results:

+-------+-------------+-------------+
| acct  | max_amount  | min_amount  |
+-------+-------------+-------------+
| 101   | 900.56      | 10.01       |
| 102   | 913.10      | 93.00       |
| 103   | 913.10      | 913.10      |
+-------+-------------+-------------+

AVG function

AVG function is used to calculate the average of numeric values.

Examples

SELECT acct, AVG(amount) as avg_amount
FROM hivesql.transactions as TXN
GROUP BY acct;

Results:

+-------+-------------------+
| acct  |    avg_amount     |
+-------+-------------------+
| 101   | 337.526666666667  |
| 102   | 503.050000000000  |
| 103   | 913.100000000000  |
+-------+-------------------+

RANK function

RANK function calculates the rank of a value in a group of values. It returns one plus the number of rows proceeding or equals to the current row in the ordering of a partition. The returned values are not sequential.

Examples

SELECT TXN.*,
RANK() OVER (ORDER BY txn_date) as row_rank
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+-----------+
| txn.acct  | txn.amount  | txn.txn_date  | row_rank  |
+-----------+-------------+---------------+-----------+
| 102       | 93.00       | 2021-01-01    | 1         |
| 101       | 102.01      | 2021-01-01    | 1         |
| 101       | 10.01       | 2021-01-01    | 1         |
| 102       | 913.10      | 2021-01-02    | 4         |
| 103       | 913.10      | 2021-01-02    | 4         |
| 101       | 900.56      | 2021-01-03    | 6         |
+-----------+-------------+---------------+-----------+

DENSE\_RANK function

This function calculates the rank of a value in a group of values. It returns one plus the number of rows proceeding or equals to the current row in the ordering of a partition. The returned values are sequential in each window thus no gaps will be generated.

Examples

SELECT TXN.*,
DENSE_RANK() OVER (ORDER BY txn_date) as row_rank
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+-----------+
| txn.acct  | txn.amount  | txn.txn_date  | row_rank  |
+-----------+-------------+---------------+-----------+
| 102       | 93.00       | 2021-01-01    | 1         |
| 101       | 102.01      | 2021-01-01    | 1         |
| 101       | 10.01       | 2021-01-01    | 1         |
| 102       | 913.10      | 2021-01-02    | 2         |
| 103       | 913.10      | 2021-01-02    | 2         |
| 101       | 900.56      | 2021-01-03    | 3         |
+-----------+-------------+---------------+-----------+

ROW\_NUMBER function

ROW_NUMBER in Hive QL assigns a unique sequential number (starting from 1) to each record based on the ordering of rows in each window partition. It is commonly used to deduplicate data.

Examples

SELECT TXN.*,
ROW_NUMBER() OVER (ORDER BY txn_date) as row_num
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+-----------+
| txn.acct  | txn.amount  | txn.txn_date  | row_num  |
+-----------+-------------+---------------+-----------+
| 102       | 93.00       | 2021-01-01    | 1         |
| 101       | 102.01      | 2021-01-01    | 2         |
| 101       | 10.01       | 2021-01-01    | 3         |
| 102       | 913.10      | 2021-01-02    | 4         |
| 103       | 913.10      | 2021-01-02    | 5         |
| 101       | 900.56      | 2021-01-03    | 6         |
+-----------+-------------+---------------+-----------+

PERCENT\_RANK function

PERCENT_RANK in HQL returns the percentile of rows within a window partition.

Examples

SELECT TXN.*,
PERCENT_RANK() OVER (ORDER BY amount) as percentile
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+-------------+
| txn.acct  | txn.amount  | txn.txn_date  | percentile  |
+-----------+-------------+---------------+-------------+
| 101       | 10.01       | 2021-01-01    | 0.0         |
| 102       | 93.00       | 2021-01-01    | 0.2         |
| 101       | 102.01      | 2021-01-01    | 0.4         |
| 101       | 900.56      | 2021-01-03    | 0.6         |
| 102       | 913.10      | 2021-01-02    | 0.8         |
| 103       | 913.10      | 2021-01-02    | 0.8         |
+-----------+-------------+---------------+-------------+

NTILE function

NTILE function divides the rows in each window to 'n' buckets ranging from 1 to at most 'n' (n is the specified parameter).

Examples

SELECT TXN.*,
NTILE(3) OVER (PARTITION BY txn_date ORDER BY amount) as ntile
FROM hivesql.transactions as TXN;

Results:

| txn.acct  | txn.amount  | txn.txn_date  | ntile  |
+-----------+-------------+---------------+--------+
| 101       | 10.01       | 2021-01-01    | 1      |
| 102       | 93.00       | 2021-01-01    | 2      |
| 101       | 102.01      | 2021-01-01    | 3      |
| 102       | 913.10      | 2021-01-02    | 1      |
| 103       | 913.10      | 2021-01-02    | 2      |
| 101       | 900.56      | 2021-01-03    | 1      |
+-----------+-------------+---------------+--------+

CUME\_DIST function

This function calculates the cumulative distribution of a value within a group of values. In other words, CUME_DIST calculates the relative position of a specified value in a group of values. Assuming ascending ordering, the CUME_DIST of a value in row r is defined as the number of rows with values less than or equal to that value in row r, divided by the number of rows evaluated in the partition or query result set. CUME_DIST is similar to the PERCENT_RANK function.

Examples

SELECT TXN.*,
CUME_DIST() OVER (PARTITION BY txn_date ORDER BY amount) as cume_dist
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+---------------------+
| txn.acct  | txn.amount  | txn.txn_date  |      cume_dist      |
+-----------+-------------+---------------+---------------------+
| 101       | 10.01       | 2021-01-01    | 0.3333333333333333  |
| 102       | 93.00       | 2021-01-01    | 0.6666666666666666  |
| 101       | 102.01      | 2021-01-01    | 1.0                 |
| 102       | 913.10      | 2021-01-02    | 1.0                 |
| 103       | 913.10      | 2021-01-02    | 1.0                 |
| 101       | 900.56      | 2021-01-03    | 1.0                 |
+-----------+-------------+---------------+---------------------+

Summary

These functions are very similar as Spark SQL aggregation functions. The main difference is that all the above SQL statements are executed as MapReduce jobs in Hadoop via YARN. Hope you now are familiar with these Hive aggregation functions.

Prerequisites

Create a sample table

LEAD function

Examples

LAG function

Examples

FIRST\_VALUE and LAST\_VALUE functions

Examples

COUNT and COUNT DISTINCT function

Examples

SUM function

Examples

MAX and MIN functions

Examples

AVG function

Examples

RANK function

Examples

DENSE\_RANK function

Examples

ROW\_NUMBER function

Examples

PERCENT\_RANK function

Examples

NTILE function

Examples

CUME\_DIST function

Examples

Summary

In this article