Hive SQL - Aggregate Functions Overview with Examples

Kontext Kontext event 2022-07-10 visibility 4,564
more_vert
Hive SQL - Aggregate Functions Overview with Examples

Apache Hive is a data warehouse product based on Hadoop. Similar as other database engines, Hive provides a number of built-in aggregation functions for data analysis, including LEAD, LAG, FIRST_VALUE, LAST_VALUE, COUNT (w/ or wo/ DISTINCT), SUM, MIN, MAX, AVG, RANK, ROW_NUMBER, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE, and so on.

This articles show you how to use these aggregation functions in Hive.

Prerequisites

To run the SQL statements in this article, you need a Hive environment. Follow one of the following articles to configure a Hive instance on your computer if you don't have one to work with yet.

Create a sample table

Once Hive is installed, let's create a sample table using beeline. This table will be used in the following sections. Beeline is the successor of hive CLI.

Start beeline using the following command:

$HIVE_HOME/bin/beeline -u jdbc:hive2://

Create a database named hivesql:

0: jdbc:hive2://> create database hivesql;
OK
No rows affected (1.183 seconds)

Create table named transactions in this database:

0: jdbc:hive2://> create table hivesql.transactions(acct int, amount decimal(18,2), txn_date date);
OK
No rows affected (2.096 seconds)

Seed the table with some data using the following INSERT statements:

insert into hivesql.transactions(acct,amount,txn_date) 
values 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03');

Query the table and it prints out the following data:

0: jdbc:hive2://> select * from hivesql.transactions;
OK
+--------------------+----------------------+------------------------+
| transactions.acct  | transactions.amount  | transactions.txn_date  |
+--------------------+----------------------+------------------------+
| 101                | 10.01                | 2021-01-01             |
| 101                | 102.01               | 2021-01-01             |
| 102                | 93.00                | 2021-01-01             |
| 103                | 913.10               | 2021-01-02             |
| 102                | 913.10               | 2021-01-02             |
| 101                | 900.56               | 2021-01-03             |
+--------------------+----------------------+------------------------+
6 rows selected (0.38 seconds)

Now let's use this sample data to practice Hive aggregation functions. For most of the functions, we will show sample queries with and without PARTITION BY clause.

LEAD function

LEAD function provides access to a row at a given offset that follows the current row in a window. This analytic function can be used in a SELECT statement to compare values in the current row with values in a following row. 

Examples

SELECT TXN.*, LEAD(amount,1) OVER (PARTITION BY acct ORDER BY txn_date) AS amt_sub 
FROM hivesql.transactions as TXN;

Result:

+-----------+-------------+---------------+----------+
| txn.acct  | txn.amount  | txn.txn_date  | amt_sub  |
+-----------+-------------+---------------+----------+
| 101       | 102.01      | 2021-01-01    | 10.01    |
| 101       | 10.01       | 2021-01-01    | 900.56   |
| 101       | 900.56      | 2021-01-03    | NULL     |
| 102       | 93.00       | 2021-01-01    | 913.10   |
| 102       | 913.10      | 2021-01-02    | NULL     |
| 103       | 913.10      | 2021-01-02    | NULL     |
+-----------+-------------+---------------+----------+

LAG function

LAG function provides access to a row at a given offset that comes before the current row in the windows. This function can be used in a SELECT statement to compare values in the current row with values in a previous row.

Examples

SELECT TXN.*, LAG(amount,1) OVER (PARTITION BY acct ORDER BY txn_date) AS amt_pre 
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+----------+
| txn.acct  | txn.amount  | txn.txn_date  | amt_pre  |
+-----------+-------------+---------------+----------+
| 101       | 102.01      | 2021-01-01    | NULL     |
| 101       | 10.01       | 2021-01-01    | 102.01   |
| 101       | 900.56      | 2021-01-03    | 10.01    |
| 102       | 93.00       | 2021-01-01    | NULL     |
| 102       | 913.10      | 2021-01-02    | 93.00    |
| 103       | 913.10      | 2021-01-02    | NULL     |
+-----------+-------------+---------------+----------+

FIRST_VALUE and LAST_VALUE functions

In Hive QL, function FIRST_VALUE and LAST_VALUE can be used to to find the first or the last value of given column or expression for a group of rows.

Examples

SELECT TXN.*, 
FIRST_VALUE(amount) OVER (ORDER BY txn_date), 
LAST_VALUE(amount) OVER (ORDER BY txn_date)
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+-----------------------+----------------------+
| txn.acct  | txn.amount  | txn.txn_date  | FIRST_VALUE_window_0  | LAST_VALUE_window_1  |
+-----------+-------------+---------------+-----------------------+----------------------+
| 102       | 93.00       | 2021-01-01    | 93.00                 | 10.01                |
| 101       | 102.01      | 2021-01-01    | 93.00                 | 10.01                |
| 101       | 10.01       | 2021-01-01    | 93.00                 | 10.01                |
| 102       | 913.10      | 2021-01-02    | 93.00                 | 913.10               |
| 103       | 913.10      | 2021-01-02    | 93.00                 | 913.10               |
| 101       | 900.56      | 2021-01-03    | 93.00                 | 900.56               |
+-----------+-------------+---------------+-----------------------+----------------------+

COUNT and COUNT DISTINCT function

Examples

SELECT COUNT(*) as total_records,
COUNT(distinct acct) as total_accounts
FROM hivesql.transactions as TXN;

Results:

+----------------+-----------------+
| total_records  | total_accounts  |
+----------------+-----------------+
| 6              | 3               |
+----------------+-----------------+

SUM function

SUM is used to calculate the summary of numeric values. 

Examples

SELECT acct, SUM(amount) as total_amount
FROM hivesql.transactions as TXN
GROUP BY acct;

Results:

+-------+---------------+
| acct  | total_amount  |
+-------+---------------+
| 101   | 1012.58       |
| 102   | 1006.10       |
| 103   | 913.10        |
+-------+---------------+

MAX and MIN functions

Examples

SELECT acct, MAX(amount) as max_amount,
MIN(amount) as min_amount
FROM hivesql.transactions as TXN
GROUP BY acct;

Results:

+-------+-------------+-------------+
| acct  | max_amount  | min_amount  |
+-------+-------------+-------------+
| 101   | 900.56      | 10.01       |
| 102   | 913.10      | 93.00       |
| 103   | 913.10      | 913.10      |
+-------+-------------+-------------+

AVG function

AVG function is used to calculate the average of numeric values.

Examples

SELECT acct, AVG(amount) as avg_amount
FROM hivesql.transactions as TXN
GROUP BY acct;

Results:

+-------+-------------------+
| acct  |    avg_amount     |
+-------+-------------------+
| 101   | 337.526666666667  |
| 102   | 503.050000000000  |
| 103   | 913.100000000000  |
+-------+-------------------+

RANK function

RANK function calculates the rank of a value in a group of values. It returns one plus the number of rows proceeding or equals to the current row in the ordering of a partition. The returned values are not sequential.  

Examples

SELECT TXN.*,
RANK() OVER (ORDER BY txn_date) as row_rank
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+-----------+
| txn.acct  | txn.amount  | txn.txn_date  | row_rank  |
+-----------+-------------+---------------+-----------+
| 102       | 93.00       | 2021-01-01    | 1         |
| 101       | 102.01      | 2021-01-01    | 1         |
| 101       | 10.01       | 2021-01-01    | 1         |
| 102       | 913.10      | 2021-01-02    | 4         |
| 103       | 913.10      | 2021-01-02    | 4         |
| 101       | 900.56      | 2021-01-03    | 6         |
+-----------+-------------+---------------+-----------+

DENSE_RANK function

This function calculates the rank of a value in a group of values. It returns one plus the number of rows proceeding or equals to the current row in the ordering of a partition. The returned values are sequential in each window thus no gaps will be generated. 

Examples

SELECT TXN.*,
DENSE_RANK() OVER (ORDER BY txn_date) as row_rank
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+-----------+
| txn.acct  | txn.amount  | txn.txn_date  | row_rank  |
+-----------+-------------+---------------+-----------+
| 102       | 93.00       | 2021-01-01    | 1         |
| 101       | 102.01      | 2021-01-01    | 1         |
| 101       | 10.01       | 2021-01-01    | 1         |
| 102       | 913.10      | 2021-01-02    | 2         |
| 103       | 913.10      | 2021-01-02    | 2         |
| 101       | 900.56      | 2021-01-03    | 3         |
+-----------+-------------+---------------+-----------+

ROW_NUMBER function

ROW_NUMBER in Hive QL assigns a unique sequential number (starting from 1) to each record based on the ordering of rows in each window partition. It is commonly used to deduplicate data.

Examples

SELECT TXN.*,
ROW_NUMBER() OVER (ORDER BY txn_date) as row_num
FROM hivesql.transactions as TXN;
Results:
+-----------+-------------+---------------+-----------+
| txn.acct  | txn.amount  | txn.txn_date  | row_num  |
+-----------+-------------+---------------+-----------+
| 102       | 93.00       | 2021-01-01    | 1         |
| 101       | 102.01      | 2021-01-01    | 2         |
| 101       | 10.01       | 2021-01-01    | 3         |
| 102       | 913.10      | 2021-01-02    | 4         |
| 103       | 913.10      | 2021-01-02    | 5         |
| 101       | 900.56      | 2021-01-03    | 6         |
+-----------+-------------+---------------+-----------+

PERCENT_RANK function

PERCENT_RANK in HQL returns the percentile of rows within a window partition.

Examples

SELECT TXN.*,
PERCENT_RANK() OVER (ORDER BY amount) as percentile
FROM hivesql.transactions as TXN;

Results:

+-----------+-------------+---------------+-------------+
| txn.acct  | txn.amount  | txn.txn_date  | percentile  |
+-----------+-------------+---------------+-------------+
| 101       | 10.01       | 2021-01-01    | 0.0         |
| 102       | 93.00       | 2021-01-01    | 0.2         |
| 101       | 102.01      | 2021-01-01    | 0.4         |
| 101       | 900.56      | 2021-01-03    | 0.6         |
| 102       | 913.10      | 2021-01-02    | 0.8         |
| 103       | 913.10      | 2021-01-02    | 0.8         |
+-----------+-------------+---------------+-------------+

NTILE function

NTILE function divides the rows in each window to 'n' buckets ranging from 1 to at most 'n' (n is the specified parameter). 

Examples

SELECT TXN.*,
NTILE(3) OVER (PARTITION BY txn_date ORDER BY amount) as ntile
FROM hivesql.transactions as TXN;
Results:

| txn.acct  | txn.amount  | txn.txn_date  | ntile  |
+-----------+-------------+---------------+--------+
| 101       | 10.01       | 2021-01-01    | 1      |
| 102       | 93.00       | 2021-01-01    | 2      |
| 101       | 102.01      | 2021-01-01    | 3      |
| 102       | 913.10      | 2021-01-02    | 1      |
| 103       | 913.10      | 2021-01-02    | 2      |
| 101       | 900.56      | 2021-01-03    | 1      |
+-----------+-------------+---------------+--------+

CUME_DIST function

This function calculates the cumulative distribution of a value within a group of values. In other words, CUME_DIST calculates the relative position of a specified value in a group of values. Assuming ascending ordering, the CUME_DIST of a value in row r is defined as the number of rows with values less than or equal to that value in row r, divided by the number of rows evaluated in the partition or query result set. CUME_DIST is similar to the PERCENT_RANK function. 

Examples

SELECT TXN.*,
CUME_DIST() OVER (PARTITION BY txn_date ORDER BY amount) as cume_dist
FROM hivesql.transactions as TXN;
Results:

+-----------+-------------+---------------+---------------------+
| txn.acct  | txn.amount  | txn.txn_date  |      cume_dist      |
+-----------+-------------+---------------+---------------------+
| 101       | 10.01       | 2021-01-01    | 0.3333333333333333  |
| 102       | 93.00       | 2021-01-01    | 0.6666666666666666  |
| 101       | 102.01      | 2021-01-01    | 1.0                 |
| 102       | 913.10      | 2021-01-02    | 1.0                 |
| 103       | 913.10      | 2021-01-02    | 1.0                 |
| 101       | 900.56      | 2021-01-03    | 1.0                 |
+-----------+-------------+---------------+---------------------+

Summary

These functions are very similar as Spark SQL aggregation functions. The main difference is that all the above SQL statements are executed as MapReduce jobs in Hadoop via YARN. Hope you now are familiar with these Hive aggregation functions. 

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts