Hive SQL - Aggregate Functions Overview with Examples
insights Stats
Articles about Apache Hadoop, Hive and HBase installation, performance tuning and general tutorials.
*The yellow elephant logo is a registered trademark of Apache Hadoop.
- Prerequisites
- Create a sample table
- LEAD function
- Examples
- LAG function
- Examples
- FIRST_VALUE and LAST_VALUE functions
- Examples
- COUNT and COUNT DISTINCT function
- Examples
- SUM function
- Examples
- MAX and MIN functions
- Examples
- AVG function
- Examples
- RANK function
- Examples
- DENSE_RANK function
- Examples
- ROW_NUMBER function
- Examples
- PERCENT_RANK function
- Examples
- NTILE function
- Examples
- CUME_DIST function
- Examples
- Summary
Apache Hive is a data warehouse product based on Hadoop. Similar as other database engines, Hive provides a number of built-in aggregation functions for data analysis, including LEAD, LAG, FIRST_VALUE, LAST_VALUE, COUNT (w/ or wo/ DISTINCT), SUM, MIN, MAX, AVG, RANK, ROW_NUMBER, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE, and so on.
This articles show you how to use these aggregation functions in Hive.
Prerequisites
To run the SQL statements in this article, you need a Hive environment. Follow one of the following articles to configure a Hive instance on your computer if you don't have one to work with yet.
- Apache Hive 3.1.1 Installation on Windows 10 using Windows Subsystem for Linux
- Apache Hive 3.1.2 Installation on Linux Guide
- Apache Hive 3.1.2 Installation on Windows 10
Create a sample table
Once Hive is installed, let's create a sample table using beeline. This table will be used in the following sections. Beeline is the successor of hive CLI.
Start beeline using the following command:
$HIVE_HOME/bin/beeline -u jdbc:hive2://
Create a database named hivesql:
0: jdbc:hive2://> create database hivesql; OK No rows affected (1.183 seconds)
Create table named transactions in this database:
0: jdbc:hive2://> create table hivesql.transactions(acct int, amount decimal(18,2), txn_date date); OK No rows affected (2.096 seconds)
Seed the table with some data using the following INSERT statements:
insert into hivesql.transactions(acct,amount,txn_date) values (101,10.01, DATE'2021-01-01'), (101,102.01, DATE'2021-01-01'), (102,93., DATE'2021-01-01'), (103,913.1, DATE'2021-01-02'), (102,913.1, DATE'2021-01-02'), (101,900.56, DATE'2021-01-03');
Query the table and it prints out the following data:
0: jdbc:hive2://> select * from hivesql.transactions; OK +--------------------+----------------------+------------------------+ | transactions.acct | transactions.amount | transactions.txn_date | +--------------------+----------------------+------------------------+ | 101 | 10.01 | 2021-01-01 | | 101 | 102.01 | 2021-01-01 | | 102 | 93.00 | 2021-01-01 | | 103 | 913.10 | 2021-01-02 | | 102 | 913.10 | 2021-01-02 | | 101 | 900.56 | 2021-01-03 | +--------------------+----------------------+------------------------+ 6 rows selected (0.38 seconds)
Now let's use this sample data to practice Hive aggregation functions. For most of the functions, we will show sample queries with and without PARTITION BY clause.
LEAD function
LEAD function provides access to a row at a given offset that follows the current row in a window. This analytic function can be used in a SELECT statement to compare values in the current row with values in a following row.
Examples
SELECT TXN.*, LEAD(amount,1) OVER (PARTITION BY acct ORDER BY txn_date) AS amt_sub FROM hivesql.transactions as TXN;
Result:
+-----------+-------------+---------------+----------+ | txn.acct | txn.amount | txn.txn_date | amt_sub | +-----------+-------------+---------------+----------+ | 101 | 102.01 | 2021-01-01 | 10.01 | | 101 | 10.01 | 2021-01-01 | 900.56 | | 101 | 900.56 | 2021-01-03 | NULL | | 102 | 93.00 | 2021-01-01 | 913.10 | | 102 | 913.10 | 2021-01-02 | NULL | | 103 | 913.10 | 2021-01-02 | NULL | +-----------+-------------+---------------+----------+
LAG function
LAG function provides access to a row at a given offset that comes before the current row in the windows. This function can be used in a SELECT statement to compare values in the current row with values in a previous row.
Examples
SELECT TXN.*, LAG(amount,1) OVER (PARTITION BY acct ORDER BY txn_date) AS amt_pre FROM hivesql.transactions as TXN;
Results:
+-----------+-------------+---------------+----------+ | txn.acct | txn.amount | txn.txn_date | amt_pre | +-----------+-------------+---------------+----------+ | 101 | 102.01 | 2021-01-01 | NULL | | 101 | 10.01 | 2021-01-01 | 102.01 | | 101 | 900.56 | 2021-01-03 | 10.01 | | 102 | 93.00 | 2021-01-01 | NULL | | 102 | 913.10 | 2021-01-02 | 93.00 | | 103 | 913.10 | 2021-01-02 | NULL | +-----------+-------------+---------------+----------+
FIRST_VALUE and LAST_VALUE functions
In Hive QL, function FIRST_VALUE and LAST_VALUE can be used to to find the first or the last value of given column or expression for a group of rows.
Examples
SELECT TXN.*, FIRST_VALUE(amount) OVER (ORDER BY txn_date), LAST_VALUE(amount) OVER (ORDER BY txn_date) FROM hivesql.transactions as TXN;
Results:
+-----------+-------------+---------------+-----------------------+----------------------+ | txn.acct | txn.amount | txn.txn_date | FIRST_VALUE_window_0 | LAST_VALUE_window_1 | +-----------+-------------+---------------+-----------------------+----------------------+ | 102 | 93.00 | 2021-01-01 | 93.00 | 10.01 | | 101 | 102.01 | 2021-01-01 | 93.00 | 10.01 | | 101 | 10.01 | 2021-01-01 | 93.00 | 10.01 | | 102 | 913.10 | 2021-01-02 | 93.00 | 913.10 | | 103 | 913.10 | 2021-01-02 | 93.00 | 913.10 | | 101 | 900.56 | 2021-01-03 | 93.00 | 900.56 | +-----------+-------------+---------------+-----------------------+----------------------+
COUNT and COUNT DISTINCT function
Examples
SELECT COUNT(*) as total_records, COUNT(distinct acct) as total_accounts FROM hivesql.transactions as TXN;
Results:
+----------------+-----------------+ | total_records | total_accounts | +----------------+-----------------+ | 6 | 3 | +----------------+-----------------+
SUM function
SUM is used to calculate the summary of numeric values.
Examples
SELECT acct, SUM(amount) as total_amount FROM hivesql.transactions as TXN GROUP BY acct;
Results:
+-------+---------------+ | acct | total_amount | +-------+---------------+ | 101 | 1012.58 | | 102 | 1006.10 | | 103 | 913.10 | +-------+---------------+
MAX and MIN functions
Examples
SELECT acct, MAX(amount) as max_amount, MIN(amount) as min_amount FROM hivesql.transactions as TXN GROUP BY acct;
Results:
+-------+-------------+-------------+ | acct | max_amount | min_amount | +-------+-------------+-------------+ | 101 | 900.56 | 10.01 | | 102 | 913.10 | 93.00 | | 103 | 913.10 | 913.10 | +-------+-------------+-------------+
AVG function
AVG function is used to calculate the average of numeric values.
Examples
SELECT acct, AVG(amount) as avg_amount FROM hivesql.transactions as TXN GROUP BY acct;
Results:
+-------+-------------------+ | acct | avg_amount | +-------+-------------------+ | 101 | 337.526666666667 | | 102 | 503.050000000000 | | 103 | 913.100000000000 | +-------+-------------------+
RANK function
RANK function calculates the rank of a value in a group of values. It returns one plus the number of rows proceeding or equals to the current row in the ordering of a partition. The returned values are not sequential.
Examples
SELECT TXN.*, RANK() OVER (ORDER BY txn_date) as row_rank FROM hivesql.transactions as TXN;
Results:
+-----------+-------------+---------------+-----------+ | txn.acct | txn.amount | txn.txn_date | row_rank | +-----------+-------------+---------------+-----------+ | 102 | 93.00 | 2021-01-01 | 1 | | 101 | 102.01 | 2021-01-01 | 1 | | 101 | 10.01 | 2021-01-01 | 1 | | 102 | 913.10 | 2021-01-02 | 4 | | 103 | 913.10 | 2021-01-02 | 4 | | 101 | 900.56 | 2021-01-03 | 6 | +-----------+-------------+---------------+-----------+
DENSE_RANK function
This function calculates the rank of a value in a group of values. It returns one plus the number of rows proceeding or equals to the current row in the ordering of a partition. The returned values are sequential in each window thus no gaps will be generated.
Examples
SELECT TXN.*, DENSE_RANK() OVER (ORDER BY txn_date) as row_rank FROM hivesql.transactions as TXN;
Results:
+-----------+-------------+---------------+-----------+ | txn.acct | txn.amount | txn.txn_date | row_rank | +-----------+-------------+---------------+-----------+ | 102 | 93.00 | 2021-01-01 | 1 | | 101 | 102.01 | 2021-01-01 | 1 | | 101 | 10.01 | 2021-01-01 | 1 | | 102 | 913.10 | 2021-01-02 | 2 | | 103 | 913.10 | 2021-01-02 | 2 | | 101 | 900.56 | 2021-01-03 | 3 | +-----------+-------------+---------------+-----------+
ROW_NUMBER function
ROW_NUMBER in Hive QL assigns a unique sequential number (starting from 1) to each record based on the ordering of rows in each window partition. It is commonly used to deduplicate data.
Examples
SELECT TXN.*, ROW_NUMBER() OVER (ORDER BY txn_date) as row_num FROM hivesql.transactions as TXN;
+-----------+-------------+---------------+-----------+ | txn.acct | txn.amount | txn.txn_date | row_num | +-----------+-------------+---------------+-----------+ | 102 | 93.00 | 2021-01-01 | 1 | | 101 | 102.01 | 2021-01-01 | 2 | | 101 | 10.01 | 2021-01-01 | 3 | | 102 | 913.10 | 2021-01-02 | 4 | | 103 | 913.10 | 2021-01-02 | 5 | | 101 | 900.56 | 2021-01-03 | 6 | +-----------+-------------+---------------+-----------+
PERCENT_RANK function
PERCENT_RANK in HQL returns the percentile of rows within a window partition.
Examples
SELECT TXN.*, PERCENT_RANK() OVER (ORDER BY amount) as percentile FROM hivesql.transactions as TXN;
Results:
+-----------+-------------+---------------+-------------+ | txn.acct | txn.amount | txn.txn_date | percentile | +-----------+-------------+---------------+-------------+ | 101 | 10.01 | 2021-01-01 | 0.0 | | 102 | 93.00 | 2021-01-01 | 0.2 | | 101 | 102.01 | 2021-01-01 | 0.4 | | 101 | 900.56 | 2021-01-03 | 0.6 | | 102 | 913.10 | 2021-01-02 | 0.8 | | 103 | 913.10 | 2021-01-02 | 0.8 | +-----------+-------------+---------------+-------------+
NTILE function
NTILE function divides the rows in each window to 'n' buckets ranging from 1 to at most 'n' (n is the specified parameter).
Examples
SELECT TXN.*, NTILE(3) OVER (PARTITION BY txn_date ORDER BY amount) as ntile FROM hivesql.transactions as TXN;Results:
| txn.acct | txn.amount | txn.txn_date | ntile | +-----------+-------------+---------------+--------+ | 101 | 10.01 | 2021-01-01 | 1 | | 102 | 93.00 | 2021-01-01 | 2 | | 101 | 102.01 | 2021-01-01 | 3 | | 102 | 913.10 | 2021-01-02 | 1 | | 103 | 913.10 | 2021-01-02 | 2 | | 101 | 900.56 | 2021-01-03 | 1 | +-----------+-------------+---------------+--------+
CUME_DIST function
This function calculates the cumulative distribution of a value within a group of values. In other words, CUME_DIST calculates the relative position of a specified value in a group of values. Assuming ascending ordering, the CUME_DIST of a value in row r is defined as the number of rows with values less than or equal to that value in row r, divided by the number of rows evaluated in the partition or query result set. CUME_DIST is similar to the PERCENT_RANK function.
Examples
SELECT TXN.*, CUME_DIST() OVER (PARTITION BY txn_date ORDER BY amount) as cume_dist FROM hivesql.transactions as TXN;Results:
+-----------+-------------+---------------+---------------------+ | txn.acct | txn.amount | txn.txn_date | cume_dist | +-----------+-------------+---------------+---------------------+ | 101 | 10.01 | 2021-01-01 | 0.3333333333333333 | | 102 | 93.00 | 2021-01-01 | 0.6666666666666666 | | 101 | 102.01 | 2021-01-01 | 1.0 | | 102 | 913.10 | 2021-01-02 | 1.0 | | 103 | 913.10 | 2021-01-02 | 1.0 | | 101 | 900.56 | 2021-01-03 | 1.0 | +-----------+-------------+---------------+---------------------+
Summary
These functions are very similar as Spark SQL aggregation functions. The main difference is that all the above SQL statements are executed as MapReduce jobs in Hadoop via YARN. Hope you now are familiar with these Hive aggregation functions.