Hive SQL - Cluster By and Distribute By

Hive provides two clauses CLUSTER BY and DISTRIBUTE BY that are not available in most of other databases. Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY columns will be sent to the same reducer. DISTRIBUTE BY does not guarantee clustering or sorting properties on the distributed keys. CLUSTER BY is a shortcut for both DISTRIBUTE BY and SORT BY.

Syntax of CLUSTER BY and DISRIBUTE BY

For DISTRIBUTE BY, the syntax is defined as below:

DISTRIBUTE BY colName (',' colName)*

For CLUSTER BY, the syntax is very similar:

CLUSTER BY colName (',' colName)*

Examples

The following are some examples of using these two clauses. The table used was created in Hive SQL - Aggregate Functions Overview with Examples.

Use DISTRIBUTE BY

set mapreduce.job.reduces=2;
select * from hivesql.transactions distribute by acct;

Results:

+--------------------+----------------------+------------------------+
| transactions.acct  | transactions.amount  | transactions.txn_date  |
+--------------------+----------------------+------------------------+
| 102                | 913.10               | 2021-01-02             |
| 102                | 93.00                | 2021-01-01             |
| 101                | 900.56               | 2021-01-03             |
| 103                | 913.10               | 2021-01-02             |
| 101                | 102.01               | 2021-01-01             |
| 101                | 10.01                | 2021-01-01             |
+--------------------+----------------------+------------------------+

For account 101, the result is not ordered.

Use CLUSTER BY

select * from hivesql.transactions CLUSTER BY acct;

Results:

+--------------------+----------------------+------------------------+
| transactions.acct  | transactions.amount  | transactions.txn_date  |
+--------------------+----------------------+------------------------+
| 102                | 913.10               | 2021-01-02             |
| 102                | 93.00                | 2021-01-01             |
| 101                | 900.56               | 2021-01-03             |
| 101                | 102.01               | 2021-01-01             |
| 101                | 10.01                | 2021-01-01             |
| 103                | 913.10               | 2021-01-02             |
+--------------------+----------------------+------------------------+

The results for account 101 is together.

infoCluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. However it is useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries. This can potentially improve query performance.

Syntax of CLUSTER BY and DISRIBUTE BY

Examples

Use DISTRIBUTE BY

Use CLUSTER BY

In this article