Hive SQL - Cluster By and Distribute By
Hive provides two clauses CLUSTER BY and DISTRIBUTE BY that are not available in most of other databases. Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY columns will be sent to the same reducer. DISTRIBUTE BY does not guarantee clustering or sorting properties on the distributed keys. CLUSTER BY is a shortcut for both DISTRIBUTE BY and SORT BY.
Syntax of CLUSTER BY and DISRIBUTE BY
For DISTRIBUTE BY, the syntax is defined as below:
DISTRIBUTE BY colName (',' colName)*
For CLUSTER BY, the syntax is very similar:
CLUSTER BY colName (',' colName)*
Examples
The following are some examples of using these two clauses. The table used was created in Hive SQL - Aggregate Functions Overview with Examples.
Use DISTRIBUTE BY
set mapreduce.job.reduces=2; select * from hivesql.transactions distribute by acct;
Results:
+--------------------+----------------------+------------------------+ | transactions.acct | transactions.amount | transactions.txn_date | +--------------------+----------------------+------------------------+ | 102 | 913.10 | 2021-01-02 | | 102 | 93.00 | 2021-01-01 | | 101 | 900.56 | 2021-01-03 | | 103 | 913.10 | 2021-01-02 | | 101 | 102.01 | 2021-01-01 | | 101 | 10.01 | 2021-01-01 | +--------------------+----------------------+------------------------+
For account 101, the result is not ordered.
Use CLUSTER BY
select * from hivesql.transactions CLUSTER BY acct;
Results:
+--------------------+----------------------+------------------------+ | transactions.acct | transactions.amount | transactions.txn_date | +--------------------+----------------------+------------------------+ | 102 | 913.10 | 2021-01-02 | | 102 | 93.00 | 2021-01-01 | | 101 | 900.56 | 2021-01-03 | | 101 | 102.01 | 2021-01-01 | | 101 | 10.01 | 2021-01-01 | | 103 | 913.10 | 2021-01-02 | +--------------------+----------------------+------------------------+
The results for account 101 is together.