Hive SQL - Cluster By and Distribute By

event 2022-07-10 visibility 1,516 comment 0 insights
more_vert
insights Stats
Kontext Kontext Hadoop, Hive & HBase

Articles about Apache Hadoop, Hive and HBase installation, performance tuning and general tutorials.

*The yellow elephant logo is a registered trademark of Apache Hadoop.

Hive provides two clauses CLUSTER BY and DISTRIBUTE BY that are not available in most of other databases. Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY columns will be sent to the same reducer. DISTRIBUTE BY does not guarantee clustering or sorting properties on the distributed keys. CLUSTER BY is a shortcut for both DISTRIBUTE BY and SORT BY.

Syntax of CLUSTER BY and DISRIBUTE BY

For DISTRIBUTE BY, the syntax is defined as below:

DISTRIBUTE BY colName (',' colName)*

For CLUSTER BY, the syntax is very similar:

CLUSTER BY colName (',' colName)*

Examples

The following are some examples of using these two clauses. The table used was created in Hive SQL - Aggregate Functions Overview with Examples.

Use DISTRIBUTE BY

set mapreduce.job.reduces=2;
select * from hivesql.transactions distribute by acct;

Results:

+--------------------+----------------------+------------------------+
| transactions.acct  | transactions.amount  | transactions.txn_date  |
+--------------------+----------------------+------------------------+
| 102                | 913.10               | 2021-01-02             |
| 102                | 93.00                | 2021-01-01             |
| 101                | 900.56               | 2021-01-03             |
| 103                | 913.10               | 2021-01-02             |
| 101                | 102.01               | 2021-01-01             |
| 101                | 10.01                | 2021-01-01             |
+--------------------+----------------------+------------------------+

For account 101, the result is not ordered. 

Use CLUSTER BY

select * from hivesql.transactions CLUSTER BY acct;

Results:

+--------------------+----------------------+------------------------+
| transactions.acct  | transactions.amount  | transactions.txn_date  |
+--------------------+----------------------+------------------------+
| 102                | 913.10               | 2021-01-02             |
| 102                | 93.00                | 2021-01-01             |
| 101                | 900.56               | 2021-01-03             |
| 101                | 102.01               | 2021-01-01             |
| 101                | 10.01                | 2021-01-01             |
| 103                | 913.10               | 2021-01-02             |
+--------------------+----------------------+------------------------+

The results for account 101 is together.

infoCluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. However it is useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries. This can potentially improve query performance.
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts