Hive SQL - Differences between Order By and Sort By
Hive provides two clause ORDER BY and SORT BY to sort results a query. However they are slightly different from each other.
Syntax of ORDER BY and SORT BY
For ORDER BY, the syntax is defined as below:
colOrder: ( ASC | DESC )
colNullOrder: (NULLS FIRST | NULLS LAST)
orderBy: ORDER BY colName colOrder? colNullOrder? (',' colName colOrder? colNullOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy
For SORT BY, the syntax is very similar:
colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy
From syntax perspective, SORT BY doesn't support NULL ORDER sub clause. As other database engines, NULL cannot be compared with other non-null values, hence HQL also provides the NULL order clause (NULLS FIRST or NULLS LAST, available from Hive 2.1.0) to decide the sequence of NULL values.
Differences
Hive sorts the rows using columns specified in SORT BY before sending the data to reducer. Thus SORT BY only guarantees the order per reducer while ORDER BY will decide the total order of records in the final output.
Examples
The following are some examples of using these two clauses. The table used was created in Hive SQL - Aggregate Functions Overview with Examples.
Use ORDER BY
select * from hivesql.transactions ORDER BY acct ASC, txn_date;
Results:
+--------------------+----------------------+------------------------+ | transactions.acct | transactions.amount | transactions.txn_date | +--------------------+----------------------+------------------------+ | 101 | 102.01 | 2021-01-01 | | 101 | 10.01 | 2021-01-01 | | 101 | 900.56 | 2021-01-03 | | 102 | 93.00 | 2021-01-01 | | 102 | 913.10 | 2021-01-02 | | 103 | 913.10 | 2021-01-02 | +--------------------+----------------------+------------------------+
Use SORT BY
select * from hivesql.transactions SORT BY acct ASC, txn_date;
Results:
+--------------------+----------------------+------------------------+ | transactions.acct | transactions.amount | transactions.txn_date | +--------------------+----------------------+------------------------+ | 101 | 102.01 | 2021-01-01 | | 101 | 10.01 | 2021-01-01 | | 101 | 900.56 | 2021-01-03 | | 102 | 93.00 | 2021-01-01 | | 102 | 913.10 | 2021-01-02 | | 103 | 913.10 | 2021-01-02 | +--------------------+----------------------+------------------------+
The above result is exactly the same as ORDER BY as there is only on reducer.
Let's increase reducer to 2 and then run the script again:
set mapreduce.job.reduces=2; select * from hivesql.transactions SORT BY acct ASC, txn_date;
Results:
+--------------------+----------------------+------------------------+ | transactions.acct | transactions.amount | transactions.txn_date | +--------------------+----------------------+------------------------+ | 101 | 10.01 | 2021-01-01 | | 102 | 93.00 | 2021-01-01 | | 102 | 913.10 | 2021-01-02 | | 103 | 913.10 | 2021-01-02 | | 101 | 102.01 | 2021-01-01 | | 101 | 900.56 | 2021-01-03 | +--------------------+----------------------+------------------------+
The results are now in different sequence as SORT BY will only guarantee orders of records in each reducer. However, the result will stay as the same if using ORDER BY.