Hive SQL - Data Sampling using TABLESAMPLE
insights Stats
Code snippets and tips for various programming languages/frameworks. All code examples are under MIT or Apache 2.0 license unless specified otherwise.
Hive QL provides a TABLESAMPLE clause to sample data. It provides very flex sampling approaches, for example, return certain number of records, certain percentage of records, sample by total length of data to read, sample by buckets, etc.
The following sections provide code snippets of each of the method. The input table in the examples are very small thus the sampling result is even smaller. The purpose is to demonstrate the usage of these different clauses.
TABLESAMPLE (n ROWS)
This clause can be use to sample certain number of records.
hive> select * from transactions TABLESAMPLE(3 ROWS); OK 101 10.01 2021-01-01 101 102.01 2021-01-01 102 93.00 2021-01-01
TABLESAMPLE (ByteLengthLiteral)
For ByteLengthLiteral
, it is provided using format: (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G'). For example, 100b means an input size of 100 bytes or more will be used to do sampling.
hive> select * from transactions TABLESAMPLE(10b); OK 101 10.01 2021-01-01
Only one record is returned as the input size is only 10 bytes.
TABLESAMPLE (n PERCENT)
This clause is used to sample at least n% of data size. If the sample data size is lower than the cluster block size, the sampling will be done with one block size.
hive> select * from transactions TABLESAMPLE(30 percent); OK 101 10.01 2021-01-01 101 102.01 2021-01-01
TABLESAMPLE (BUCKET x OUT OF y [ON colname])
We can also use buckets to sample data. colname
can be one of the non-partition columns in the table or rand()
indicating sampling on the entire row. The rows of the table are 'bucketed' on the colname
randomly into y buckets numbered 1 through y
. Rows in bucket x
are returned.
hive> select * from transactions TABLESAMPLE(bucket 2 out of 5 on acct); OK 101 10.01 2021-01-01 101 102.01 2021-01-01 102 93.00 2021-01-01 102 913.10 2021-01-02 101 900.56 2021-01-03
The above query returns the rows that belong to bucket 2. When we use the rand() function, the second bucket will only include one record (as the whole table only includes 6 records).
hive> select * from transactions TABLESAMPLE(bucket 2 out of 5 on rand()); OK 103 913.10 2021-01-02