Hive SQL - Data Sampling using TABLESAMPLE

Kontext Kontext event 2022-07-23 visibility 1,680
more_vert

Hive QL provides a TABLESAMPLE clause to sample data. It provides very flex sampling approaches, for example, return certain number of records, certain percentage of records, sample by total length of data to read, sample by buckets, etc.

The following sections provide code snippets of each of the method. The input table in the examples are very small thus the sampling result is even smaller. The purpose is to demonstrate the usage of these different clauses. 

TABLESAMPLE (n ROWS)

This clause can be use to sample certain number of records.

hive> select * from transactions TABLESAMPLE(3 ROWS);
OK
101     10.01   2021-01-01
101     102.01  2021-01-01
102     93.00   2021-01-01

TABLESAMPLE (ByteLengthLiteral)

For ByteLengthLiteral, it is provided using format: (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G'). For example, 100b means an input size of 100 bytes or more will be used to do sampling.

hive> select * from transactions TABLESAMPLE(10b);
OK
101     10.01   2021-01-01

Only one record is returned as the input size is only 10 bytes.

TABLESAMPLE (n PERCENT)

This clause is used to sample at least n% of data size. If the sample data size is lower than the cluster block size, the sampling will be done with one block size.

hive> select * from transactions TABLESAMPLE(30 percent);
OK
101     10.01   2021-01-01
101     102.01  2021-01-01

TABLESAMPLE (BUCKET x OUT OF y [ON colname])

We can also use buckets to sample data. colname can be one of the non-partition columns in the table or rand() indicating sampling on the entire row. The rows of the table are 'bucketed' on the colname randomly into y buckets numbered 1 through y. Rows in bucket x are returned.

hive> select * from transactions TABLESAMPLE(bucket 2 out of 5 on acct);
OK
101     10.01   2021-01-01
101     102.01  2021-01-01
102     93.00   2021-01-01
102     913.10  2021-01-02
101     900.56  2021-01-03

The above query returns the rows that belong to bucket 2. When we use the rand() function, the second bucket will only include one record (as the whole table only includes 6 records).

hive> select * from transactions TABLESAMPLE(bucket 2 out of 5 on rand());
OK
103     913.10  2021-01-02
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts