Random Sampling in Google BigQuery

visibility 471 access_time 2 years ago languageEnglish

In databases like Teradata, SAMPLE clause can be used to sample data from a large data table. Google BigQuery SQL didn't support SAMPLE clause at the very beginning however now TABLESAMPLE clause is added to support querying random sample subset of a large data set.

Use RAND function

Before TABLESAMPLE is added, RAND function is used as an alternative to retrieve random sample subset. The querying cost is big as the whole table will be scanned to generate one random number for each record. 

Example

SELECT col1, col2, RAND(5) AS rnd FROM `myproj.mydataset.mytable` ORDER BY rnd LIMIT 10;

The above query returns 10 random records.

Use TABLESAMPLE clause

TABLESAMPLE clause is used to sampling a subset from a large data set. 

Example

SELECT * FROM `myproj.mydataset.mytable` TABLESAMPLE SYSTEM (5 PERCENT)

The above example samples 5 percent of the data available in the specified table.  The clause doesn't support specifying records volume. For more information about table sampling, refer to Table sampling  |  BigQuery  |  Google Cloud.

info Last modified by Raymond 2 years ago copyright This page is subject to Site terms.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts

timeline Stats
Page index 1.28
More from Kontext
BigQuery - Convert Bytes to BASE64 or HEX String
visibility 595
thumb_up 0
access_time 9 months ago
BigQuery - Convert Bytes to BASE64 or HEX String
BigQuery - Generate Unique Values
visibility 426
thumb_up 0
access_time 2 years ago
Load JSON File into BigQuery
visibility 2,744
thumb_up 0
access_time 2 years ago
Load JSON File into BigQuery