Spark SQL - window Function

Spark SQL has built-in function `window `to bucketize rows into one or more time windows given a timestamp specifying column. The syntax of the function looks like the following: > window(timeColumn: ColumnOrName, windowDuration: str, slideDuration: Optional[str] = None, startTime: Optional[str] = None) This function is available from Spark 2.0.0. `slideDuration `must be less than or equal to `windowDuration`. \*These SQL statements can be directly used in PySpark DataFrame APIs too via `spark.sql` function. This code snippet prints out the following outputs: Query 1: ``` 2022-08-01 12:01:00 {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00} 2022-08-01 12:15:00 {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00} 2022-08-01 12:31:01 {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00} ``` The first two rows are in the same window [00:00, 00:30). Query 2: ``` 2022-08-01 12:01:00     {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}2022-08-01 12:01:00     {"start":2022-08-01 11:45:00,"end":2022-08-01 12:15:00}2022-08-01 12:15:00     {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}2022-08-01 12:15:00     {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}2022-08-01 12:31:01     {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}2022-08-01 12:31:01     {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00} ```

Kontext Kontext 0 425 0.40 index 8/16/2022

Code description

Spark SQL has built-in function window to bucketize rows into one or more time windows given a timestamp specifying column. The syntax of the function looks like the following:

window(timeColumn: ColumnOrName, windowDuration: str, slideDuration: Optional[str] = None, startTime: Optional[str] = None) 

This function is available from Spark 2.0.0. slideDuration must be less than or equal to windowDuration.

*These SQL statements can be directly used in PySpark DataFrame APIs too via spark.sql function.

This code snippet prints out the following outputs:

Query 1:

    2022-08-01 12:01:00     {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
    2022-08-01 12:15:00     {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
    2022-08-01 12:31:01     {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}

The first two rows are in the same window [00:00, 00:30).

Query 2:

    2022-08-01 12:01:00     {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}  
    2022-08-01 12:01:00     {"start":2022-08-01 11:45:00,"end":2022-08-01 12:15:00}  
    2022-08-01 12:15:00     {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}  
    2022-08-01 12:15:00     {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}  
    2022-08-01 12:31:01     {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}  
    2022-08-01 12:31:01     {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}

Code snippet

    with t as (
    select timestamp('2022-08-01 12:01:00') as ts 
    UNION ALL select timestamp('2022-08-01 12:15:00') 
    UNION ALL select timestamp('2022-08-01 12:31:01')
    )
    select t.ts, window(t.ts, '30 minutes') from t;
    
    with t as (
    select timestamp('2022-08-01 12:01:00') as ts 
    UNION ALL select timestamp('2022-08-01 12:15:00') 
    UNION ALL select timestamp('2022-08-01 12:31:01')
    )
    select t.ts, window(t.ts, '30 minutes', '15 minutes') from t;
spark-sql spark-sql-function streaming-analytics

Join the Discussion

View or add your thoughts below

Comments