Spark SQL - LAG Window Function


About LAG function

The Spark LAG function provides access to a row at a given offset before the current row in the window. It can be used in a SELECT statement to compare values in the current row with values in a previous row.

Function signature

lag(input[, offset[, default]]) OVER ([PARTITION BY ...] ORDER BY ...)
  • offset: the number of rows before the current row to read from; defaults to 1.
  • default: the value returned when the offset row does not exist; defaults to null.

If the value of `input` at the `offset`th preceding row is null, null is returned. If there is no such row (e.g., when the offset is 1, the first row of the window has no previous row), `default` is returned.
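To illustrate the `offset` and `default` parameters, the following sketch (a hypothetical query, not from the original article) reads the value two rows back and returns 0 where no such row exists:

SELECT V.N, LAG(N, 2, 0) OVER (ORDER BY N) AS N_PREV2
FROM VALUES (1), (2), (3), (4) AS V(N);

The first two rows have no row two positions back, so the default 0 is returned for them instead of null.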

Example

The following sample SQL uses the LAG function to find the previous transaction's amount (ordered by transaction date) for each account. No default value is specified.

SELECT TXN.*,
       LAG(AMT, 1) OVER (PARTITION BY ACCT ORDER BY TXN_DT) AS AMT_PREV
FROM VALUES
  (101, 10.01, DATE'2021-01-01'),
  (101, 102.01, DATE'2021-01-01'),
  (102, 93., DATE'2021-01-01'),
  (103, 913.1, DATE'2021-01-02'),
  (102, 913.1, DATE'2021-01-02'),
  (101, 900.56, DATE'2021-01-03')
AS TXN(ACCT, AMT, TXN_DT);

Result:

ACCT    AMT     TXN_DT  AMT_PREV
101     10.01   2021-01-01      NULL
101     102.01  2021-01-01      10.01
101     900.56  2021-01-03      102.01
103     913.10  2021-01-02      NULL
102     93.00   2021-01-01      NULL
102     913.10  2021-01-02      93.00
Note: PARTITION BY is not mandatory; if it is not specified, all records are moved into one single partition, which can cause performance issues.
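To see the difference, here is a sketch of the same query with the PARTITION BY clause removed (assuming the same inline VALUES data as above):

SELECT TXN.*,
       LAG(AMT) OVER (ORDER BY TXN_DT) AS AMT_PREV
FROM VALUES
  (101, 10.01, DATE'2021-01-01'),
  (101, 102.01, DATE'2021-01-01'),
  (102, 93., DATE'2021-01-01'),
  (103, 913.1, DATE'2021-01-02'),
  (102, 913.1, DATE'2021-01-02'),
  (101, 900.56, DATE'2021-01-03')
AS TXN(ACCT, AMT, TXN_DT);

All six rows now fall into a single window partition ordered by TXN_DT, so AMT_PREV crosses account boundaries, and on a large table this single partition is processed by one task.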

Example table

The inline table (virtual table/data frame) used above is constructed with literals, as described in SQL - Construct Table using Literals.
