About ROW_NUMBER function
ROW_NUMBER in Spark assigns a unique sequential number (starting from 1) to each record based on the ordering of rows in each window partition. It is commonly used to deduplicate data.
ROW_NUMBER without partition
The following sample SQL uses ROW_NUMBER function without PARTITION BY clause:
SELECT TXN.*, ROW_NUMBER() OVER (ORDER BY TXN_DT) AS ROWNUM FROM VALUES (101,10.01, DATE'2021-01-01'), (101,102.01, DATE'2021-01-01'), (102,93., DATE'2021-01-01'), (103,913.1, DATE'2021-01-02'), (101,900.56, DATE'2021-01-03') AS TXN(ACCT,AMT, TXN_DT);
Result:
ACCT AMT TXN_DT ROWNUM 101 10.01 2021-01-01 1 101 102.01 2021-01-01 2 102 93.00 2021-01-01 3 103 913.10 2021-01-02 4 101 900.56 2021-01-03 5
Each record has a unique number starting from 1.
ROW_NUMBER with partition
The following sample SQL returns a unique number for only records in each window (defined by PARTITION BY):
SELECT TXN.*, ROW_NUMBER() OVER (PARTITION BY ACCT ORDER BY TXN_DT) AS ROWNUM FROM VALUES (101,10.01, DATE'2021-01-01'), (101,102.01, DATE'2021-01-01'), (102,93., DATE'2021-01-01'), (103,913.1, DATE'2021-01-02'), (101,900.56, DATE'2021-01-03') AS TXN(ACCT,AMT, TXN_DT);
Result:
ACCT AMT TXN_DT ROWNUM 101 10.01 2021-01-01 1 101 102.01 2021-01-01 2 101 900.56 2021-01-03 3 103 913.10 2021-01-02 1 102 93.00 2021-01-01 1
Records are allocated to windows based on account number.
Example table
The virtual table/data frame is cited from SQL - Construct Table using Literals.