Spark SQL - Standard Deviation Calculation

Raymond Raymond event 2021-01-10 visibility 6,464
more_vert

In Spark SQL, function std or stddev or  stddev_sample can be used to calculate sample standard deviation from values of a group. 

Function signature

std(expr)
stddev(expr)
stddev_samp(expr)

The first two functions are the alias of stddev_sample function.

Code snippets

SELECT ACCT, std(AMT) FROM VALUES 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT)
group by ACCT;

Output:

ACCT    std(CAST(AMT AS DOUBLE))
101     489.76616954760493
103     NaN
102     579.8982712510876

Another example:

SELECT std(AMT) FROM VALUES 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT);	

Output:

std(CAST(AMT AS DOUBLE))
461.543305097149

Population standard deviation stddev_pop

Compared with sample standard deviation, there is a population standard deviation named stddev_pop.

Example:

SELECT stddev_pop(AMT) FROM VALUES 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT);

Output:

stddev_pop(CAST(AMT AS DOUBLE))
421.3294657786627

The results are different because sample and population standard deviation have different formula definitions. Find the implementation difference here: spark/CentralMomentAgg.scala at master ยท apache/spark (github.com).

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts