Spark SQL - Standard Deviation Calculation
In Spark SQL, function std or stddev or stddev_sample can be used to calculate sample standard deviation from values of a group.
Function signature
std(expr) stddev(expr) stddev_samp(expr)
The first two functions are the alias of stddev_sample function.
Code snippets
SELECT ACCT, std(AMT) FROM VALUES (101,10.01, DATE'2021-01-01'), (101,102.01, DATE'2021-01-01'), (102,93., DATE'2021-01-01'), (103,913.1, DATE'2021-01-02'), (102,913.1, DATE'2021-01-02'), (101,900.56, DATE'2021-01-03') AS TXN(ACCT,AMT, TXN_DT) group by ACCT;
Output:
ACCT std(CAST(AMT AS DOUBLE)) 101 489.76616954760493 103 NaN 102 579.8982712510876
Another example:
SELECT std(AMT) FROM VALUES (101,10.01, DATE'2021-01-01'), (101,102.01, DATE'2021-01-01'), (102,93., DATE'2021-01-01'), (103,913.1, DATE'2021-01-02'), (102,913.1, DATE'2021-01-02'), (101,900.56, DATE'2021-01-03') AS TXN(ACCT,AMT, TXN_DT);
Output:
std(CAST(AMT AS DOUBLE)) 461.543305097149
Population standard deviation stddev_pop
Compared with sample standard deviation, there is a population standard deviation named stddev_pop.
Example:
SELECT stddev_pop(AMT) FROM VALUES (101,10.01, DATE'2021-01-01'), (101,102.01, DATE'2021-01-01'), (102,93., DATE'2021-01-01'), (103,913.1, DATE'2021-01-02'), (102,913.1, DATE'2021-01-02'), (101,900.56, DATE'2021-01-03') AS TXN(ACCT,AMT, TXN_DT);
Output:
stddev_pop(CAST(AMT AS DOUBLE)) 421.3294657786627
The results are different because sample and population standard deviation have different formula definitions. Find the implementation difference here: spark/CentralMomentAgg.scala at master ยท apache/spark (github.com).
info Last modified by Raymond 4 years ago
copyright
This page is subject to Site terms.
comment Comments
No comments yet.