In Spark SQL, function stdor**** stddevor ****stddev_sample can be used to calculate sample standard deviation from values of a group.
Function signature
std(expr)
stddev(expr)
stddev_samp(expr)
The first two functions are the alias of stddev_samplefunction.
Code snippets
SELECT ACCT, std(AMT) FROM VALUES
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT)
group by ACCT;
Output:
ACCT std(CAST(AMT AS DOUBLE))
101 489.76616954760493
103 NaN
102 579.8982712510876
Another example:
SELECT std(AMT) FROM VALUES
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT);
Output:
std(CAST(AMT AS DOUBLE))
461.543305097149
Population standard deviation stddev\_pop
Compared with sample standard deviation, there is a population standard deviation named stddev_pop.
Example:
SELECT stddev_pop(AMT) FROM VALUES
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT);
Output:
stddev_pop(CAST(AMT AS DOUBLE))
421.3294657786627
The results are different because sample and population standard deviation have different formula definitions. Find the implementation difference here: spark/CentralMomentAgg.scala at master · apache/spark (github.com).