Spark SQL - Standard Deviation Calculation

Raymond Tang | 1/10/2021

In Spark SQL, the functions std, stddev or stddev_samp can be used to calculate the sample standard deviation of the values in a group.

Function signature

std(expr)
stddev(expr)
stddev_samp(expr)

The first two functions are aliases of the stddev_samp function.
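Because they share the same implementation, all three functions should return an identical result for the same input. A minimal sketch (the table name T, column name V and literal values are made up for illustration):

SELECT std(V), stddev(V), stddev_samp(V)
FROM VALUES (1.0), (2.0), (3.0) AS T(V);

All three columns should show 1.0, the sample standard deviation of 1, 2 and 3.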

Code snippets

SELECT ACCT, std(AMT) FROM VALUES 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT)
GROUP BY ACCT;

Output:

ACCT    std(CAST(AMT AS DOUBLE))
101     489.76616954760493
103     NaN
102     579.8982712510876

The result for account 103 is NaN because that group contains only one row; the sample standard deviation divides by n - 1, which is zero for a single value.

Another example, aggregating all rows without GROUP BY:

SELECT std(AMT) FROM VALUES 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT);	

Output:

std(CAST(AMT AS DOUBLE))
461.543305097149

Population standard deviation stddev_pop

In addition to the sample standard deviation functions, Spark SQL provides a population standard deviation function named stddev_pop.

Example:

SELECT stddev_pop(AMT) FROM VALUES 
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT);

Output:

stddev_pop(CAST(AMT AS DOUBLE))
421.3294657786627

The results are different because sample and population standard deviation have different formula definitions: the sample version divides the sum of squared deviations by n - 1, while the population version divides by n. Find the implementation difference here: spark/CentralMomentAgg.scala at master · apache/spark (github.com).
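For reference, these are the standard textbook definitions (not Spark-specific):

$$
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}
$$

The two results above are therefore related by $\sigma = s\sqrt{(n-1)/n}$: with the six sample rows, 461.543305 × √(5/6) ≈ 421.329466, which matches the stddev_pop output.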
