Spark SQL - Standard Deviation Calculation


In Spark SQL, the functions std, stddev and stddev_samp can be used to calculate the sample standard deviation of the values in a group.

Function signatures

std(expr)
stddev(expr)
stddev_samp(expr)

The first two functions are aliases of the stddev_samp function.
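
As a quick check that the three names are interchangeable, the following minimal snippet (with made-up values 1.0, 2.0 and 3.0) calls all of them on the same column; each should return the same sample standard deviation, 1.0:

SELECT std(col), stddev(col), stddev_samp(col)
FROM VALUES (1.0), (2.0), (3.0) AS tab(col);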

Code snippets

SELECT ACCT, std(AMT) FROM VALUES 
(101, 10.01, DATE'2021-01-01'),
(101, 102.01, DATE'2021-01-01'),
(102, 93.0, DATE'2021-01-01'),
(103, 913.1, DATE'2021-01-02'),
(102, 913.1, DATE'2021-01-02'),
(101, 900.56, DATE'2021-01-03')
AS TXN(ACCT, AMT, TXN_DT)
GROUP BY ACCT;

Output:

ACCT    std(CAST(AMT AS DOUBLE))
101     489.76616954760493
103     NaN
102     579.8982712510876

ACCT 103 has only one row, and the sample standard deviation divides by n - 1 = 0, so the result is undefined; this Spark version returns NaN for that group (newer Spark versions return NULL instead by default).

Another example:

SELECT std(AMT) FROM VALUES 
(101, 10.01, DATE'2021-01-01'),
(101, 102.01, DATE'2021-01-01'),
(102, 93.0, DATE'2021-01-01'),
(103, 913.1, DATE'2021-01-02'),
(102, 913.1, DATE'2021-01-02'),
(101, 900.56, DATE'2021-01-03')
AS TXN(ACCT, AMT, TXN_DT);

Output:

std(CAST(AMT AS DOUBLE))
461.543305097149

Population standard deviation stddev_pop

In addition to the sample standard deviation functions above, Spark SQL provides stddev_pop to calculate the population standard deviation.

Example:

SELECT stddev_pop(AMT) FROM VALUES 
(101, 10.01, DATE'2021-01-01'),
(101, 102.01, DATE'2021-01-01'),
(102, 93.0, DATE'2021-01-01'),
(103, 913.1, DATE'2021-01-02'),
(102, 913.1, DATE'2021-01-02'),
(101, 900.56, DATE'2021-01-03')
AS TXN(ACCT, AMT, TXN_DT);

Output:

stddev_pop(CAST(AMT AS DOUBLE))
421.3294657786627
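
Unlike stddev_samp, stddev_pop is well defined for a single value because it divides by n rather than n - 1. As a quick sketch (reusing the same rows, full output not shown), grouping by ACCT should give 0.0 for ACCT 103 instead of NaN:

SELECT ACCT, stddev_pop(AMT) FROM VALUES 
(101, 10.01, DATE'2021-01-01'),
(101, 102.01, DATE'2021-01-01'),
(102, 93.0, DATE'2021-01-01'),
(103, 913.1, DATE'2021-01-02'),
(102, 913.1, DATE'2021-01-02'),
(101, 900.56, DATE'2021-01-03')
AS TXN(ACCT, AMT, TXN_DT)
GROUP BY ACCT;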

The results are different because the sample and population standard deviations have different formula definitions. The implementation difference can be found in spark/CentralMomentAgg.scala at master · apache/spark (github.com).
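
Concretely, for n values x_1, ..., x_n with mean \bar{x}:

\[
\operatorname{stddev\_samp} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
\qquad
\operatorname{stddev\_pop} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}
\]

It follows that stddev_pop = stddev_samp * sqrt((n - 1) / n). As a sanity check against the outputs above (6 rows, so n = 6): 461.543305097149 * sqrt(5/6) ≈ 421.3294657786627.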
