# Spark SQL - Standard Deviation Calculation

visibility 2,411 access_time 2 years ago

In Spark SQL, function std or stddev or  stddev_sample can be used to calculate sample standard deviation from values of a group.

### Function signature

```std(expr)
stddev(expr)
stddev_samp(expr)
```

The first two functions are the alias of stddev_sample function.

### Code snippets

```SELECT ACCT, std(AMT) FROM VALUES
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT)
group by ACCT;```

Output:

```ACCT    std(CAST(AMT AS DOUBLE))
101     489.76616954760493
103     NaN
102     579.8982712510876```

Another example:

```SELECT std(AMT) FROM VALUES
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT);	```

Output:

```std(CAST(AMT AS DOUBLE))
461.543305097149```

### Population standard deviation stddev_pop

Compared with sample standard deviation, there is a population standard deviation named stddev_pop.

Example:

```SELECT stddev_pop(AMT) FROM VALUES
(101,10.01, DATE'2021-01-01'),
(101,102.01, DATE'2021-01-01'),
(102,93., DATE'2021-01-01'),
(103,913.1, DATE'2021-01-02'),
(102,913.1, DATE'2021-01-02'),
(101,900.56, DATE'2021-01-03')
AS TXN(ACCT,AMT, TXN_DT);```

Output:

```stddev_pop(CAST(AMT AS DOUBLE))
421.3294657786627```

The results are different because sample and population standard deviation have different formula definitions. Find the implementation difference here: spark/CentralMomentAgg.scala at master ยท apache/spark (github.com).

info Last modified by Raymond 2 years ago copyright This page is subject to Site terms.

Please log in or register to comment.

#### Log in with external accounts

Page index 4.85
##### Show Headings (Column Names) in spark-sql CLI Result
access_time 2 years ago
##### Spark SQL - DENSE_RANK Window Function
access_time 2 years ago
##### Spark SQL - Average (AVG) Calculation
access_time 8 months ago