Spark SQL - Calculate Covariance

Spark SQL provides two functions for calculating the covariance of a set of number pairs: **covar_pop(expr1, expr2)** and **covar_samp(expr1, expr2)**. The first calculates the population covariance, while the second calculates the sample covariance.
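For reference, the two quantities are usually defined as shown below. These are the standard textbook formulas rather than anything taken from Spark's code; the symbols (n for the number of pairs, x̄ and ȳ for the column means) are notation assumed here for illustration. The only difference between the two is the divisor: n for the population covariance and n - 1 for the sample covariance.

```latex
\operatorname{cov}_{\text{pop}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
\operatorname{cov}_{\text{samp}}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```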

covar\_pop

Example:

SELECT covar_pop(col1,col2) FROM VALUES 
(1,10.),
(2,20.1),
(3,29.86),
(4,41.8),
(10,101.5)
AS tab(col1, col2);

Output:

covar_pop(CAST(col1 AS DOUBLE), CAST(col2 AS DOUBLE))
101.788

covar\_samp

Example:

SELECT covar_samp(col1,col2) FROM VALUES 
(1,10.),
(2,20.1),
(3,29.86),
(4,41.8),
(10,101.5)
AS tab(col1, col2);

Output:

covar_samp(CAST(col1 AS DOUBLE), CAST(col2 AS DOUBLE))
127.235
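The two results above can be cross-checked outside Spark. The following is a minimal pure-Python sketch, assuming the usual textbook definitions (divide the sum of products of deviations by n for the population covariance and by n - 1 for the sample covariance). The `covariance` helper is hypothetical and written only for this check; it is not part of Spark and does not mirror how Spark computes the value internally.

```python
# Pure-Python cross-check of the two Spark SQL results shown above.
# covariance() is a hypothetical helper for illustration, not a Spark API.

def covariance(xs, ys, sample=False):
    """Return the population covariance, or the sample covariance if sample=True."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < (2 if sample else 1):
        raise ValueError("not enough data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the respective means.
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Population covariance divides by n, sample covariance by n - 1.
    return co_moment / (n - 1 if sample else n)

# Same five pairs as in the SQL examples.
col1 = [1, 2, 3, 4, 10]
col2 = [10.0, 20.1, 29.86, 41.8, 101.5]

pop = covariance(col1, col2)
samp = covariance(col1, col2, sample=True)
print(round(pop, 3), round(samp, 3))
```

Rounded to three decimal places, the two values should agree with the covar_pop and covar_samp outputs above. With five pairs, the sample covariance is the population covariance multiplied by 5/4.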

The difference between the sample and population covariance implementations can be found here: spark/Covariance.scala at master · apache/spark (github.com)