Spark SQL provides two functions for calculating the covariance of a set of number pairs: **covar_pop(expr1, expr2)** and **covar_samp(expr1, expr2)**. The first returns the population covariance, while the second returns the sample covariance.
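For reference, the two quantities differ only in the divisor. The standard textbook definitions for n pairs are written out below. The symbols (x_i, y_i, n and the barred means) are notation introduced here for illustration and do not come from the Spark documentation.

```latex
\operatorname{cov}_{\text{pop}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
\operatorname{cov}_{\text{samp}}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

Here \bar{x} and \bar{y} are the arithmetic means of the first and second expression respectively.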
covar\_pop
Example:
SELECT covar_pop(col1,col2) FROM VALUES
(1,10.),
(2,20.1),
(3,29.86),
(4,41.8),
(10,101.5)
AS tab(col1, col2);
Output:
covar_pop(CAST(col1 AS DOUBLE), CAST(col2 AS DOUBLE))
101.788
covar\_samp
Example:
SELECT covar_samp(col1,col2) FROM VALUES
(1,10.),
(2,20.1),
(3,29.86),
(4,41.8),
(10,101.5)
AS tab(col1, col2);
Output:
covar_samp(CAST(col1 AS DOUBLE), CAST(col2 AS DOUBLE))
127.235
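The two outputs can be cross-checked outside Spark. The following is a minimal sketch in plain Python that applies the textbook definitions to the same five pairs. The helper names `co_moment`, `covar_pop` and `covar_samp` are hypothetical and defined here only for illustration. This is not how Spark computes the result internally.

```python
# Same five (col1, col2) pairs as in the SQL examples.
pairs = [(1, 10.0), (2, 20.1), (3, 29.86), (4, 41.8), (10, 101.5)]


def co_moment(pairs):
    """Sum of the products of deviations from the two means."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs)


def covar_pop(pairs):
    """Population covariance: divide the co-moment by n."""
    return co_moment(pairs) / len(pairs)


def covar_samp(pairs):
    """Sample covariance: divide by n - 1 (assumes at least two pairs)."""
    return co_moment(pairs) / (len(pairs) - 1)


print(round(covar_pop(pairs), 3))   # 101.788
print(round(covar_samp(pairs), 3))  # 127.235
```

Both functions share the same co-moment (508.94 for this data) and differ only in dividing by 5 or by 4, which accounts for the gap between the two results.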
The difference between the sample and population covariance implementations can be found here: spark/Covariance.scala at master · apache/spark (github.com)