Spark SQL - Calculate Covariance

Spark SQL provides two functions for calculating the covariance of a set of number pairs: covar_pop(expr1, expr2), which returns the population covariance, and covar_samp(expr1, expr2), which returns the sample covariance.
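For reference, the standard textbook definitions of the two quantities are given here for n pairs (x_i, y_i) with means x̄ and ȳ. The symbols are notation introduced for this illustration and are not taken from the Spark documentation.

```latex
\operatorname{cov}_{\text{pop}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
\operatorname{cov}_{\text{samp}}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

The only difference is the divisor: n for the population covariance and n - 1 for the sample covariance.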

covar_pop

Example:

SELECT covar_pop(col1,col2) FROM VALUES 
(1,10.),
(2,20.1),
(3,29.86),
(4,41.8),
(10,101.5)
AS tab(col1, col2);

Output:

covar_pop(CAST(col1 AS DOUBLE), CAST(col2 AS DOUBLE))
101.788

covar_samp

Example:

SELECT covar_samp(col1,col2) FROM VALUES 
(1,10.),
(2,20.1),
(3,29.86),
(4,41.8),
(10,101.5)
AS tab(col1, col2);

Output:

covar_samp(CAST(col1 AS DOUBLE), CAST(col2 AS DOUBLE))
127.235
The difference between the sample and population covariance implementations can be found here: spark/Covariance.scala at master · apache/spark (github.com)
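As a rough cross-check of the two outputs, the same numbers can be reproduced without Spark. The sketch below is plain Python and assumes the usual textbook definitions (divide by n for population, by n - 1 for sample). The helper name covariance is hypothetical, and the code does not attempt to mirror Spark's incremental update or its handling of NULLs and single-row input.

```python
# Plain-Python check of the covar_pop / covar_samp results; no Spark required.
col1 = [1, 2, 3, 4, 10]
col2 = [10.0, 20.1, 29.86, 41.8, 101.5]


def covariance(xs, ys, sample=False):
    """Return the population covariance, or the sample covariance if sample=True.

    Hypothetical helper for illustration only; it is not part of any Spark API.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of each pair's deviations from the two means.
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Population covariance divides by n; sample covariance divides by n - 1.
    return co_moment / (n - 1 if sample else n)


print(round(covariance(col1, col2), 3))               # 101.788
print(round(covariance(col1, col2, sample=True), 3))  # 127.235
```

With five rows, the sample value is the population value scaled by 5/4, which matches the two outputs above.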