Monday, November 25, 2013

Calculate the Correlation Coefficient Using SQL

Problem

We want to calculate the correlation coefficient between two variables, such as the length of employment and salary within a company. If the data are Oracle database table, how do we do it using SQL?
SQL> select employee_id, department_id, hire_date, salary 
from EMPLOYEES where rownum <=20;

EMPLOYEE_ID DEPARTMENT_ID HIRE_DATE   SALARY
----------- ------------- --------- --------
        100            90 17-JUN-87  24000.0
        101            90 21-SEP-89  17000.0
        102            90 13-JAN-93  17000.0
        103            60 03-JAN-90   9000.0
        104            60 21-MAY-91   6000.0
        105            60 25-JUN-97   4800.0
        106            60 05-FEB-98   4800.0
        107            60 07-FEB-99   4200.0
        108           100 17-AUG-94  12000.0
        109           100 16-AUG-94   9000.0
        110           100 28-SEP-97   8200.0
        111           100 30-SEP-97   7700.0
        112           100 07-MAR-98   7800.0
        113           100 07-DEC-99   6900.0
        114            30 07-DEC-94  11000.0
        115            30 18-MAY-95   3100.0
        116            30 24-DEC-97   2900.0
        117            30 24-JUL-97   2800.0
        118            30 15-NOV-98   2600.0
        119            30 10-AUG-99   2500.0

20 rows selected.

Solution

We use Oracle corr function to calculate the correlation coefficient. The following is the query (since the data is old, we assume that the current date is May 30, 2000.)


SQL> select  corr(salary, to_date('20000530','YYYYMMDD')-hire_date) cor  from EMPLOYEES;

       COR
----------
.497838041
We can verify the correlation coefficient using the following query based on the equation for calculating correlation coefficient.
SQL> with 
tbl as (select avg(salary) m1, stddev(salary) s1, 
avg(to_date('20000530','YYYYMMDD')-hire_date) m2, 
stddev(to_date('20000530','YYYYMMDD')-hire_date) s2 ,
count(1) n from EMPLOYEES), 
tbl2 as (select salary, 
to_date('20000530','YYYYMMDD')-hire_date l from employees) 
select sum( (salary-m1)*(l-m2)/(n-1)/(s1*s2) ) from tbl, tbl2;

SUM((SALARY-M1)*(L-M2)/(N-1)/(S1*S2))
-------------------------------------
                           .497838041

No comments: