Monday, August 27, 2012

Calculate correlation matrix using SQL


Any table can be converted into the a simple table with only three columns:  rec_id (record id), var_id (column_id), val (value). For example,
Original Table
Record_ID   Variable_1 Variable_2 Variable3
1                      1.2             2.2                 3.3
2                     2.0             1.5                  4.0
The above table can be converted into the following simple table.
Rec_id  var_id    val
1             1         1.2
1             2         2.2
1             3         3.3
2             1         2.0
2             2         1.5
2             3         4.0

Once the data are stored in the simple, three column table,  the top-right correlation matrix can be easily calculated using the following query. 

select a.var_id var_id1, b.var_id var_id2, corr(a.val, b.val) correl
from tbl_simple  a, tbl_simple b where a.rec_id=b.rec_id 
and a.var_id<=b.var_id group by a.var_id, b.var_id ;

This method can handle unlimited number of records and variables. To improve the performance, we should create index on rec_id and var_id. Also see my yesterday's post about "Calculate correlation coefficient between two variables using SQL".

2 comments:

Unknown said...

I am new to SQL how do you create the simple table from the original?

Thank you

Jay Zhou, PhD. said...

Please see my post Calculate correlation matrix using SQL (continued). Thanks.