Showing posts with label Calculate correlation matrix using SQL. Show all posts
Showing posts with label Calculate correlation matrix using SQL. Show all posts

Sunday, June 30, 2013

Calculate correlation matrix using SQL (continued)

In the earlier post Calculate correlation matrix using SQL, we showed a simple trick of calculating correlation matrix for unlimited number of rows and columns. We need to convert the data table into a simple table with only three columns: rec_id (record id), var_id (column_id), val (value). I was asked how I did the conversion.

If the original table is already in an Oracle database, we can write PL/SQL scripts to convert the original table into the simple table.

In the past experience, my original data were in text file format. So I wrote awk scripts similar to the following to convert the text file into simple format and then loaded them into Oracle. In the script, I included file name as the first column so that I could put data from multiple files into a single destination file.

Original data:
$ cat sample.csv
record_id, field1, field2, field3
1001,1,2,3
1002,4,5,6
1003,7,8,9
$ awk -F"," '{for (i=1;i<=NF;i++) print FILENAME","FNR-1","i","$i;}' sample.csv
sample.csv,0,1,record_id
sample.csv,0,2, field1
sample.csv,0,3, field2
sample.csv,0,4, field3
sample.csv,1,1,1001
sample.csv,1,2,1
sample.csv,1,3,2
sample.csv,1,4,3
sample.csv,2,1,1002
sample.csv,2,2,4
sample.csv,2,3,5
sample.csv,2,4,6
sample.csv,3,1,1003
sample.csv,3,2,7
sample.csv,3,3,8
sample.csv,3,4,9

Monday, August 27, 2012

Calculate correlation matrix using SQL


Any table can be converted into the a simple table with only three columns:  rec_id (record id), var_id (column_id), val (value). For example,
Original Table
Record_ID   Variable_1 Variable_2 Variable3
1                      1.2             2.2                 3.3
2                     2.0             1.5                  4.0
The above table can be converted into the following simple table.
Rec_id  var_id    val
1             1         1.2
1             2         2.2
1             3         3.3
2             1         2.0
2             2         1.5
2             3         4.0

Once the data are stored in the simple, three column table,  the top-right correlation matrix can be easily calculated using the following query. 

select a.var_id var_id1, b.var_id var_id2, corr(a.val, b.val) correl
from tbl_simple  a, tbl_simple b where a.rec_id=b.rec_id 
and a.var_id<=b.var_id group by a.var_id, b.var_id ;

This method can handle unlimited number of records and variables. To improve the performance, we should create index on rec_id and var_id. Also see my yesterday's post about "Calculate correlation coefficient between two variables using SQL".