Sunday, June 30, 2013

Calculate correlation matrix using SQL (continued)

In the earlier post Calculate correlation matrix using SQL, we showed a simple trick of calculating correlation matrix for unlimited number of rows and columns. We need to convert the data table into a simple table with only three columns: rec_id (record id), var_id (column_id), val (value). I was asked how I did the conversion.

If the original table is already in an Oracle database, we can write PL/SQL scripts to convert the original table into the simple table.

In the past experience, my original data were in text file format. So I wrote awk scripts similar to the following to convert the text file into simple format and then loaded them into Oracle. In the script, I included file name as the first column so that I could put data from multiple files into a single destination file.

Original data:
$ cat sample.csv
record_id, field1, field2, field3
1001,1,2,3
1002,4,5,6
1003,7,8,9
$ awk -F"," '{for (i=1;i<=NF;i++) print FILENAME","FNR-1","i","$i;}' sample.csv
sample.csv,0,1,record_id
sample.csv,0,2, field1
sample.csv,0,3, field2
sample.csv,0,4, field3
sample.csv,1,1,1001
sample.csv,1,2,1
sample.csv,1,3,2
sample.csv,1,4,3
sample.csv,2,1,1002
sample.csv,2,2,4
sample.csv,2,3,5
sample.csv,2,4,6
sample.csv,3,1,1003
sample.csv,3,2,7
sample.csv,3,3,8
sample.csv,3,4,9