Showing posts with label Calculate correlation coefficient. Show all posts
Showing posts with label Calculate correlation coefficient. Show all posts

Sunday, June 30, 2013

Calculate correlation matrix using SQL (continued)

In the earlier post Calculate correlation matrix using SQL, we showed a simple trick of calculating correlation matrix for unlimited number of rows and columns. We need to convert the data table into a simple table with only three columns: rec_id (record id), var_id (column_id), val (value). I was asked how I did the conversion.

If the original table is already in an Oracle database, we can write PL/SQL scripts to convert the original table into the simple table.

In the past experience, my original data were in text file format. So I wrote awk scripts similar to the following to convert the text file into simple format and then loaded them into Oracle. In the script, I included file name as the first column so that I could put data from multiple files into a single destination file.

Original data:
$ cat sample.csv
record_id, field1, field2, field3
1001,1,2,3
1002,4,5,6
1003,7,8,9
$ awk -F"," '{for (i=1;i<=NF;i++) print FILENAME","FNR-1","i","$i;}' sample.csv
sample.csv,0,1,record_id
sample.csv,0,2, field1
sample.csv,0,3, field2
sample.csv,0,4, field3
sample.csv,1,1,1001
sample.csv,1,2,1
sample.csv,1,3,2
sample.csv,1,4,3
sample.csv,2,1,1002
sample.csv,2,2,4
sample.csv,2,3,5
sample.csv,2,4,6
sample.csv,3,1,1003
sample.csv,3,2,7
sample.csv,3,3,8
sample.csv,3,4,9

Sunday, August 26, 2012

Calculate correlation coefficient between two variables using SQL


Oracle function corr()  returns the correlation coefficient between two variables. The following scripts show two ways of calculating correlation coefficient.

1. Calculate correlation coefficient  using corr() function.
select corr(x,y) from TBL_XY;
.214418947
2. Calculate correlation coefficient directly without using corr() function. The equation is based on  Pearson's coefficient (http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient). The results are the same.
with tbl_mean as(
select avg(x) mean_x, avg(y) mean_y from tbl_xy
),
tbl_vec_m_corrected as
(
select x-mean_x mean_x_corrected, y-mean_y mean_y_corrected
from tbl_xy, tbl_mean
)
select sum(mean_x_corrected*mean_y_corrected)/(
sqrt(sum(mean_x_corrected*mean_x_corrected)) *sqrt(sum(mean_y_corrected*mean_y_corrected)))
 as corr_coef
from tbl_vec_m_corrected ;
.214418947