Saturday, July 28, 2012

Perform Principal Component Analysis (PCA) Using R, SAS and SQL


Perform Principal Component Analysis (PCA) in R.

R1. princomp(m1): by default the covariance matrix is used. The data are centered but not scaled. princomp(m1)$scores can be replicated exactly by:
scale(m1, scale=FALSE) %*% princomp(m1)$loadings
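A minimal sketch of how the R1 claim could be checked. The post does not show m1, so the built-in USArrests data set is used here as a stand-in (an assumption), and the names pc and manual are illustrative only.

```r
# Stand-in data: the post's m1 is not shown, so a built-in data set is assumed
m1 <- as.matrix(USArrests)

# Covariance-based PCA (the princomp default)
pc <- princomp(m1)

# Center the columns without scaling, then project onto the loadings
manual <- scale(m1, scale = FALSE) %*% pc$loadings

# Compare the manual projection with the scores reported by princomp
print(all.equal(unname(manual), unname(pc$scores)))
```

Comparing with all.equal rather than identical allows for floating-point rounding.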
R2. princomp(m1, cor=TRUE): the correlation matrix is used. The data are centered and scaled by the standard deviation. However, the standard deviation is computed with divisor N, not N-1. princomp(m1, cor=TRUE)$scores can be replicated exactly by:
scale(m1, center=princomp(m1, cor=TRUE)$center, scale=princomp(m1, cor=TRUE)$scale) %*% princomp(m1, cor=TRUE)$loadings
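A sketch of the same check for R2, again using USArrests as an assumed stand-in for m1. It uses scale() to center and scale column by column, because subtracting a length-p vector from an n x p matrix directly recycles the vector down the columns rather than across them. It also checks that the scale component is the standard deviation with divisor N.

```r
# Stand-in data: the post's m1 is not shown, so a built-in data set is assumed
m1 <- as.matrix(USArrests)

# Correlation-based PCA
pc_cor <- princomp(m1, cor = TRUE)

# Standard deviation with divisor N: root mean square of the centered columns
sd_n <- sqrt(colMeans(scale(m1, scale = FALSE)^2))
print(all.equal(unname(sd_n), unname(pc_cor$scale)))

# Center and scale each column, then project onto the loadings
manual <- scale(m1, center = pc_cor$center, scale = pc_cor$scale) %*% pc_cor$loadings
print(all.equal(unname(manual), unname(pc_cor$scores)))
```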

Perform PCA in SAS. By default, PROC PRINCOMP uses the correlation matrix; the COV option switches to the covariance matrix.
SAS 1. proc princomp data=M1 cov out=m1_pca;
         run;

SAS 2. proc princomp data=M1 out=m1_pca_cor;
         run;

Scores from R1 match scores from SAS 1.
Scores from R2 only roughly match scores from SAS 2. The difference arises because R's princomp computes the standard deviation used as the scaling factor with divisor N, whereas SAS uses divisor N-1. With the same loadings, the R scores are therefore larger than the SAS scores by a constant factor of sqrt(N/(N-1)).
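The effect of the divisor can be illustrated entirely in R. The sketch below standardizes the data with the N-1 standard deviation (the convention the post attributes to SAS), projects onto the same loadings, and compares the result with the princomp scores. USArrests is an assumed stand-in for m1, and no SAS output is reproduced here.

```r
# Stand-in data: the post's m1 is not shown, so a built-in data set is assumed
m1 <- as.matrix(USArrests)
n <- nrow(m1)

# Correlation-based PCA in R (scaling uses divisor N)
pc_cor <- princomp(m1, cor = TRUE)

# scale() with defaults centers each column and divides by sd(), i.e. divisor N-1
z <- scale(m1)
scores_n1 <- z %*% pc_cor$loadings

# The two sets of scores differ only by a constant factor
factor_n <- sqrt(n / (n - 1))
print(factor_n)
print(all.equal(unname(pc_cor$scores), unname(scores_n1) * factor_n))
```

For large N the factor approaches 1, which is why the two sets of scores match only roughly rather than exactly.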

Computing PCA inside a database using SQL is another interesting approach, since it lets us perform PCA on large data sets.
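One possible route, which is not necessarily the approach the author had in mind, is to let the database compute only simple aggregates (a row count, column sums, and sums of cross-products) and then decompose the resulting small p x p covariance matrix. The sketch below imitates those aggregates in R rather than SQL, with USArrests as an assumed stand-in for the table; all variable names are illustrative.

```r
# Stand-in data: the post's m1 is not shown, so a built-in data set is assumed
m1 <- as.matrix(USArrests)

# Aggregates a database could return with COUNT(*), SUM(x) and SUM(x * y)
n <- nrow(m1)
col_sums <- colSums(m1)
cross_sums <- crossprod(m1)   # p x p matrix of sums of cross-products

# Covariance matrix (divisor N-1) rebuilt from the aggregates alone
cov_mat <- (cross_sums - tcrossprod(col_sums) / n) / (n - 1)
print(all.equal(unname(cov_mat), unname(cov(m1))))

# Eigen-decomposition of the small p x p matrix gives the principal components
e <- eigen(cov_mat, symmetric = TRUE)

# princomp uses divisor N, so rescale the eigenvalues before comparing
pc <- princomp(m1)
print(all.equal(sqrt(e$values * (n - 1) / n), unname(pc$sdev)))
```

Only the aggregation step touches every row, so the expensive part scales with the number of rows while the decomposition depends only on the number of variables. The sum-of-products formula can lose precision when column means are large relative to the spread, so centering the data first may be preferable in practice.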

2 comments:

davem said...

Great, but there is no sample of how you have done this in SQL. Or am I missing something?

Jay Zhou, PhD. said...

I would suggest taking a look at the function utl_nla.LAPACK_GESVD.