Saturday, July 28, 2012

Perform Principal Component Analysis (PCA) Using R, SAS and SQL

Perform Principal Component Analysis (PCA) in R.

R1. princomp(m1): by default, the covariance matrix is used. The data are centered but not scaled. princomp(m1)$scores can be replicated exactly by:
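
A minimal sketch of one such replication: center each column, then multiply by the loadings. It assumes m1 is a numeric matrix; the built-in USArrests data set stands in for it here, and the names centered and scores_manual are illustrative only.

```r
# Stand-in data (assumption): any numeric matrix would do for m1.
m1 <- as.matrix(USArrests)

# Covariance-based PCA, the princomp() default.
pca_cov <- princomp(m1)

# Center each column without scaling, then project onto the loadings.
centered <- scale(m1, center = TRUE, scale = FALSE)
scores_manual <- centered %*% unclass(pca_cov$loadings)

# Compare the hand-computed scores with the ones princomp() reports.
all.equal(unname(scores_manual), unname(pca_cov$scores))
```
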
R2. princomp(m1, cor=TRUE): the correlation matrix is used. The data are centered and scaled by the standard deviation. Note, however, that the standard deviation is computed with divisor N, not N-1. princomp(m1, cor=TRUE)$scores can be replicated exactly by:
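
A minimal sketch of one such replication, again assuming m1 is a numeric matrix (USArrests is a stand-in). Because sd() in R uses divisor N-1, the standard deviations are rescaled to divisor N before standardizing; sd_n and standardized are illustrative names.

```r
# Stand-in data (assumption): any numeric matrix would do for m1.
m1 <- as.matrix(USArrests)
n <- nrow(m1)

# Correlation-based PCA.
pca_cor <- princomp(m1, cor = TRUE)

# sd() divides by N - 1, so rescale to get the divisor-N standard deviation.
sd_n <- apply(m1, 2, sd) * sqrt((n - 1) / n)

# Center and scale each column, then project onto the loadings.
standardized <- scale(m1, center = TRUE, scale = sd_n)
scores_manual <- standardized %*% unclass(pca_cor$loadings)

# Compare the hand-computed scores with the ones princomp() reports.
all.equal(unname(scores_manual), unname(pca_cor$scores))
```
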

Perform PCA in SAS. By default, the correlation matrix is used.
SAS 1. proc princomp data=M1 cov out=m1_pca;
run;

SAS 2. proc princomp data=M1 out=m1_pca_cor;
run;

Scores from R1 match scores from SAS 1.
Scores from R2 roughly match scores from SAS 2. The difference arises because R's princomp() computes the standard deviation used as the scaling factor with divisor N, whereas SAS uses divisor N-1.
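
The size of that discrepancy can be checked in R alone. The sketch below does not call SAS; it assumes that standardizing with scale(), which uses divisor N-1, mimics the scaling described for SAS 2, and it uses USArrests as a stand-in for m1. Under that assumption the two sets of scores differ only by the constant factor sqrt((N-1)/N).

```r
# Stand-in data (assumption): any numeric matrix would do for m1.
m1 <- as.matrix(USArrests)
n <- nrow(m1)

# Correlation-based PCA in R: scaling uses divisor N.
pca_cor <- princomp(m1, cor = TRUE)

# Same loadings, but standardize with divisor N - 1 (what scale() does by default).
scores_nm1 <- scale(m1) %*% unclass(pca_cor$loadings)

# The divisor N - 1 scores equal the princomp() scores times sqrt((N - 1) / N).
ratio <- sqrt((n - 1) / n)
all.equal(unname(scores_nm1), unname(pca_cor$scores) * ratio)
```

For large N the factor is close to 1, which is why the scores only roughly match.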

Calculating PCA in a database using SQL is an interesting alternative, because it lets us perform PCA on large data sets.
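
One reason this can work: the covariance matrix depends only on a row count, column sums, and sums of pairwise products, which are the kind of aggregates SQL can compute with COUNT(*), SUM(x), and SUM(x * y). The sketch below illustrates that idea in R and is not necessarily the approach taken in the database; USArrests is a stand-in for m1 and all variable names are illustrative.

```r
# Stand-in data (assumption): any numeric matrix would do for m1.
m1 <- as.matrix(USArrests)

# Aggregates of the kind a SQL query could return.
n     <- nrow(m1)        # COUNT(*)
sums  <- colSums(m1)     # SUM(x_i) for each column
cross <- crossprod(m1)   # SUM(x_i * x_j) for each pair of columns

# Covariance matrix (divisor N - 1) built from the aggregates alone.
cov_from_sums <- (cross - outer(sums, sums) / n) / (n - 1)

# Eigen-decomposition of the small p x p matrix gives the loadings.
eig <- eigen(cov_from_sums, symmetric = TRUE)

# Check against R's own covariance calculation.
all.equal(unname(cov_from_sums), unname(cov(m1)))
```

Only the small p x p matrix has to leave the database, regardless of how many rows the table holds.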