Sunday, July 29, 2012

More on Performing Principal Component Analysis in Oracle

Wiht Oracle `UTL_NLA` package, we can call routines to perform PCA on vectors and matrices represented as `VARRAY`s. How ever, vectors and matrices are stored in `VARRAY`s with a maximum size of one million entries. Given this restriction, `UTL_NLA` vectors can be up to one million entries but matrices need to be of size RxC <= 1,000,000.

That's why I have developed my own functions to perform PCA. It is much easier to use and can handle unlimited number of records. Data and operations always stay in the database.

Saturday, July 28, 2012

Perform Principal Component Analysis (PCA) Using R, SAS and SQL

Perform Principal Component Analysis (PCA) in R.

R1. Using princomp(m1): by default covariance matrix is used. data is center shifted but not scaled. princomp(m1)\$score can be precisely replicated by:
R2. princomp(d, corr=T): correlation matrix is used. data is center shifted and scaled based on standard deviation. However, standard deviation is based on divisor N not N-1. princomp(m1, cor=TRUE)\$score can be precisely replicated by:

Perform PCA in SAS. By default, correlation matrix is used.
SAS 1. proc princomp data=M1 cov out=m1_pca;
run;

SAS 2. proc princomp data=M1 out=m1_pca_cor;
run;

Scores from R1 match scores from  SAS 1.
Scores from R2 roughly match scores from SAS 2. The difference is caused by that in R,  standard deviation, used as scaling facor, is based on divisor N not N-1. In SAS,   the divisor for standard deviation is N-1.

Calculating PCA in a database using SQL is a very interesting way. We can perform PCA  on large data sets.

Friday, July 20, 2012

SAS strip vs Oracle trim functions

SAS Strip and Oracle trim functions are equivalent. Both  can remove all leading and trailing blanks.  With Oracle trim, one can also define specific characters to be removed.

Oracle trim examples.

trim ('  hello ') returns 'hello'
trim (both '0' from '0012130')  returns 1213

SAS Strip remove leading and trailing blanks.

SAS data set vs. relational database table

There is a key difference between SAS data sets and relational database tables. In SAS data set, like a text file, there is a natural, fixed order of records (or observations in SAS term). The order of a data set will not change unless we specifically modify it using method like "proc sort". So in SAS, it makes perfect sense to say "extract the first 500 observations from a data set".

However, in a relational database table, there is no natural order defined for records in a table. To say "exact the first 500 records from a table" does not make sense. We have to specifically define an order. Thus in a relational database, the following statements all make sense.
"Extract 500 records randomly from a table."
"Extract the top 500 records with the highest account numbers from a table (assuming account numbers are unique)".

Caution needs to be taken when converting SAS data set to a relational data table. The natural order of records in the SAS data set will be lost. One way to preserve the natural order of a SAS data set is to  create an extra column based on SAS system variable _N_.

Tuesday, July 10, 2012

How to tokenize text in Oracle

The following procedure splits text column query into keywords,e.g., "Hello World" into two words "Hello" and "World".

Step 1. Create index of type ctxsys.contex.
create index t_query_idx on t_query(query) indextype is ctxsys.context;

Step 2.
set serveroutput on;
declare
the_tokens ctx_doc.token_tab;
begin
for i in (select row_p_key from t_query order by row_p_key) loop
ctx_doc.tokens('t_query_idx', i.row_p_key, the_tokens);
for ii in 1..the_tokens.count loop
insert into tbl_query_token select i.row_p_key,the_tokens(ii).offset, the_tokens(ii).token from dual;
end loop;
commit;
end loop;
end;
/