Deep Data Mining Blog: July 2012

Sunday, July 29, 2012

More on Performing Principal Component Analysis in Oracle

Wiht Oracle UTL_NLA package, we can call routines to perform PCA on vectors and matrices represented as VARRAYs. How ever, vectors and matrices are stored in VARRAYs with a maximum size of one million entries. Given this restriction, UTL_NLA vectors can be up to one million entries but matrices need to be of size RxC <= 1,000,000.

That's why I have developed my own functions to perform PCA. It is much easier to use and can handle unlimited number of records. Data and operations always stay in the database.

Saturday, July 28, 2012

Perform Principal Component Analysis (PCA) Using R, SAS and SQL

Perform Principal Component Analysis (PCA) in R.

R1. Using princomp(m1): by default covariance matrix is used. data is center shifted but not scaled. princomp(m1)$score can be precisely replicated by:
scale(m1,scale=F)%*%princomp(m1)$loading
R2. princomp(d, corr=T): correlation matrix is used. data is center shifted and scaled based on standard deviation. However, standard deviation is based on divisor N not N-1. princomp(m1, cor=TRUE)$score can be precisely replicated by:
((m1-princomp(m1, cor=TRUE)$center)/princomp(m1, cor=TRUE)$scale)%*%princomp(m1, cor=TRUE)$loading

Perform PCA in SAS. By default, correlation matrix is used.
SAS 1. proc princomp data=M1 cov out=m1_pca;
run;

SAS 2. proc princomp data=M1 out=m1_pca_cor;
run;

Scores from R1 match scores from SAS 1.
Scores from R2 roughly match scores from SAS 2. The difference is caused by that in R, standard deviation, used as scaling facor, is based on divisor N not N-1. In SAS, the divisor for standard deviation is N-1.

Calculating PCA in a database using SQL is a very interesting way. We can perform PCA on large data sets.

Friday, July 20, 2012

SAS strip vs Oracle trim functions

SAS Strip and Oracle trim functions are equivalent. Both can remove all leading and trailing blanks. With Oracle trim, one can also define specific characters to be removed.

Oracle trim examples.

trim (' hello ') returns 'hello'

trim (both '0' from '0012130') returns 1213

SAS Strip remove leading and trailing blanks.

SAS data set vs. relational database table

There is a key difference between SAS data sets and relational database tables. In SAS data set, like a text file, there is a natural, fixed order of records (or observations in SAS term). The order of a data set will not change unless we specifically modify it using method like "proc sort". So in SAS, it makes perfect sense to say "extract the first 500 observations from a data set".

However, in a relational database table, there is no natural order defined for records in a table. To say "exact the first 500 records from a table" does not make sense. We have to specifically define an order. Thus in a relational database, the following statements all make sense.
"Extract 500 records randomly from a table."
"Extract the top 500 records with the highest account numbers from a table (assuming account numbers are unique)".

Caution needs to be taken when converting SAS data set to a relational data table. The natural order of records in the SAS data set will be lost. One way to preserve the natural order of a SAS data set is to create an extra column based on SAS system variable _N_.

Tuesday, July 10, 2012

How to tokenize text in Oracle

The following procedure splits text column query into keywords,e.g., "Hello World" into two words "Hello" and "World".

Step 1. Create index of type ctxsys.contex.
create index t_query_idx on t_query(query) indextype is ctxsys.context;

Step 2.
set serveroutput on;
declare
the_tokens ctx_doc.token_tab;
begin
for i in (select row_p_key from t_query order by row_p_key) loop
ctx_doc.tokens('t_query_idx', i.row_p_key, the_tokens);
for ii in 1..the_tokens.count loop
insert into tbl_query_token select i.row_p_key,the_tokens(ii).offset, the_tokens(ii).token from dual;
end loop;
commit;
end loop;
end;
/

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics