Deep Data Mining Blog: August 2012

Friday, August 31, 2012

Random Sampling in Oracle using Sample Clause

In an earlier post about "Random Sampling using SQL", we use functions mode and dbms_random.random to randomly sample 30% of records. This can also be done easily with oracle sample clause as the following:
select * from TBL_TEST sample(30);
Sample clause is also a great way to test the performance of a query before we run it on large tables.

Thursday, August 30, 2012

Select top percent in Oracle

Oracle analytic function cume_dist() calculates the cumulative percentage for a variable.

The following query selects the top 5% records based on variable num.
with tbl as
(
select a.*, cume_dist() over(order by num desc) cumulative_pcnt from tbl_test a
)
select * from tbl where cumulative_pcnt <=0 .05;

The following query selects the bottom 5% records based on variable num.
with tbl as
(
select a.*, cume_dist() over(order by num ) cumulative_pcnt from tbl_test a
)
select * from tbl where cumulative_pcnt <=0 .05;

The following query selects the top 5% records based on variable num by group.
with tbl as
(
select a.*, cume_dist() over(partition by grp order by num desc) cumulative_pcnt from tbl_test a
)
select * from tbl where cumulative_pcnt <=0 .05;

Wednesday, August 29, 2012

Random Sampling using SQL

The following SQL scripts create a view by taking 30% random samples from a table.

Step 1 . Create random numbers within range of 0 and 99 (100 bins).
create table tbl_app_id_w_rnd as select app_id,
mod(abs(dbms_random.random()),100) rnd from CELL_PHONE_SERVICE_APPS;

Step 2. Create a view that contains 30% of the records.
create view v_data_testing
as select a.* from CELL_PHONE_SERVICE_APPS a,
tbl_app_id_w_rnd b where a.app_id=b.app_id and rnd < 30;

Please see a newer post. dbms_random.value is a better function than dbms_random.random.

Build K-Means Clustering Models Using PL/SQL

As mentioned in an earlier post, we can build predictive models within Oracle using PL/SQL. One of such models is K-Means clustering which does not require a target variable. K-Means clustering iteratively

Step 1 . Define K-Means clustering parameters
CREATE TABLE km_settings (
setting_name VARCHAR2(30), setting_value VARCHAR2(30));
BEGIN
INSERT INTO km_settings (setting_name, setting_value) VALUES
(dbms_data_mining.kmns_distance, dbms_data_mining.kmns_euclidean);
INSERT INTO km_settings (setting_name, setting_value) VALUES
(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
INSERT INTO km_settings (setting_name, setting_value) VALUES (dbms_data_mining.clus_num_clusters, '7');
END;
/
Step 2. Build a K-Means Clustering model
define m_name='km_0425a'
define input_tbl='v_data_training_4_cls'
define rec_id='APP_ID'
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => '&m_name',
mining_function => dbms_data_mining.clustering,
data_table_name => '&input_tbl',
case_id_column_name => '&rec_id',
settings_table_name => 'km_settings');
END;
/

Again once the model is built, it is a mining object. We can use Cluster_id() function to calculate cluster members for new data.
select app_id, cluster_id(km_0425a using *) cls from CELL_PHONE_SERVICE_APPS_NEW

Tuesday, August 28, 2012

Random number generators

There are many random generators that can be used for random sampling such as rand() in SAS, dbms_random.rand() in Oracle, rand() in Excel spreadsheet. I have helped implement a Winchmann- Hill random generator using PL/SQL. Here is a link to WH algorithm (http://www2.imperial.ac.uk/~hohs/S9/2007-2008/wichmannhill.pdf).

The reason to implement Winchmann-Hill random generator using PL/SQL was that we wanted to replicate the sequence of random numbers precisely given the same random seed. We did not want to use SAS or Oracle random functions because we do not know how the details of their algorithms. We want to build WH random generators that work consistently across platforms, i.e., SAS and Oracle database. This is easier said than done. One of the lessons we learnt was that SAS integer can only have 17 digits while Oracle integer support 38 digits. We have to make sure that the random seed is not too large.

Monday, August 27, 2012

Predictive model scoring in Oracle

As described in an earlier post, "Build Predictive Models Using PL/SQL", we can build models within Oracle using function
DBMS_DATA_MINING.CREATE_MODEL. The models are called mining objects.

The ultimate purpose of building a predictive mode is to apply it to new data. This can be done using SQL function prediction_probability().
We call this function in our SQL scripts. It will return the probability of being fraud, delinquency, clicking an online ads, or whatever the model is predicting.

select ACCOUNT_NUM, Transaction_ID, prediction_probability(fraud_model, 1 using *) score from tbl_transaction_data;

This is probably the most efficient way to deploy a model into production. (Model deployment is traditionally a major headache.) Again, data never leave the database and no standalone analytics server is needed.

Build Predictive Models Using PL/SQL

Many data analytics professionals/data miners are not aware that predictive modeling can be done within Oracle database. I probably have built more than one hundred predictive models this way in recent years. One of the benefits is that data never leave the database. To do this, we take two steps.

1. Define model parameters. We create a table to store parameters including model types, configurations, etc. This is a one time effort. The parameter table can be reused.
CREATE TABLE glm_settings (
setting_name VARCHAR2(30),
setting_value VARCHAR2(30));
-- The default classification algorithm is Naive Bayes. So override
-- this choice to GLM logistic regression using a settings table.
--
BEGIN
-- Populate settings table
INSERT INTO glm_settings (setting_name, setting_value) VALUES
(dbms_data_mining.algo_name, dbms_data_mining.algo_generalized_linear_model);
-- INSERT INTO glm_settings (setting_name, setting_value) VALUES
-- (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
-- Examples of other possible overrides are:
--(dbms_data_mining.glms_ridge_regression, dbms_data_mining.glms_ridge_reg_enable);
COMMIT;
END;
/
2. Run DBMS_DATA_MINING.CREATE_MODEL function to build a model. A model is called a mining object which is simply a database object type like table, view, index.
-- Build a new GLM Model
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'glm_model_1',
mining_function => dbms_data_mining.classification,
data_table_name => 'tbl_training_set',
case_id_column_name => 'record_id',
target_column_name => 'is_fraud',
settings_table_name => 'glm_settings');
END;
/
We can see what models we have built in the database using the following query.
select object_name, object_type, created from all_objects where object_type like '%MINING%' order by created;
OBJECT_NAME OBJECT_TYPE ---------------------------------------- -------------- GLM_CTR MINING MODEL GLM0322A MINING MODEL GLM0322ORG MINING MODEL GLM0322DT MINING MODEL GLM0322EXP MINING MODEL GLM0322V2 MINING MODEL GLM0322V3 MINING MODEL GLM0322V4 MINING MODEL GLM0322V5 MINING MODEL

Calculate correlation matrix using SQL

Any table can be converted into the a simple table with only three columns: rec_id (record id), var_id (column_id), val (value). For example,
Original Table
Record_ID Variable_1 Variable_2 Variable3
1 1.2 2.2 3.3
2 2.0 1.5 4.0
The above table can be converted into the following simple table.
Rec_id var_id val
1 1 1.2
1 2 2.2
1 3 3.3
2 1 2.0
2 2 1.5
2 3 4.0

Once the data are stored in the simple, three column table, the top-right correlation matrix can be easily calculated using the following query.

select a.var_id var_id1, b.var_id var_id2, corr(a.val, b.val) correl
from tbl_simple a, tbl_simple b where a.rec_id=b.rec_id
and a.var_id<=b.var_id group by a.var_id, b.var_id ;

This method can handle unlimited number of records and variables. To improve the performance, we should create index on rec_id and var_id. Also see my yesterday's post about "Calculate correlation coefficient between two variables using SQL".

Sunday, August 26, 2012

Calculate correlation coefficient between two variables using SQL

Oracle function corr() returns the correlation coefficient between two variables. The following scripts show two ways of calculating correlation coefficient.

1. Calculate correlation coefficient using corr() function.
select corr(x,y) from TBL_XY;
.214418947
2. Calculate correlation coefficient directly without using corr() function. The equation is based on Pearson's coefficient (http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient). The results are the same.
with tbl_mean as(
select avg(x) mean_x, avg(y) mean_y from tbl_xy
),
tbl_vec_m_corrected as
(
select x-mean_x mean_x_corrected, y-mean_y mean_y_corrected
from tbl_xy, tbl_mean
)
select sum(mean_x_corrected*mean_y_corrected)/(
sqrt(sum(mean_x_corrected*mean_x_corrected)) *sqrt(sum(mean_y_corrected*mean_y_corrected)))
as corr_coef
from tbl_vec_m_corrected ;
.214418947

Create new tables in Oracle and SQL Server

In Oracle, we can use "create table as select" as shown below.
1. Copy table into a new table.
create table tbl_x as select * from tbl_test;
2.Create a new empty table with the same structure.
create table tbl_y as select * from tbl_test where 2=1;

In SQL Server, we can use select into as shown below.
1. Copy table into a new table.
select * into tbl_x from tbl_test ;
2. Create a new empty table with the same structure.
select * into tbl_y from tbl_test where 2=1;

Friday, August 24, 2012

Calculate the similarity between two strings in Oracle

Oracle function utl_match.edit_distance_similarity (s1 IN VARCHAR2, s2 IN VARCHAR2) returns 100 for perfect match between s1 and s2, 0 for no similarity. It is case sensitive. I have used this functions to match the merchant names in debit card authorizations and claims. I noticed that this function behave slightly differently in 10g and 11g. In 10g, it is possible that it return number larger than 100 when null string is involved and I had to deal with them carefully. This function is particularly useful when performing fuzzing matching between two strings such as merchant names in two tables.

select utl_match.edit_distance_similarity('Hello', 'Hello') dis from dual;
100
/*case sensitive */
select utl_match.edit_distance_similarity('Hello', 'hello') dis from dual;
80
/*case sensitive */
select utl_match.edit_distance_similarity('Hello', 'HELLO') dis from dual;
20
select utl_match.edit_distance_similarity('Hello', 'HLO') dis from dual;
20
select utl_match.edit_distance_similarity('Hello', 'X') dis from dual;
0
select utl_match.edit_distance_similarity('Hello', ' ') dis from dual;
0
select utl_match.edit_distance_similarity('Hello', NULL) dis from dual;

0

Calculate Median value in Oracle

There are three functions that calculate median values, median(), percentile_count() and percentile_disc.
1. The data.
select num from TBL_TEST order by num;
NUM
----------
1
1
2
2
2
3
4
4
5

2. Calculate the median or 50th percentile.
select median(num) f_median,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY num) f_percentile_cont,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY num) f_percentile_disc
from TBL_TEST ;
F_MEDIAN F_PERCENTILE_CONT F_PERCENTILE_DISC
---------- ----------------- -----------------
2 2 2

Wednesday, August 15, 2012

Calculate Z-Score using SQL in Oracle

We can normalize data using Z-score, Z-score=(data value-mean)/standard deviation.
1. Data.
select num from TBL_TEST order by num;
NUM
----------
1
1
2
2
2
3
4
4
5
2. Calculate Z-score.
with tbl_mean_std as
(
select avg(num) m, stddev(num) std from TBL_TEST
)
select num, (num-m)/std as z_score from TBL_TEST , tbl_mean_std order by num;

NUM Z_SCORE
---------- ----------
1 -1.1785113
1 -1.1785113
2 -.47140452
2 -.47140452
2 -.47140452
3 .23570226
4 .942809042
4 .942809042
5 1.64991582

Calculate cumulative percentage in Oracle

Using Cume_dist(), it is easy to calculate the cumulative percentage of a number.

1. The data.
select num from TBL_TEST order by num;
NUM
----------
1
1
2
2
2
3
4
4
5
2. Calculate cumulative percentage using cums_dist()
select num, cume_dist() over(order by num) as cum_percent from TBL_TEST order by num;
NUM CUM_PERCENT
---------- -----------
1 .222222222
1 .222222222
2 .555555556
2 .555555556
2 .555555556
3 .666666667
4 .888888889
4 .888888889
5 1

Rank in Oracle: rank() and dense_rank()

Unlike row_number() function, rank() and dense_rank() may produce duplicate ranks.

1. The data
select num from TBL_TEST order by num;
NUM
----------
1
1
2
2
2
3
4
4
5
2. rank() may have gaps.
select num, rank() over(order by num) as rnk from TBL_TEST order by num;
NUM RNK
---------- ----------
1 1
1 1
2 3
2 3
2 3
3 6
4 7
4 7
5 9
3. dense_ran() does not have gap in ranks.
select num, dense_rank() over(order by num) as rnk from TBL_TEST order by num;

NUM RNK

---------- ----------
1 1
1 1
2 2
2 2
2 2
3 3
4 4
4 4
5 5

Thursday, August 02, 2012

Rank in Oracle

There are several ways to rank records based on certain variables. One of the is to use row_number

select * from TBL_TEST order by grp, num;
GRP NUM
---------- ----------
A 1
A 2
A 2
A 3
A 4
B 1
B 2
B 4
B 5

Row_number() generates unique sequece numbers without gap. Ties are given different numbers. I use row_number() to generate unique id.

Example 1. generate rank separately for GRP
with tbl as(
select a.grp, a.num, row_number() over(partition by GRP order by num) rnk from TBL_TEST a)
select * from tbl order by grp, rnk ;
GRP NUM RNK
---------- ---------- ----------
A 1 1
A 2 2
A 2 3
A 3 4
A 4 5
B 1 1
B 2 2
B 4 3
B 5 4

Example 2. Generate rank without partition by GRP
with tbl as(
select a.grp, a.num, row_number() over(order by grp, num) rnk from TBL_TEST a)
select * from tbl order by grp, rnk ;
GRP NUM RNK
---------- ---------- ----------
A 1 1
A 2 2
A 2 3
A 3 4
A 4 5
B 1 6
B 2 7
B 4 8
B 5 9

I will discuss rank functions including dense_rank(), ran() in another post.

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics