Deep Data Mining Blog: April 2013

Monday, April 22, 2013

Find the Most Important Variables In Predictive Models

A commonly used method in determining the most important variables is to examine how well each variable individually in predicting the target variable. However, this approach has its limits. The first limit is that it excludes variables that are actually good ones.

For example, to find if credit score is a good indicator of account default, we calculate the default rate for each credit class as shown below (or we may perform some statistical tests such as a Chi-Square test for that matter). As we can see, the low credit score class have a default rate of 18% vs that of 6% for the high credit score class. Thus, credit class is considered as a good variable to build a default model.

Credit score and account default.

However, this approach has its limit because it does not take the relationship among variables into consideration. This can be illustrated using an imaginary example. We want to asses if height and weight of people are indicative of getting a disease. We can calculate the following tables for height and weight, respectively. Since short or tall people have the same percentage of sick people (2%), we may conclude that height is not relevant to predicting the disease. Similarly, we also think weight is not important.

Height and Disease

Weight and Disease

If examining weight and height at the same time, we can develop the following matrix. There are four groups of people, high/heavy (normal), short/light (normal), high/light(abnormal), and short/heavy (abnormal). 11% of short/heavy or light/tall people are sick (orange cells). While the percentage of sick people from tall/heavy and short/light groups (green cells) is only 0.1%. Thus, height and weight are very good variables to be included in a predictive model.

Height/Weight and Disease

As we see, this approach may exclude variables that are actually good. When we determine the most important variables for building a predictive model, ideally we should take a set of variables as a whole into consideration. More often than not, it is the relationships between variables that provide the best predictive power. How to find or generate the most useful variables for predictive models is so crucial that we will talk more about it in upcoming blog posts. I have written another post More on How to Find the Most Important Variables for a Predictive Model using Oracle Attribute Importance function.

Sunday, April 07, 2013

Logistic Regression Model Implemented in SQL

In a project, we need to deploy a logistic regression model into a production system that only takes SQL scripts as its input. Two functions come in handy, decode() and nvl(). Decode() converts categorical value into a weight and nvl() conveniently replaces null with a desired value. The following SQL scripts is similar to what we delivered.

select transaction_id,

(1/
(1+
exp(-(
nvl(AMT1*.000019199,0)+
nvl(AMT3*(-.00002155),0)+
decode(nvl(substr((TXN_CODE1),1,18),' '),
'XX',-.070935,
'57',-.192319,
'1',-.053794,
'81',-.010813,
'NR',-.079628,
'PD',-.102987,
'P',-1.388433,
'Z6',-.106081,
'01',-1.1528,
'Z4',-.004237,
'T1',.697737,
'AK',-.490381,
'U2',.063712,
'NK',.054354,
'PR',.205336,
'51',-.286213,
'N',.075582,
' ',-.330585,
0)+
decode(nvl(substr( trim(TXN_CODE2),1,18),' '),
'U',-.11176,
0)+
decode(nvl(substr( trim(TXN_CODE3),1,18),' '),
'1',-.642605,
0)+
decode(nvl(substr( trim(TXN_CODE4),1,18),' '),
'00',-.084517,
'10',.057248,
0)

-6.8190776

) as score from tbl_data;

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics

Popular Topics

Monday, April 22, 2013

Find the Most Important Variables In Predictive Models

Sunday, April 07, 2013

Logistic Regression Model Implemented in SQL