Monday, April 22, 2013

Find the Most Important Variables In Predictive Models

A commonly used method in determining the most important variables is to examine how well each variable individually in predicting the target variable. However, this approach has its limits. The first limit is that it excludes variables that are actually good ones.

For example, to find if credit score is a good indicator of account default, we calculate the default rate for each credit class as shown below (or we may perform some statistical tests such as a Chi-Square test for that matter). As we can see, the low credit score class have a default rate of 18% vs that of 6% for the high credit score class. Thus, credit class is considered as a good variable to  build a default model.

Credit score and account default.

However, this approach has its limit because it does not take the relationship among variables into consideration.  This can be illustrated using an imaginary example. We want to asses if height  and weight of people are indicative of getting a disease. We can calculate the following tables for height and weight, respectively. Since short or tall people have the same percentage of sick people (2%), we may conclude that height is not relevant to predicting the disease. Similarly, we also  think weight is not important.

Height and Disease

Weight and Disease

If examining weight and height at the same time, we can develop the following matrix. There are four groups of people, high/heavy (normal), short/light (normal), high/light(abnormal), and short/heavy (abnormal).  11% of short/heavy or light/tall people are sick (orange cells). While the percentage of sick people from tall/heavy and short/light  groups (green cells) is only 0.1%. Thus, height and weight are very good variables to be included in a predictive model.

                                                      Height/Weight and Disease


As we see, this approach may exclude variables that are actually good. When we determine the most important variables for building a predictive model, ideally we should take a set of variables as a whole into consideration. More often than not, it is the relationships between variables that provide the best predictive power. How to find or generate the most useful variables for predictive models is so crucial that we will talk more about it in upcoming blog posts. I have written another post More on How to Find the Most Important Variables for a Predictive Model using Oracle Attribute Importance function.

Sunday, April 07, 2013

Logistic Regression Model Implemented in SQL

In a project, we need to deploy a logistic regression model into a production system that only takes SQL scripts as its input. Two functions come in handy, decode() and nvl(). Decode() converts categorical value into a weight and nvl() conveniently replaces null with a desired value. The following SQL scripts is similar to what we delivered.



select transaction_id,
(1/
(1+
exp(-(
nvl(AMT1*.000019199,0)+
nvl(AMT3*(-.00002155),0)+
decode(nvl(substr((TXN_CODE1),1,18),' '),
'XX',-.070935,
'57',-.192319,
'1',-.053794,
'81',-.010813,
'NR',-.079628,
'PD',-.102987,
'P',-1.388433,
'Z6',-.106081,
'01',-1.1528,
'Z4',-.004237,
'T1',.697737,
'AK',-.490381,
'U2',.063712,
'NK',.054354,
'PR',.205336,
'51',-.286213,
'N',.075582,
' ',-.330585,
0)+
decode(nvl(substr( trim(TXN_CODE2),1,18),' '),
'U',-.11176,
0)+
decode(nvl(substr( trim(TXN_CODE3),1,18),' '),
'1',-.642605,
0)+
decode(nvl(substr( trim(TXN_CODE4),1,18),' '),
'00',-.084517,
'10',.057248,
0)
-6.8190776
)
)
)
) as score from tbl_data;