Monday, April 22, 2013

Find the Most Important Variables In Predictive Models

A commonly used method for determining the most important variables is to examine how well each variable individually predicts the target variable. However, this approach has its limits. One limit is that it may exclude variables that are actually good ones.

For example, to find out whether credit score is a good indicator of account default, we calculate the default rate for each credit class as shown below (or we may perform a statistical test such as a Chi-Square test). As we can see, the low credit score class has a default rate of 18%, versus 6% for the high credit score class. Thus, credit class is considered a good variable for building a default model.

Credit score and account default.
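
The single-variable check described above can be sketched in a few lines of Python. The account counts here are hypothetical, chosen only so that the default rates match the 18% and 6% quoted in the text; the names `accounts`, `defaults`, `default_rate` and `chi_square_2x2` are illustrative and not from the original analysis. The sketch computes the default rate per credit class and a Pearson Chi-Square statistic (without continuity correction) for the 2x2 table.

```python
import math

# Hypothetical counts (assumption): picked only to reproduce the 18% and 6% rates.
accounts = {"low": 1000, "high": 1000}
defaults = {"low": 180, "high": 60}


def default_rate(credit_class):
    """Share of accounts in a credit class that defaulted."""
    return defaults[credit_class] / accounts[credit_class]


def chi_square_2x2(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: credit class; columns: defaulted, did not default.
table = [[defaults[c], accounts[c] - defaults[c]] for c in ("low", "high")]
chi2 = chi_square_2x2(table)
# For 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

for c in ("low", "high"):
    print(f"{c} credit score: default rate {default_rate(c):.0%}")
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.2e}")
```

With counts like these the p-value is very small, which is the kind of evidence that would lead us to keep credit class as a candidate variable.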

However, this approach is limited because it does not take the relationships among variables into consideration. This can be illustrated using an imaginary example. We want to assess whether the height and weight of people are indicative of getting a disease. We can calculate the following tables for height and weight, respectively. Since short and tall people have the same percentage of sick people (2%), we may conclude that height is not relevant to predicting the disease. Similarly, we may also conclude that weight is not important.

Height and Disease

Weight and Disease

If we examine weight and height at the same time, we can develop the following matrix. There are four groups of people: tall/heavy (normal), short/light (normal), tall/light (abnormal), and short/heavy (abnormal). 11% of short/heavy or tall/light people are sick (orange cells), while the percentage of sick people in the tall/heavy and short/light groups (green cells) is only 0.1%. Thus, height and weight, taken together, are very good variables to include in a predictive model.

Height/Weight and Disease
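
To make the arithmetic concrete, here is a minimal Python sketch of the imaginary example. The group sizes are hypothetical (an assumption, not data from the post); they were picked so that the marginal sick rates come out at 2% while the joint rates come out at 0.1% and 11%, as in the text. The names `groups` and `sick_rate` are illustrative only.

```python
# Hypothetical counts per (height, weight) group: (number of people, number sick).
# These are assumptions chosen to match the percentages quoted in the text.
groups = {
    ("tall", "heavy"): (9000, 9),     # normal build, 0.1% sick
    ("short", "light"): (9000, 9),    # normal build, 0.1% sick
    ("tall", "light"): (1900, 209),   # abnormal build, 11% sick
    ("short", "heavy"): (1900, 209),  # abnormal build, 11% sick
}


def sick_rate(height=None, weight=None):
    """Sick rate over all groups matching the given height and/or weight."""
    people = sick = 0
    for (h, w), (n, s) in groups.items():
        if height in (None, h) and weight in (None, w):
            people += n
            sick += s
    return sick / people


# One variable at a time: every class shows the same rate.
for h in ("tall", "short"):
    print(f"height={h}: {sick_rate(height=h):.1%}")
for w in ("heavy", "light"):
    print(f"weight={w}: {sick_rate(weight=w):.1%}")

# Both variables together: the groups separate clearly.
for h, w in groups:
    print(f"height={h}, weight={w}: {sick_rate(h, w):.1%}")
```

Looked at one variable at a time, every class has a 2% sick rate, so neither variable appears useful. Only the combination of height and weight separates the 0.1% groups from the 11% groups.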


As we see, examining variables one at a time may exclude variables that are actually good. When we determine the most important variables for building a predictive model, ideally we should take a set of variables as a whole into consideration. More often than not, it is the relationships between variables that provide the best predictive power. How to find or generate the most useful variables for predictive models is so crucial that we will talk more about it in upcoming blog posts. I have written another post, "More on How to Find the Most Important Variables for a Predictive Model", using the Oracle Attribute Importance function.

3 comments:

Jose Maria Gomez Hidalgo said...

Hi

This is a simple and nice example to get the point. And it makes the case for checking groups of variables in terms of combined Information Gain, for instance using WEKA search methods (classes inheriting from ASSearch).

However I must say that in Text Mining problems, when you are forced to handle thousands of variables, examining the predictive power of groups of variables can be very costly. I believe you may suggest using algebraic methods for feature extraction like Singular Value Decomposition for those cases...

Thanks for the post and regards

Jay Zhou, PhD. said...

Jose,

Thank you for your comments. You are absolutely right. Methods like Singular Value Decomposition are great ways to generate feature variables. I will talk about them in another post.

Jay
