Sunday, November 10, 2013

More on How to Find the Most Important Variables for a Predictive Model

Problem

To make a predictive model, we need independent variables as inputs and a single variable as the target. Typically, both independent and target variables are stored in a single table. Often there are many independent variables, say 50 or 200 of them, such as age, sex, annual income, credit limits, and transaction variables etc. How do we select a small number of variables that are most predicative of the target variable and use them to build a model that is robust?

Solution

In the early post Find the Most Important Variables In Predictive Models, we described that there is a drawback to justify the importance of variable individually. Ideally we should take a set of variables as a whole into consideration. One of the good approaches is Oracle’s Attribute Importance model.

I built a credit card transaction data that contains column is_fraud as the target, id as the unique record identifier and other independent variables that I want to analyze. The data set is v_training_set. I write the following PL/SQL script to build an attribute importance model.

begin
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'VAR_IMPORTANCE',
mining_function => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
data_table_name => 'v_training_set',
case_id_column_name => 'id',
target_column_name => 'is_fraud');
END;

When the attribute importance model,VAR_IMPORTANCE, is done, all independent, i.e., all variables except target and record identifiers, are assigned an importance value. The higher the value of a variable, the more important it is in predicting the target. We can review our result using the following SQL. (I deliberately masked the independent variable names because I think fraud detection is an sensitive matter and we do not want to give away too much information to fraudsters who might be reading this blog post.)

SQL> select attribute_name, IMPORTANCE_VALUE, rank from TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('VAR_IMPORTANCE')) order by rank;

ATTRIBUTE_NAME IMPORTANCE_VALUE RANK
VAR__EMV___ .024118198 1
VAR__TER___ .014412195 2
VAR__RET___ .013569008 3
VAR__PT____ .008679484 4
VAR__TRA___ .008009003 5
VAR__TER___ .007207152 6
VAR__CHK___ .006322795 7
VAR__NRT___ .00591138 8
VAR__NRT___ .005564262 9
VAR__MSG___ .005332518 10
VAR__TRA___ .004457798 11
VAR__NRT___ .00409855 12
VAR__PIN___ .003852347 13
VAR__PRO___ .001177829 14
VAR__RES___ .000911889 15
VAR__TER___ .000767663 16
VAR__FIL___ .000448031 17
VAR__ACC___ .000172331 18
VAR__PRM___ .00009502 19
VAR__AUT___ .00009502 19
VAR__FRW___ .00009502 19
VAR__AUT___ .000092784 20
VAR__ACC___ .000012229 21
VAR__PT____ .000003078 22
VAR__ACC___ -6.946E-06 23
VAR__TRA___ -.02799773 24
VAR__TER___ -.16968261 25
VAR__TRA___ -.39228472 26
VAR__TIE___ -.57372309 27

Conclusion

Oracle’s Attribute Importance function ranks variables based on their importance in predicting the target. It is a great tool for selecting a small number of input variables out of many before we build a predictive model.

2 comments:

Deepesh "DJ" said...

Hi, Is the attribute importance function in oracle similar to the Information Value feature used commonly in analytics?

dora britton said...

Your article is very nice thank you for share this information. Its very useful for all peoples there are all possibilities to provide for this blog. I am a writer working with best essay writing service it will give us the best choice of writing service for your academic life.