To make a predictive model, we need independent variables as inputs and a single variable as the target. Typically, both independent and target variables are stored in a single table. Often there are many independent variables, say 50 or 200 of them, such as age, sex, annual income, credit limits, and transaction variables etc. How do we select a small number of variables that are most predicative of the target variable and use them to build a model that is robust?
In the early post Find the Most Important Variables In Predictive Models, we described that there is a drawback to justify the importance of variable individually. Ideally we should take a set of variables as a whole into consideration. One of the good approaches is Oracle’s Attribute Importance model.
I built a credit card transaction data that contains column is_fraud as the target, id as the unique record identifier and other independent variables that I want to analyze. The data set is v_training_set. I write the following PL/SQL script to build an attribute importance model.begin
model_name => 'VAR_IMPORTANCE',
mining_function => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
data_table_name => 'v_training_set',
case_id_column_name => 'id',
target_column_name => 'is_fraud');
When the attribute importance model,VAR_IMPORTANCE, is done, all independent, i.e., all variables except target and record identifiers, are assigned an importance value. The higher the value of a variable, the more important it is in predicting the target. We can review our result using the following SQL. (I deliberately masked the independent variable names because I think fraud detection is an sensitive matter and we do not want to give away too much information to fraudsters who might be reading this blog post.)SQL> select attribute_name, IMPORTANCE_VALUE, rank from TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('VAR_IMPORTANCE')) order by rank;
Oracle’s Attribute Importance function ranks variables based on their importance in predicting the target. It is a great tool for selecting a small number of input variables out of many before we build a predictive model.