Deep Data Mining Blog: More on How to Find the Most Important Variables for a Predictive Model

Sunday, November 10, 2013

More on How to Find the Most Important Variables for a Predictive Model

Problem

To make a predictive model, we need independent variables as inputs and a single variable as the target. Typically, both independent and target variables are stored in a single table. Often there are many independent variables, say 50 or 200 of them, such as age, sex, annual income, credit limits, and transaction variables etc. How do we select a small number of variables that are most predicative of the target variable and use them to build a model that is robust?

Solution

In the early post Find the Most Important Variables In Predictive Models, we described that there is a drawback to justify the importance of variable individually. Ideally we should take a set of variables as a whole into consideration. One of the good approaches is Oracle’s Attribute Importance model.

I built a credit card transaction data that contains column is_fraud as the target, id as the unique record identifier and other independent variables that I want to analyze. The data set is v_training_set. I write the following PL/SQL script to build an attribute importance model.

begin
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'VAR_IMPORTANCE',
mining_function => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
data_table_name => 'v_training_set',
case_id_column_name => 'id',
target_column_name => 'is_fraud');
END;

When the attribute importance model,VAR_IMPORTANCE, is done, all independent, i.e., all variables except target and record identifiers, are assigned an importance value. The higher the value of a variable, the more important it is in predicting the target. We can review our result using the following SQL. (I deliberately masked the independent variable names because I think fraud detection is an sensitive matter and we do not want to give away too much information to fraudsters who might be reading this blog post.)

SQL> select attribute_name, IMPORTANCE_VALUE, rank from TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('VAR_IMPORTANCE')) order by rank;

ATTRIBUTE_NAME	IMPORTANCE_VALUE	RANK
VAR__EMV___	.024118198	1
VAR__TER___	.014412195	2
VAR__RET___	.013569008	3
VAR__PT____	.008679484	4
VAR__TRA___	.008009003	5
VAR__TER___	.007207152	6
VAR__CHK___	.006322795	7
VAR__NRT___	.00591138	8
VAR__NRT___	.005564262	9
VAR__MSG___	.005332518	10
VAR__TRA___	.004457798	11
VAR__NRT___	.00409855	12
VAR__PIN___	.003852347	13
VAR__PRO___	.001177829	14
VAR__RES___	.000911889	15
VAR__TER___	.000767663	16
VAR__FIL___	.000448031	17
VAR__ACC___	.000172331	18
VAR__PRM___	.00009502	19
VAR__AUT___	.00009502	19
VAR__FRW___	.00009502	19
VAR__AUT___	.000092784	20
VAR__ACC___	.000012229	21
VAR__PT____	.000003078	22
VAR__ACC___	-6.946E-06	23
VAR__TRA___	-.02799773	24
VAR__TER___	-.16968261	25
VAR__TRA___	-.39228472	26
VAR__TIE___	-.57372309	27

Conclusion

Oracle’s Attribute Importance function ranks variables based on their importance in predicting the target. It is a great tool for selecting a small number of input variables out of many before we build a predictive model.

1 comment:

Deepesh "DJ" said...: Hi, Is the attribute importance function in oracle similar to the Information Value feature used commonly in analytics?; 8:08 AM

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics