Sunday, October 27, 2013

Categorical Variables in Logistic Regression

In the old post Build Predictive Models Using PL/SQL, we showed how to call the DBMS_DATA_MINING.CREATE_MODEL() procedure to build a logistic regression model.

The input data set should contain (and only contain) the following columns:
1. a unique case id;
2. the target variable;
3. the independent variables used in the model. All columns other than the case id and the target variable are treated as input variables to the model.

We can easily construct the input data set as a view based on a data table. In the view, we select the case id, the target variable and the independent variables that we want.
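For example, a minimal sketch might look like the following. The table txn_history, its columns, the view and model names, and the settings table FRAUD_LR_SETTINGS (assumed to hold the logistic regression / GLM settings) are hypothetical stand-ins, not the actual objects from our project.

create or replace view model_input_v as
select account_id as case_id,   -- 1. unique case id
       fraud_flag as target,    -- 2. target variable
       txn_code1,               -- 3. independent variables
       txn_amount,
       state_name
  from txn_history;

begin
  dbms_data_mining.create_model(
    model_name          => 'FRAUD_LR_MODEL',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'MODEL_INPUT_V',
    case_id_column_name => 'CASE_ID',
    target_column_name  => 'TARGET',
    settings_table_name => 'FRAUD_LR_SETTINGS');
end;
/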

Independent variables are either numeric or character types. Character variables, such as state name or male/female, are categorical. Oracle models automatically treat the most frequent category as the reference class and assign it a weight of zero, which is very convenient.

For example, we built a model that uses transaction code as one of its input variables, as mentioned in the post Logistic Regression Model Implemented in SQL. Take a look at the piece of SQL code below. It converts txn_code into the weight derived from a logistic regression model. substr(TXN_CODE1,1,18) takes only the first 18 characters of txn_code (in case it is too long), and nvl() treats a missing value as a blank. The txn_code 'XX' receives a weight of -.070935, and a NULL or blank value a weight of -.330585. If a new txn_code that was unseen in the training data set, say 'ZZZ', shows up in production, it receives the default weight of 0, which is the weight of the most frequent txn_code. This makes sense, as we can assume that an unseen code shares the weight of the historically most common codes. Thus, the model can produce a reasonable score under any circumstance. This example also shows how important it is to design a model that can handle unseen situations after it is deployed.

decode(nvl(substr((TXN_CODE1),1,18),' '),
'XX',-.070935,
'57',-.192319,
'1',-.053794,
'81',-.010813,
'NR',-.079628,
'PD',-.102987,
'P',-1.388433,
'Z6',-.106081,
'01',-1.1528,
'Z4',-.004237,
'T1',.697737,
'AK',-.490381,
'U2',.063712,
'NK',.054354,
'PR',.205336,
'51',-.286213,
'N',.075582,
' ',-.330585,
0)
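In the full scoring SQL, a decoded weight like this is just one additive term in the linear predictor, and the probability comes from the logistic function. A minimal sketch, with a hypothetical intercept and a hypothetical numeric term (the actual coefficients would come from the fitted model):

select t.*,
       1 / (1 + exp(-(
             -2.5                                      -- hypothetical intercept
           + decode(nvl(substr(TXN_CODE1,1,18),' '),
                    'XX',-.070935, '57',-.192319, /* ... other codes ... */ ' ',-.330585, 0)
           + .0013 * TXN_AMOUNT                        -- hypothetical numeric variable
         ))) as score
  from txn_history t;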

Normally, SQL code like the above that converts values to weights is not necessary; instead, we would use the model mining object and the prediction_probability function. I took this approach simply because the database administrators of the production databases were unaware of Oracle mining objects and did not feel comfortable using them. Thus, to get our predictive models deployed into production systems, data miners need to be flexible. I have seen too many good models built in labs that never got deployed.
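For comparison, scoring with the mining object directly would look roughly like this; the model name FRAUD_LR_MODEL, the table txn_history and the target value 1 are the hypothetical names from the sketch above, not the actual production objects.

select account_id,
       prediction_probability(FRAUD_LR_MODEL, 1 using *) as score
  from txn_history;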
