Friday, September 14, 2012

Data preparation for building logistic regression models

Logistic regression is a popular model for predicting a binary outcome, e.g., fraud or not, customer response or not. When building a logistic regression model, there are two ways to represent the data. The first is the transactional format shown below. Each record is a single transaction and the target variable is binary.
                Data in transactional format
      1001 A  MA       01-MAY-12 A          N
      1001 A  MA       08-MAY-12 A          N
      1001 A  MA       15-MAY-12 A          Y
      1001 A  MA       22-MAY-12 A          N
      1001 A  MA       29-MAY-12 A          N
      1001 A  MA       06-JUN-12 A          N
      1001 B  CT       06-JUN-12 A          N
      1002 B  CT       01-MAY-12 A          N
      1002 B  CT       08-MAY-12 A          N
      1002 B  CT       15-MAY-12 A          Y
      1002 B  CT       22-MAY-12 A          N
      1002 B  CT       29-MAY-12 A          Y

If all the independent variables are categorical, we can convert the data from the transactional format into a more compact one by summarizing it with a SQL script similar to the following. We count the number of responses and non-responses for each unique combination of independent variable values. Continuous variables can first be transformed into categorical ones, if we want, using techniques such as binning.

select cat, state_cd, campgain_cd, 
sum(case when response='Y' then 1 else 0 end) num_response,
sum(case when response='N' then 1 else 0 end) num_no_response
from tbl_txn group by cat, state_cd, campgain_cd;
                 Data in the summary format
  A  MA       A        125       1025
  B  CT       C        75        2133
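
For readers who prefer to do this step in R, the following is a minimal sketch of the same idea: bin a continuous variable, then count responses and non-responses per combination of categorical values. The data frame `txn`, the `amount` column, and the bin cut points are hypothetical and only for illustration; the other column names follow the SQL above, including its spelling of `campgain_cd`.

```r
# Hypothetical toy data; 'amount' is an invented continuous variable.
txn <- data.frame(
  cat         = c("A", "A", "A", "B", "B", "B"),
  state_cd    = c("MA", "MA", "MA", "CT", "CT", "CT"),
  campgain_cd = c("A", "A", "A", "A", "A", "A"),
  amount      = c(12.5, 80, 15, 220, 45, 300),
  response    = c("N", "Y", "N", "N", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Bin the continuous variable into three categories (cut points are assumptions).
txn$amount_bin <- cut(txn$amount,
                      breaks = c(0, 50, 100, Inf),
                      labels = c("low", "mid", "high"))

# 0/1 indicators, mirroring the CASE expressions in the SQL.
txn$num_response    <- as.integer(txn$response == "Y")
txn$num_no_response <- as.integer(txn$response == "N")

# One row per unique combination of the categorical variables.
summary_tbl <- aggregate(
  cbind(num_response, num_no_response) ~ cat + state_cd + campgain_cd + amount_bin,
  data = txn,
  FUN  = sum
)

print(summary_tbl)
```

Doing the aggregation in the database, as in the SQL above, keeps the transaction-level data out of R's memory; the R version is mainly useful when the data is already loaded.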

Summarizing the data first can greatly reduce its size and save memory when building the model. This is particularly useful if we use memory-based modeling tools such as R.

If we use R to build the logistic regression model, the script for training data in transactional format is similar to the following.
  data=train.set1,family = binomial(link = "logit")) ->model1

The R script for building a logistic model based on summary data is shown below.
  data=train.set1,family = binomial(link = "logit")) ->model2
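
As a self-contained illustration of the two approaches, the sketch below fits `glm` once on simulated transaction-level data with a 0/1 target and once on the summarized counts, using a two-column `cbind(successes, failures)` response. The simulated data, the variable names `txn` and `smry`, and the choice of predictors are assumptions made for this example, not the original training set. Because both forms describe the same binomial likelihood, the estimated coefficients should agree up to numerical tolerance.

```r
set.seed(1)

# Simulated transactional data (hypothetical): one row per transaction.
n <- 200
txn <- data.frame(
  cat      = sample(c("A", "B"), n, replace = TRUE),
  state_cd = sample(c("MA", "CT"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Binary target coded 0/1; response rate depends on 'cat'.
txn$response <- rbinom(n, 1, ifelse(txn$cat == "A", 0.3, 0.1))

# Model on the transactional format.
model1 <- glm(response ~ cat + state_cd,
              data = txn, family = binomial(link = "logit"))

# Summarize to counts of responses and non-responses per combination.
txn$num_response    <- txn$response
txn$num_no_response <- 1 - txn$response
smry <- aggregate(cbind(num_response, num_no_response) ~ cat + state_cd,
                  data = txn, FUN = sum)

# Model on the summary format: the response is a (successes, failures) matrix.
model2 <- glm(cbind(num_response, num_no_response) ~ cat + state_cd,
              data = smry, family = binomial(link = "logit"))

# Compare the coefficient estimates of the two fits.
print(all.equal(coef(model1), coef(model2), tolerance = 1e-5))
```

Note that the reported deviance and degrees of freedom differ between the two fits, since the summary data has far fewer rows; the coefficient estimates and their standard errors are what carry over.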

1 comment:

Anonymous said...

This stage in the model development process is probably the longest and most difficult phase of any credit risk model development project. Its main purpose is to determine whether a scorecard can be built (or not), as well as to set the high-level parameters for the project. Those parameters are typically exclusions, target definition, sample window, and performance window.

I talk about this at Highstone Tower blog very often... feel free to comment