Deep Data Mining Blog: Data preparation for building logistic regression models

Friday, September 14, 2012

Data preparation for building logistic regression models

Logistic regression is a popular model for predicting a binary outcome, e.g., fraud or not, customer response or not. When building a logistic regression model, there are two ways to represent the data. The first is the transactional format, shown below: each record is a single transaction, and the target variable is binary.
Data in transactional format
USER_ID CAT STATE_CD CAMPAIGN_DATE CAMPAIGN_CD RESPONSE
1001 A MA 01-MAY-12 A N
1001 A MA 08-MAY-12 A N
1001 A MA 15-MAY-12 A Y
1001 A MA 22-MAY-12 A N
1001 A MA 29-MAY-12 A N
1001 A MA 06-JUN-12 A N
1001 B CT 06-JUN-12 A N
1002 B CT 01-MAY-12 A N
1002 B CT 08-MAY-12 A N
1002 B CT 15-MAY-12 A Y
1002 B CT 22-MAY-12 A N
1002 B CT 29-MAY-12 A Y

If all the independent variables are categorical, we can convert the transactional data into a more compact summary format using a SQL script similar to the following. The script counts the responses and non-responses for each unique combination of independent variable values. Continuous variables can first be transformed into categorical ones, if we want, using techniques such as binning.

-- One output row per unique combination of the independent variables
select cat, state_cd, campaign_cd,
       sum(case when response = 'Y' then 1 else 0 end) as num_response,
       sum(case when response = 'N' then 1 else 0 end) as num_no_response
from tbl_txn
group by cat, state_cd, campaign_cd;
Data in the summary format
CAT STATE_CD CAMPAIGN_CD NUM_RESPONSE NUM_NO_RESPONSE
A MA A 125 1025
B CT C 75 2133
..........................
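
For readers who want to check the aggregation outside the database, here is a minimal Python sketch of the same group-by logic. It is an illustration rather than part of the original workflow: the rows are the twelve sample records from the transactional table above, and the names `transactions` and `summarize` are made up for this example.

```python
from collections import defaultdict

# Sample rows copied from the transactional table above:
# (USER_ID, CAT, STATE_CD, CAMPAIGN_DATE, CAMPAIGN_CD, RESPONSE)
transactions = [
    (1001, "A", "MA", "01-MAY-12", "A", "N"),
    (1001, "A", "MA", "08-MAY-12", "A", "N"),
    (1001, "A", "MA", "15-MAY-12", "A", "Y"),
    (1001, "A", "MA", "22-MAY-12", "A", "N"),
    (1001, "A", "MA", "29-MAY-12", "A", "N"),
    (1001, "A", "MA", "06-JUN-12", "A", "N"),
    (1001, "B", "CT", "06-JUN-12", "A", "N"),
    (1002, "B", "CT", "01-MAY-12", "A", "N"),
    (1002, "B", "CT", "08-MAY-12", "A", "N"),
    (1002, "B", "CT", "15-MAY-12", "A", "Y"),
    (1002, "B", "CT", "22-MAY-12", "A", "N"),
    (1002, "B", "CT", "29-MAY-12", "A", "Y"),
]


def summarize(rows):
    """Count responses and non-responses per (CAT, STATE_CD, CAMPAIGN_CD)."""
    # key -> [num_response, num_no_response]
    counts = defaultdict(lambda: [0, 0])
    for _user_id, cat, state_cd, _date, campaign_cd, response in rows:
        pair = counts[(cat, state_cd, campaign_cd)]  # creates the group if new
        if response == "Y":
            pair[0] += 1
        elif response == "N":
            pair[1] += 1
    return {key: tuple(pair) for key, pair in counts.items()}


summary = summarize(transactions)
for (cat, state_cd, campaign_cd), (num_response, num_no_response) in sorted(summary.items()):
    print(cat, state_cd, campaign_cd, num_response, num_no_response)
```

On these twelve rows the sketch produces two summary rows, (A, MA, A) with 1 response and 5 non-responses, and (B, CT, A) with 2 responses and 4 non-responses. USER_ID and CAMPAIGN_DATE are dropped because they are not used as independent variables.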

Summarizing the data first can greatly reduce its size and save memory when building the model. This is particularly useful if we use memory-based modeling tools such as R.
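
The size reduction depends on every independent variable being categorical, which is where the binning mentioned earlier comes in. The following Python sketch shows one simple way to bin a continuous variable. The variable AGE, the bin edges, and the labels are all hypothetical; they do not appear in the data above and would have to be chosen for the problem at hand.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels for a hypothetical continuous variable AGE.
# A value x falls into bin i when AGE_EDGES[i-1] <= x < AGE_EDGES[i].
AGE_EDGES = [25, 35, 50, 65]
AGE_LABELS = ["<25", "25-34", "35-49", "50-64", "65+"]


def bin_age(age):
    """Map a numeric age to the label of the bin that contains it."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


ages = [19, 25, 34.5, 50, 64, 80]
binned = [bin_age(a) for a in ages]
print(binned)
```

Once a variable is binned, it can be added to the group by list like any other categorical column. The number of summary rows can then never exceed the product of the numbers of distinct levels of the grouping variables, no matter how many transactions there are.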

If we use R to build the logistic regression model, the script for training data in transactional format is similar to the following.
# RESPONSE must be a factor (levels N, Y) or a 0/1 variable
model1 <- glm(formula = RESPONSE ~ CAT + STATE_CD + CAMPAIGN_CD,
              data = train.set1, family = binomial(link = "logit"))

The R script for building a logistic model based on summary data is shown below.

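# Here train.set1 must hold the summary-format data, one row per combination;
# cbind() takes the success count first, then the failure count
model2 <- glm(formula = cbind(NUM_RESPONSE, NUM_NO_RESPONSE) ~ CAT + STATE_CD + CAMPAIGN_CD,
              data = train.set1, family = binomial(link = "logit"))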
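
The two formats lead to the same coefficient estimates because their log-likelihoods differ only by a constant that does not involve the coefficients. The Python sketch below checks this numerically on the twelve sample rows from the transactional table. The group probabilities are arbitrary illustrative values, not fitted ones, and all function and variable names are made up for this example.

```python
import math


def sigmoid(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


GROUP_A = ("A", "MA", "A")
GROUP_B = ("B", "CT", "A")

# Arbitrary illustrative probabilities of response for each group.
p_by_group = {GROUP_A: sigmoid(-1.2), GROUP_B: sigmoid(-0.4)}

# Transactional format: one (group, y) pair per record, y = 1 for RESPONSE 'Y'.
rows = [(GROUP_A, y) for y in (0, 0, 1, 0, 0, 0)] + \
       [(GROUP_B, y) for y in (0, 0, 0, 1, 0, 1)]

# Summary format: group -> (num_response, num_no_response) for the same records.
summary = {GROUP_A: (1, 5), GROUP_B: (2, 4)}


def loglik_transactional(rows, p):
    """Bernoulli log-likelihood, one term per record."""
    return sum(math.log(p[g]) if y == 1 else math.log(1.0 - p[g]) for g, y in rows)


def loglik_summary_kernel(summary, p):
    """Binomial log-likelihood without the binomial-coefficient term."""
    return sum(s * math.log(p[g]) + f * math.log(1.0 - p[g])
               for g, (s, f) in summary.items())


def log_binomial_constant(summary):
    """Sum of log C(n, s); depends only on the counts, not on the probabilities."""
    return sum(math.log(math.comb(s + f, s)) for s, f in summary.values())


ll_rows = loglik_transactional(rows, p_by_group)
ll_kernel = loglik_summary_kernel(summary, p_by_group)
ll_binomial = ll_kernel + log_binomial_constant(summary)

print(math.isclose(ll_rows, ll_kernel))
```

Since the two log-likelihoods agree up to the constant from `log_binomial_constant`, maximizing either one gives the same coefficients. Quantities that depend on the constant or on the number of observations, such as the reported deviance and degrees of freedom, can still differ between the two fits.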

1 comment:

Anonymous said...: This stage in the model development process is probably the longest and most difficult phase of any credit risk model development project. Its main purpose is to determine whether a scorecard can be built (or not), as well as to set the high-level parameters for the project. Those parameters are typically exclusions, target definition, sample window, and performance window.

I talk about this at Highstone Tower blog very often... feel free to comment

http://www.highstonetower.com/?p=1718; 10:47 AM
