## Tuesday, December 11, 2012

### More on Data Preparation for Logistic Regression Models

In the earlier post Data preparation for building logistic regression models , we talked about converting data into more compact format so that we can use memory intensive software like R to handle large number of cases.

Another issue that we commonly face is how to deal with "unbalanced" data. The following are some examples:
1. Only 7 to 12 fraudulent ones out of 10,000 transactions.
2. Online ads has only 1 to 3 clicks per 10,000 impressions.
3. Mobile phone account non-payment rate is 12%.
4. It is assumed the 6% of medical claims are fraudulent.

For the sake of reducing data volume without sacrificing model accuracy, it makes sense to reduce the cases in the major classes (good card transactions, non-click impressions, good mobile phone accounts, normal medical claims) through random sampling. For example, to gather 1,000 fraudulent bank card transactions within certain period, there will be 10 million good transactions from the same period. We can use the approach described in an earlier post More on random sampling in Oracle. However, the final model training data sets do NOT have to be perfectly balanced, i.e., fraud and good transactions do NOT have to be 50% and 50%. For example, it is OK that the model training data set has 80% good transactions and 20% fraudulent ones. For example, we can build two logistic regression models:

Model A. training set: 50% good, 50% fraud
Model B. training set: 80% good, 20% fraud

Model A and B will produce different probabilities of being fraud for an individual transaction(Probability of being fraud produced by Model A will most likely be higher than that produced by Model B). However, the relative ranks of transactions' probabilities of being fraud for a whole data set given by both models could be the same. For example, the top 1% riskiest transactions identified by both models are the same. Thus both models are equivalent. As we can see in many applications what matters is the relative ranks (of being fraud, being non-payment, etc.) produced by the predictive models. The quality/usefulness of relative ranks can be depicted by gain charts.