Saturday, April 05, 2014

Zip Code and Predictive Models

It is obvious that zip codes should be treated as categorical variable instead of numeric. Since zip codes have many distinct values, in some cases they may practically become the unique identifiers for data points. We need to deal with them carefully when building predictive models. Otherwise, the models may perform well on training data set but not so well on new data set due to overlearning problem.

For some predictive models such as decision trees and tree-based ensemble models, we can use zip codes directly. For other models, we may categorize zip codes into smaller number of meaningful groups. For example, we can categorize zip code based on their credit card fraud rates when building fraud dectection model. The following query shows card_id, home_zip and fraud indicator(fraud, 0 normal transaction) of a credit card transaction table.

SQL> select card_id, home_zip, is_fraud from CARD_TXN;

---------- ---------------- ----------
     12345 012320                    0
     45678 012456                    1
Based on the above table, we can categorize zip code by following the following two steps:
1. Calculate the fraud rate for each zip code in the training data set. For zip code with small number of transactions, the fraud rate will be the overall fraud rate for the whole population.
2. Divide zip codes into groups based on fraud rates. we can calculate fraud rate for each home_zip using the following scripts.
create or replace view v_zip_fraud_rate
tbl_overall_fraud_r as
(select sum(is_fraud)/count(*) fraud_rate from CARD_TXN ),
tbl_zip_fraud_r as
(select home_zip, count(*) tot_num_txns,
sum(is_fraud) tot_num_fraud, sum(is_fraud)/count(*) fraud_rate
from CARD_TXN group by home_zip )
select home_zip, tot_num_txns, tot_num_fraud,
       case when tot_num_txns<50 then b.fraud_rate
            else b.fraud_rate
       end fraud_rate
from tbl_zip_fraud_r a, tbl_overall_fraud_r b;

SQL> select * from V_ZIP_FRAUD_RATE;

---------------- ------------ ------------- ----------
012320                   2000            10         .005
012456                   1000             8         .008
012345                     23             0         .006
In the above example, zip code "012345" has only no fraud transactions with only 23 transactions in total. We assign the overal average fraud rate to it. In the same fashion, we can categorize zip code based on click through rate, response rate, income, etc.

We can easily group zip code based on fraud rate. For example, we can use "case when" queries as shown below. After we convert the zip code into a small number of categroies, we can use it in a predicitve models(such as a logistic regression model). Again, it is not nessary to categorize zip codes when using decision tree based models.
select HOME_ZIP,
   case when fraud_rate<0.001 then 1
        when fraud_rate<0.003 then 2
        when fraud_rate<0.005 then 3
        when fraud_rate<0.007 then 4
        when fraud_rate<0.008 then 5
        else 6 end zip_segment

No comments: