Tuesday, January 01, 2013

the First principle of data analytics is to avoid making mistakes (continued)

Mistakes can happen when we deploy the models that work well in our "lab environment" into production. We need to anticipate unseen situations ahead of time and deal with them. For examples, the following are some of the common challenges:

1. Unseen categorical variable values. For example, in the training data, state variable does not contain value RI. However, in the production, RI appears in the data. One way to deal with this is to assign all unseen codes to the most frequent category in our model scripts. That way, the model will run as we expect.

2. Continuous variables out of boundaries. For example, in the training data, the credit card purchase amount is within a range of $0.01 and $22,000. However, in production, the purchase amount could be $56,000. If we do not handle purchase amount properly, extreme scores will be generated by the model for transactions with purchase amount of $56,000. It is a good idea to winsorize the variable. "Winsorize" is a fancy word for clipping the extreme values.

No comments: