Thursday, March 07, 2013

Watch out NULL values when comparing data

It is a very common task to compare data values. For example, I was involved in project where we  upgraded the scoring engine. We wanted to make sure the old and new scoring engines produce the same outputs given the same inputs. I use the following table to illustrate the problem. We want to make sure value_old and value_new are the same. (The blanks are NULL values.)

        ID  VALUE_OLD  VALUE_NEW
---------- ---------- ----------
         1        234
         2                   567
         3        789        789

If we simply use the following query to count the number of discrepancies, the result will return zero. This is not what we expect.
select count(*) from tbl_data_a where VALUE_OLD < > VALUE_NEW;

  COUNT(*)
----------
         0
This is because rows with NULL values appearing in the comparison are ignored.

A better approach is to write a query considering all of the following five situations:
In the following cases, VALUE_OLD and VALUE_NEW are the same.
1. VALUE_OLD is null, and VALUE_NEW is null.
2.VALUE_OLD is not null, VALUE_NEW is not null and VALUE_OLD=VALUE_NEW.

In the following cases, VALUE_OLD and VALUE_NEW are the different.

3.VALUE_OLD is null, and VALUE_NEW is not null.
4.VALUE_OLD is not null, and VALUE_NEW is null.
5.VALUE_OLD is not null, VALUE_NEW is not null and VALUE_OLD < > VALUE_NEW.

Monday, March 04, 2013

Custom model score vs credit bureau score

In previous post, we compare the performance of different types of the models applied to the same data set. In reality, a more frequently encountered  issue is to compare the performance of credit bureau scores and custom model scores.  By custom model scores, we mean the model that is built based on  client's own historical data. I have done numerous predictive modeling projects in the area across industries. My conclusion is that custom model scores are almost always better than generic credit bureau scores by large margins. The following gain charts are from a real project. For example, if we reject worst 20% customers based on their bureau scores, we can stop 26% of the loan default.  If we reject 20% customers based on custom model scores, almost 40% of the loan default can be stopped. The can easily translate into big savings for a company with large number of customers.



The patterns are common. Thus, it is worthwhile to build a custom model that will almost always outperform generic credit bureau scores as long as the client has enough historical data.