Thursday, January 03, 2013
The first principle of data analytics is to avoid making mistakes (continued)
The importance of avoiding mistakes in data mining (or in any work in general) can never be overemphasized. When mistakes happen, wrong (and sometimes ridiculous) conclusions are drawn and the credibility of analysts is severely damaged. One common source of mistakes is survivor bias. The following are some examples.

1. When we study the average 10-year stock return, we collect the price history of stocks that are currently on the market. However, because many companies went under during those 10 years and their data are not included in the analysis, the calculated average return is higher than the actual return.

2. Suppose it is discovered that a data breach 12 months ago affected some credit cards of a card issuer. To measure the impact of the breach, analysts take the currently active credit cards, identify those affected by the breach, and measure their fraud rate. They may find that the fraud rate is surprisingly low. This is because the cards for which cardholders reported fraudulent activity have already been closed and purged from the active portfolio.

Stocks of companies that went under, accounts closed due to nonpayment, credit cards closed due to fraud, churned customers, and so on are data corpses. To avoid survivor bias, it is important to collect the complete data, including those data corpses. This is easier said than done, because in reality data corpses are regularly removed and are hard to collect.
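The two examples above can be made concrete with a small sketch. Everything in it is made up for illustration: the tickers, the returns, the card counts, and the convention that a company that went under is recorded with a 10-year return of -1.0 (a total loss). It only shows how the same calculation gives a different answer depending on whether the data corpses are included.

```python
from statistics import mean

# Hypothetical 10-year total returns. Companies that went under are
# assumed to be a total loss (-1.0) and are flagged listed=False.
stocks = [
    {"ticker": "AAA", "ret": 0.80, "listed": True},
    {"ticker": "BBB", "ret": 0.40, "listed": True},
    {"ticker": "CCC", "ret": 0.10, "listed": True},
    {"ticker": "DDD", "ret": -1.00, "listed": False},
    {"ticker": "EEE", "ret": -1.00, "listed": False},
]

# Biased: only stocks still on the market today.
survivor_avg = mean(s["ret"] for s in stocks if s["listed"])
# Unbiased: every stock that existed at the start of the period.
full_avg = mean(s["ret"] for s in stocks)

# Hypothetical cards exposed in the breach: (active_today, had_fraud).
breached_cards = (
    [(True, False)] * 90   # still active, no fraud reported
    + [(True, True)] * 1   # still active, fraud reported
    + [(False, True)] * 9  # closed after reported fraud (data corpses)
)


def fraud_rate(cards):
    """Share of cards in the given collection with reported fraud."""
    cards = list(cards)
    return sum(1 for _, had_fraud in cards if had_fraud) / len(cards)


# Biased: only cards in the current active portfolio.
active_rate = fraud_rate(c for c in breached_cards if c[0])
# Unbiased: all cards that were exposed, including closed ones.
full_rate = fraud_rate(breached_cards)

print(f"Average return, survivors only: {survivor_avg:.1%}")
print(f"Average return, all stocks:     {full_avg:.1%}")
print(f"Fraud rate, active cards only:  {active_rate:.1%}")
print(f"Fraud rate, all breached cards: {full_rate:.1%}")
```

With these made-up numbers, the survivors-only average return is positive while the average over all stocks is negative, and the fraud rate measured on active cards is far below the rate measured on all breached cards. In practice the hard part is not the arithmetic but obtaining the rows flagged as delisted or closed in the first place.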