Saturday, December 22, 2012

The first principle of data analytics is to avoid making mistakes (continued)

Mistakes may happen when data or processes flow from one system to another. Sometimes, mistakes are caused by flaws in the analytic methodologies themselves.

One common mistake is to think in terms of absolute numbers rather than ratios. For example, analysts see a few hundred fraudulent credit card transactions at a particular online retailer's website and conclude that the retailer is very risky. This may not be true: if the retailer has tens of millions of normal transactions within the same period, the ratio of fraudulent transactions is very low. We have seen on a number of occasions that people use the number of frauds, instead of the fraud rate, as the measure of risk. As a result, they create fraud detection rules that generate a lot of false alarms.
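As a concrete illustration, the fraud rate (rather than the fraud count) can be computed directly in SQL. The following is a minimal sketch; TRANSACTIONS, RETAILER_ID and IS_FRAUD (1 for fraud, 0 otherwise) are hypothetical names used only for illustration:

-- fraud rate, not fraud count, per retailer (hypothetical table and columns)
select retailer_id,
       count(*) as total_txns,
       sum(is_fraud) as fraud_txns,
       round(sum(is_fraud) / count(*), 6) as fraud_rate
  from transactions
 group by retailer_id
 order by fraud_rate desc;

A retailer with a few hundred frauds but tens of millions of transactions will appear near the bottom of this list, not the top.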

Friday, December 21, 2012

The first principle of data analytics is to avoid making mistakes

In the previous post How to build predictive models that win competitions, we showed examples of actual mistakes made by statisticians/analysts. It is obvious that delivering something wrong will cause irrecoverable damage to our credibility. Mistakes can happen particularly when data or processes flow from one system to another:

1. When data files are provided by the clients. Customers make mistakes too. Their database people may use the wrong scripts and fail to extract the data that we want.

2. When data are loaded from text files into databases or other systems. A lot of things can go wrong during data loading: issues like missing value representation, number formats, date formats, unrecognized characters, etc. will surface (see the validation sketch after this list). Thus, if possible, using text files to store data should be avoided.

3. When reports are generated. Many analysts perform data analytics work using tools like SAS, R, SPSS, S-PLUS, Oracle Data Mining, etc. Then they summarize the results manually in Excel spreadsheets and make the final PowerPoint presentation. Things can go wrong here: decimal points are placed in the wrong positions; dates are not updated.

4. When models are deployed into production systems. There are a lot of moving parts involved here. It is a challenge to precisely replicate the off-line data analytics process in a different production system.
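For the data loading issues mentioned in point 2 above, a few simple reconciliation queries after each load can catch most problems. The following is a minimal sketch; CUSTOMER_STG and its columns are hypothetical names used only for illustration:

-- row count to compare against the line count of the source text file
select count(*) as loaded_rows from customer_stg;

-- values that failed to parse or were silently turned into nulls
select count(*) as missing_birth_dates from customer_stg where birth_date is null;

-- quick sanity check on numeric ranges
select min(balance) as min_balance, max(balance) as max_balance from customer_stg;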

Tuesday, December 18, 2012

How to build predictive models that win competitions

The predictive models that we produced won head-to-head competitions and were chosen by clients. For example, our model predicting the risk that a new customer will not pay his/her phone bill was used by a top 3 cell phone company. Our debit card fraud detection model is being used by a top 15 bank. Our competitors included one of the three credit bureaus, which had hundreds of statisticians working for it.

We have found that if we follow a number of principles, we will be able to produce good predictive models and quality data analytics work in general.

The first principle is to avoid making mistakes. We have seen many cases where mistakes damaged the reputation and credibility of data analysts. For example, statisticians analyzed data and drew the conclusion that compromised credit cards were less likely to show fraudulent activities than normal cards. It was a mistake, of course. One of our competitors loaded a customer's data incorrectly and produced reports where the numbers did not make sense at all. Mistakes like these cause immediate rejection by clients and permanent damage to analysts' reputations.

We need to realize that avoiding mistakes should be an inherent part of our processes. In a large project, from data gathering, loading, validating, and summarizing to model building, report generation and model deployment, we may have to take many steps and produce hundreds of data sets. To avoid making mistakes, we need to double-check our results. We have found that it is actually much harder to verify that the results are correct than to simply execute the steps that produce them.

Thus, we always perform two tasks: 1. produce the result; 2. verify the result. It is better to spend more time producing correct results than to quickly deliver something wrong that causes irrecoverable damage to our credibility. We will talk more about avoiding mistakes in the post The first principle of data analytics is to avoid making mistakes.
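As a minimal sketch of the "verify" step, assuming a hypothetical detail table TXN_DETAIL and a summary table TXN_SUMMARY derived from it, we can recompute a total independently and compare:

-- the difference should be zero; anything else signals a mistake somewhere in the pipeline
select (select sum(amount) from txn_detail)
     - (select sum(total_amount) from txn_summary) as amount_diff
  from dual;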

Friday, December 14, 2012

Oracle NTILE function

The Oracle NTILE function divides the records into buckets containing the same number of rows (off by at most 1). Please notice that records with the same variable values (ties) may be placed into different buckets. The following query result shows that the maximum value in a bucket may overlap with the minimum value in the next bucket.

SQL> with tbl as (select ntile(5) over(order by price) nt, price from STOCK_PRICE)
     select nt, count(1), min(price), max(price) from tbl group by nt order by nt;

        NT   COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
         1      36387       9.58      13.68
         2      36387      13.68      16.72
         3      36386      16.72      20.87
         4      36386      20.87      26.87
         5      36386      26.87      89.53

We can also use the CUME_DIST function combined with the CEIL function to divide the records into buckets with a similar number of rows. Ties will be placed into the same bucket. The following query result shows that there is no overlap of values between different buckets.

SQL> with tbl as (select ceil(cume_dist() over(order by price nulls first)*5) nt, price from STOCK_PRICE)
     select nt, count(1), min(price), max(price) from tbl group by nt order by nt;

        NT   COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
         1      36358       9.58     13.675
         2      36332      13.68      16.71
         3      36413      16.72      20.86
         4      36429      20.87     26.865
         5      36400      26.87      89.53

Tuesday, December 11, 2012

More on Data Preparation for Logistic Regression Models

In the earlier post Data preparation for building logistic regression models, we talked about converting data into a more compact format so that we can use memory-intensive software like R to handle a large number of cases.

Another issue that we commonly face is how to deal with "unbalanced" data. The following are some examples:
1. Only 7 to 12 out of 10,000 transactions are fraudulent.
2. Online ads have only 1 to 3 clicks per 10,000 impressions.
3. The mobile phone account non-payment rate is 12%.
4. It is assumed that 6% of medical claims are fraudulent.

To reduce data volume without sacrificing model accuracy, it makes sense to reduce the cases in the majority classes (good card transactions, non-click impressions, good mobile phone accounts, normal medical claims) through random sampling. For example, for 1,000 fraudulent bank card transactions gathered within a certain period, there may be 10 million good transactions from the same period. We can use the approach described in an earlier post, More on random sampling in Oracle. However, the final model training data sets do NOT have to be perfectly balanced, i.e., fraud and good transactions do NOT have to be 50% and 50%. It is OK, for example, for the training data set to have 80% good transactions and 20% fraudulent ones. Suppose we build two logistic regression models:

Model A. training set: 50% good, 50% fraud
Model B. training set: 80% good, 20% fraud

Models A and B will produce different probabilities of being fraud for an individual transaction (the probability produced by Model A will most likely be higher than that produced by Model B). However, the relative ranks of the transactions' probabilities of being fraud over a whole data set could be the same for both models. For example, the top 1% riskiest transactions identified by both models are the same. Thus both models are equivalent for ranking purposes. In many applications, what matters is the relative ranks (of being fraud, of non-payment, etc.) produced by the predictive models. The quality/usefulness of the relative ranks can be depicted by gain charts.
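One possible way to do the downsampling in Oracle is sketched below. It is only an illustration, not necessarily the approach in the random sampling post mentioned above; TXN and its IS_FRAUD flag are hypothetical names:

-- keep every fraudulent transaction and roughly 2% of the good ones
create table txn_train as
select * from txn where is_fraud = 1
union all
select * from txn where is_fraud = 0 and dbms_random.value < 0.02;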

Sunday, December 02, 2012

Build a Scorecard in Oracle Using Logistic Regression Model

One of the popular methods of assessing credit risk is a scorecard. To build a scorecard manually, analysts first identify a number of relevant variables, e.g., having bankruptcies in the past, number of open accounts, etc. Each variable is divided into meaningful segments based on its value. Then a point value is assigned to each segment. The credit score is the sum of the points for all the variables. For example, a scorecard may look like the following:

Variable | Segment | Point
Having Bankruptcy | NO | 95
Having Bankruptcy | YES | 15
Number of Open Accounts | from 1 to 3 | 30
Number of Open Accounts | from 4 to 8 | 50
Number of Open Accounts | 9 or more | 40
...................................................

Assume a person's credit history shows a past bankruptcy, 4 open accounts and other information. His credit score will be calculated as 15 + 50 + ... (plus the points for the other variables).
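To make the calculation concrete, a scorecard like the one above can be applied in SQL with CASE expressions. This is only a sketch; APPLICANT and its columns are hypothetical, and only the two variables shown above are included:

-- points taken from the sample scorecard above (hypothetical table and columns)
select applicant_id,
       case when having_bankruptcy = 'YES' then 15 else 95 end
     + case when num_open_accounts between 1 and 3 then 30
            when num_open_accounts between 4 and 8 then 50
            when num_open_accounts >= 9 then 40
            else 0 end as credit_score
  from applicant;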

Determining the points manually is subjective. A much better way is to build a logistic regression model and use the model coefficients as the "points". A scorecard created this way is accurate and yet still provides the transparency of a scorecard. One of the top cell phone service providers actually used our model-derived scorecard to screen new customers.

In example 2 of the post Build Predictive Models Using PL/SQL, we showed how to build a logistic regression model in Oracle. To extract the model coefficients, we simply use the following query:

select attribute_name, attribute_value, coefficient from table(select dbms_data_mining.get_model_details_glm('GLM_MODEL') from dual);
(replace 'GLM_MODEL' with the real model name)

We may want to apply some transformation to the coefficients so that the scores fall within the desired range.
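For example, one simple option (a sketch only, not necessarily the transformation we used) is a linear rescaling of each coefficient; the offset of 100 and factor of 50 below are placeholders to be tuned so that the total scores fall in the desired range:

-- 'GLM_MODEL', the offset 100 and the factor 50 are placeholders
select attribute_name,
       attribute_value,
       round(100 + 50 * coefficient) as points
  from table(dbms_data_mining.get_model_details_glm('GLM_MODEL'));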