Saturday, December 22, 2012

the First principle of data analytics is to avoid making mistakes (continued)

Mistakes may happen when data or processes flow from one system to another. Sometimes, mistakes are caused by the flaws in the analytic methodologies themselves.

One common mistake is to think in terms of absolute numbers rather than ratios. For example, analysts see a few hundred fraudulent credit card transactions at a particular online retailer's website and conclude that the retailer is very risky. This may not be true. If the retailer has tens of millions of normal transactions within the same period, the ratio of fraudulent transactions is very low. We have seen on a number of occasions that people use the number of frauds, instead of the fraud rate, as the measure of risk. As a result, they create fraud detection rules that generate a lot of false alarms.
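A query along the lines of the following sketch keeps the comparison honest. The table and column names (tbl_card_txn, retailer_id, is_fraud) are hypothetical, used only to illustrate ranking risk by rate rather than by raw count.

-- A minimal sketch, assuming a hypothetical table tbl_card_txn(retailer_id, is_fraud)
-- where is_fraud is 'Y' or 'N'. Risk is ranked by fraud rate, not by raw fraud count.
select retailer_id,
       count(1) total_txn,
       sum(case when is_fraud='Y' then 1 else 0 end) fraud_txn,
       sum(case when is_fraud='Y' then 1 else 0 end)/count(1) fraud_rate
from tbl_card_txn
group by retailer_id
order by fraud_rate desc;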

Friday, December 21, 2012

the First principle of data analytics is to avoid making mistakes

In a previous post, How to build predictive models that win competitions, we showed examples of actual mistakes made by statisticians/analysts. It is obvious that delivering something wrong will cause irrecoverable damage to credibility. Mistakes can happen particularly when data or processes flow from one system to another:

1. When data files are provided by the clients. Customers make mistakes too. Their database people may use the wrong scripts and fail to extract the data that we want.

2. When data are loaded from text files into databases or other systems. A lot of things can go wrong during data loading. Issues like missing value representation, number formats, date formats, unrecognized characters, etc. will surface (see the sketch at the end of this list). Thus, if possible, using text files to store data should be avoided.

3. When reports are generated. Many analysts perform data analytics work using tools like SAS, R, SPSS, S-Plus, Oracle Data Mining, etc. Then they summarize the results manually in an Excel spreadsheet and make the final PowerPoint presentation. Things can go wrong here: decimal points are placed in the wrong positions, dates are not updated, and so on.

4. When models are deployed into production systems. There are a lot of moving parts involved here. It is a challenge to precisely replicate the off-line data analytics process in a different production system.
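As a small illustration of point 2, a few sanity checks immediately after loading a text file can surface most of these issues. The staging table and columns below are hypothetical; this is a sketch, not a complete validation process.

-- A minimal sketch, assuming a hypothetical staging table tbl_stage_txn just loaded from a text file.
-- Compare total_rows with the line count of the source file and eyeball the date range and null counts.
select count(1) total_rows,
       count(txn_date) non_null_dates,
       min(txn_date) min_date,
       max(txn_date) max_date,
       sum(case when tran_amt is null then 1 else 0 end) null_amounts
from tbl_stage_txn;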

Tuesday, December 18, 2012

How to build predictive models that win competitions.

The predictive models that we produced won head-to-head competitions and were chosen by clients. For example, our model predicting the risk that a new customer will not pay his/her phone bill was used by a top-3 cell phone company. Our debit card fraud detection model is being used by a top-15 bank. Our competitors included one of the three credit bureaus, which had hundreds of statisticians working for it.

We have found that if we follow a number of principles, we will be able to produce good predictive models and quality data analytics work in general.

The first principle is to avoid making mistakes. We have seen many cases where mistakes damage the reputation and credibility of data analysts. For example, statisticians analyzed data and drew the conclusion that compromised credit cards were less likely to show fraudulent activities than normal cards. It was a mistake, of course. One of our competitors loaded a customer's data incorrectly and produced reports where the numbers did not make sense at all. Mistakes like those cause immediate rejection from clients and permanent damage to analysts' reputations.

We need to realize that avoiding mistakes should be an inherent part of our processes. In a large project, from data gathering, loading, validation, summarization and model building to report generation and model deployment, we may have to take many steps and produce hundreds of data sets. To avoid making mistakes, we need to double check our results. We have found that it is actually much harder to verify that results are correct than to simply execute the steps that produce them.

Thus, we always perform two tasks: 1. produce the result; 2. verify the result. It is better to spend more time producing correct results than to quickly deliver something wrong that causes irrecoverable damage to credibility. We will talk more about avoiding mistakes in the post The First principle of data analytics is to avoid making mistakes.

Friday, December 14, 2012

Oracle NTILE function

The Oracle ntile function divides records into buckets containing the same number of rows (off by at most 1). Please notice that records with the same variable values (ties) may be placed into different buckets. The following query result shows that the maximum value in a bucket may overlap with the minimum value in the next bucket.

SQL> with tbl as (select ntile(5) over(order by price) nt, price from STOCK_PRICE) select nt, count(1), min(price), max(price) from tbl group by nt
 order by nt;

        NT   COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
         1      36387       9.58      13.68
         2      36387      13.68      16.72
         3      36386      16.72      20.87
         4      36386      20.87      26.87
         5      36386      26.87      89.53

We can also use the cume_dist function combined with the ceil function to divide records into buckets of similar size. Ties will be placed into the same bucket. The following query result shows that there is no overlap of values between different buckets.

SQL> with tbl as (select ceil(cume_dist() over(order by price nulls first)*5)  nt, price from MV_VIX_15S_AVG) select nt, count(1), min(price), max(price) from tbl group by nt order by nt;

        NT   COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
         1      36358       9.58     13.675
         2      36332      13.68      16.71
         3      36413      16.72      20.86
         4      36429      20.87     26.865
         5      36400      26.87      89.53

Tuesday, December 11, 2012

More on Data Preparation for Logistic Regression Models

In the earlier post Data preparation for building logistic regression models, we talked about converting data into a more compact format so that we can use memory-intensive software like R to handle a large number of cases.

Another issue that we commonly face is how to deal with "unbalanced" data. The following are some examples:
1. Only 7 to 12 fraudulent transactions out of 10,000.
2. Online ads have only 1 to 3 clicks per 10,000 impressions.
3. The mobile phone account non-payment rate is 12%.
4. It is assumed that 6% of medical claims are fraudulent.

For the sake of reducing data volume without sacrificing model accuracy, it makes sense to reduce the number of cases in the majority classes (good card transactions, non-click impressions, good mobile phone accounts, normal medical claims) through random sampling. For example, if we gather 1,000 fraudulent bank card transactions within a certain period, there may be 10 million good transactions from the same period. We can use the approach described in an earlier post, More on random sampling in Oracle (a sketch appears at the end of this post). However, the final model training data sets do NOT have to be perfectly balanced, i.e., fraud and good transactions do NOT have to be 50% and 50%. For example, it is OK for the model training data set to have 80% good transactions and 20% fraudulent ones. For example, we can build two logistic regression models:

Model A. training set: 50% good, 50% fraud
Model B. training set: 80% good, 20% fraud

Models A and B will produce different probabilities of being fraud for an individual transaction (the probability produced by Model A will most likely be higher than that produced by Model B). However, the relative ranks of transactions' probabilities of being fraud over a whole data set could be the same for both models. For example, the top 1% riskiest transactions identified by both models are the same. Thus the two models are equivalent for practical purposes. In many applications, what matters is the relative rank (of being fraud, being non-payment, etc.) produced by the predictive models. The quality/usefulness of relative ranks can be depicted by gain charts.
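Returning to the sampling step mentioned above, the sketch below shows one way to build a downsampled training set: keep every fraudulent transaction and a random fraction of the good ones. The table name, the fraud flag and the 1% sampling rate are assumptions for illustration only.

-- A minimal sketch, assuming a hypothetical table tbl_card_txn(txn_id, is_fraud, ...).
-- Keep all fraudulent transactions and roughly 1% of the good ones.
create table tbl_model_training as
select * from tbl_card_txn
where is_fraud='Y'
   or (is_fraud='N' and dbms_random.value < 0.01);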

Sunday, December 02, 2012

Build a Scorecard in Oracle Using Logistic Regression Model

One of the popular methods for assessing credit risk is a scorecard. To build a scorecard manually, analysts first identify a number of relevant variables, e.g., having bankruptcies in the past, number of open accounts, etc. Each variable is divided into meaningful segments based on its value. Then a point value is assigned to each segment. The credit score is the sum of the points for all the variables. For example, a scorecard may look like the following:

Variable | Segment | Point
Having Bankruptcy | NO | 95
Having Bankruptcy | YES | 15
Number of Open Accounts | from 1 to 3 | 30
Number of Open Accounts | from 4 to 8 | 50
Number of Open Accounts | 9 or more | 40
...................................................

Assume a person's credit history shows past bankruptcy, 4 open accounts and other information. His credit score will be calculated as 15+50.....

Determining the points manually is subjective. A much better way is to build a logistic regression model and use the model coefficients as the "points". A scorecard created this way is accurate and yet still provides the transparency of a scorecard. One of the top cell phone service providers actually used our model-derived scorecard to screen new customers.

In example 2 of the post Build Predictive Models Using PL/SQL, we show how to build a logistic regression model in Oracle. To extract the model coefficients, we simply use the following query:

select attribute_name, attribute_value, coefficient from table(select dbms_data_mining.get_model_details_glm('GLM_MODEL') from dual);
(Replace 'GLM_MODEL' with the real model name.)

We may want to perform some transformation of the coefficients so that the score will be within the desired range.
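For example, a simple linear scaling can be applied directly in the coefficient query. The scaling factor below is an assumption chosen only for illustration; the model intercept can be folded into a base score.

-- A minimal sketch, reusing the coefficient query above.
-- The factor -40 is an assumption; replace 'GLM_MODEL' with the real model name.
select attribute_name,
       attribute_value,
       round(coefficient * -40) scorecard_point
from table(select dbms_data_mining.get_model_details_glm('GLM_MODEL') from dual);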

Thursday, November 22, 2012

A Real World Case Study: Business Rule vs Predictive Model

The following is a true story to complement earlier posts Comparison of Business Rules and Predictive Models and Predictive Modeling vs Intuitive Business Rules .

A few years ago, we built a new customer acquisition model for a cell phone service provider based on its historical application and payment data. The model calculated a risk score for each cell phone service applicant using information found in his/her credit reports. The higher the score, the higher the risk that a customer will not pay his/her bill.

A few weeks after the model went into production, we received an angry email from a manager at the client company. In the email, the manager gave a list of applicants who had several bankruptcies. According to the manager, they should be high-risk customers. However, our model gave them average risk scores. He questioned the validity of the model.

We explained that the model score was based on 20 or so variables, not bankruptcies alone. We also analyzed people with bankruptcies in the data that we used to build the model. We found that they paid their bills on time. It might be that people with bankruptcies are more mobile and thus depend more on cell phones for communication. They may not be good candidates for a mortgage, but from a cell phone service provider's perspective, they are good customers.

This is the bottom line. Data-driven predictive models are more trustworthy than intuition-driven business rules.

Wednesday, November 07, 2012

From Oracle SQL Developer to Data Miner

As mentioned in an earlier post, How easy is it for a SQL developer/analyst to become a data miner?, if we consider data mining as just more sophisticated SQL queries, data mining is simply a natural extension of a SQL programmer/analyst's daily job, i.e., querying the database. For example, a SQL programmer can write a query to answer a business question: find people living in Boston, MA with an annual income of at least $50K and a credit score of 750 or higher. One simply puts the three conditions in the "where" clause as follows:

select person_id, person_first_name, person_last_name
from tbl_data_about_people
where
home_city='Boston' and home_state='MA' and
annual_income>50000 and
credit_score >=750;


Similarly, we can put a predictive function in the "where" clause to answer a more useful question: find people living in Boston, MA who are at least 25% likely to purchase a new car within 2 months, as follows:
select person_id, person_first_name, person_last_name
from tbl_data_about_people
where
home_city='Boston' and home_state='MA' and
prediction_probability(model_likely_to_buy, 1 using *)>=0.25;


We can see how easy and powerful it is to use predictive models in SQL. A traditional Oracle SQL developer can greatly enhance his/her value by picking up these data mining functions. There are a number of good documents and sample codes available on Oracle's site, e.g.,
The Data Mining Sample Programs
Oracle® Data Mining Concepts

Tuesday, November 06, 2012

Major Data Mining Steps

Normally, we need to go through the following steps to build a predictive modeling solution.

  1. Data Gathering 
  2. Data Validation 
  3. Data Preparation 
  4. Feature Variable Calculation (creating more salient variables that are more predictive of target)
  5. Predictive Model Building and Testing 
  6. Model Deployment
In our opinion, the main challenges are Data Preparation, Feature Variable Calculation and Model Deployment. Probably 90% of the time is spent on these three areas.

Saturday, October 27, 2012

Take Advantage of Materialized View

The materialized view is one of the most useful features for data processing in Oracle. We would suggest that materialized views be used when possible.

To create a materialized view, we use "create materialized view as select.. ", just as when we create a view or table. For example, the following SQL script creates a materialized view that calculates the following bank card summary information for each retailer terminal: the number of transactions, amount of transactions and number of unique cards.

create materialized view mv_retailer_terminal_sts
as
select
terminal_id ,
retailer_id ,
terminal_city,
terminal_state ,
terminal_country,
count(1) number_of_txn ,
sum(tran_amt) amount_of_transaction,
count(distinct card_key) number_of_cards
from credit_card_txn_detail a
group by
terminal_id ,
retailer_id ,
terminal_city,
terminal_state ,
terminal_country;

The materialized view captures a snapshot of the "from" tables. It can be refreshed later manually on demand or based on a defined schedule.

For example, we can refresh a materialized view manually using the following command.
execute DBMS_MVIEW.REFRESH('mv_retailer_terminal_sts','C');
The 'C' argument means a complete refresh.
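The refresh can also be put on a schedule instead of being run manually. The nightly interval below is an assumption for illustration.

-- A minimal sketch: switch the materialized view above to a scheduled complete refresh,
-- starting now and repeating once a day.
alter materialized view mv_retailer_terminal_sts
refresh complete
start with sysdate next sysdate + 1;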

One of the advantages of using materialized view is that the SQL query used to produce it can be retrieved using the following query:
select query from user_mviews where mview_name=upper('mv_retailer_terminal_sts');

The difference between a view and a materialized view is that a materialized view creates a physical table. Thus a materialized view is faster to query than a view. We suggest that materialized views be used when possible in place of plain tables.

Sunday, October 21, 2012

Oracle Normal Distribution Function

The Oracle function dbms_random.normal generates numbers from the standard normal distribution, with zero mean and a standard deviation of one.
SQL> select dbms_random.normal from dual;
NORMAL
----------
.781195382
The following query generates 6,000 random numbers using dbms_random.normal and then calculates their mean and standard deviation. We can see that the mean is close to zero and the standard deviation is close to one.
SQL> select avg(v), stddev(v), count(1) from (select dbms_random.normal v from V_6K_OBS);
AVG(V) STDDEV(V) COUNT(1)
---------- ---------- ----------
-.00110804 1.00954138 6000

Wednesday, October 03, 2012

More on random sampling in Oracle

The earlier post Random Sampling using SQL uses dbms_random.random to generate random numbers. A better function is dbms_random.value, which generates a uniformly distributed random number greater than or equal to 0 and less than 1.
We can rewrite random sampling scripts as the following.
Step 1 . Create random numbers.
create table tbl_app_id_w_rnd as select app_id, dbms_random.value rnd from CELL_PHONE_SERVICE_APPS;
Step 2. Create a view that contains 30% of the records.
create view v_data_testing as select a.* from CELL_PHONE_SERVICE_APPS a, tbl_app_id_w_rnd b where a.app_id=b.app_id and rnd <=0.3;

Saturday, September 29, 2012

Find the smallest or largest value in SQL

In Oracle we can use the least function to find the smallest among many values. The number of input arguments is flexible. Similarly, to find the largest among many values, we use the greatest function. The following are examples.
select least(0,1,2,3,4,4,5,6,7,9) from dual;

--------------------------
                         0
select greatest(0,1,2,3,4,4,5,6,7,9) from dual;
-----------------------------
                         9
The following example shows that the least/greatest functions take multiple columns as inputs and return a result for each row. The min/max functions take only one column as a parameter and return a single result over all rows.
select p0, p1,p2, least(p0,p1,p2), greatest(p0,p1,p2) from TBL_TEST_DATA where rownum < 10;

        P0         P1         P2 LEAST(P0,P1,P2) GREATEST(P0,P1,P2)
---------- ---------- ---------- --------------- ------------------
   1086.62    1089.03    1084.38         1084.38            1089.03
   1214.19    1213.58    1200.94         1200.94            1214.19
   1268.42    1263.25    1257.68         1257.68            1268.42
   1505.11    1511.73    1515.61         1505.11            1515.61
    758.84     764.72     753.37          753.37             764.72
   1521.24    1532.53    1528.63         1521.24            1532.53
   1184.88    1181.19    1185.01         1181.19            1185.01
   1260.55    1266.51    1273.23         1260.55            1273.23
    1483.4    1484.26     1469.9          1469.9            1484.26
select min(p0), min(p1), min(p2),  max(p0), max(p1), max(p2)  from MV_SP_VIX_CORR where rownum < 10;

   MIN(P0)    MIN(P1)    MIN(P2)    MAX(P0)    MAX(P1)    MAX(P2)
---------- ---------- ---------- ---------- ---------- ----------
    758.84     764.72     753.37    1521.24    1532.53    1528.63


Thursday, September 20, 2012

Comparison of Business Rules and Predictive Models

In many organizations, there are two ways to support decision making, i.e., intuitive business rules and predictive models. Business analysts build intuitive rules. Statisticians/data modelers use data-driven methods, mainly predictive models. The following is a comparison of the two from 13 different aspects. You may want to check out an earlier post.
Examples
Business rules: Example 1. A medical fraud rule: treatments for pregnancy discomfort for a male patient are fraud. Example 2. A direct sale rule: customers who purchase product A within 3 months will likely buy product B. Example 3. A mobile phone customer acquisition rule: if a new customer has a bankruptcy in the past, do not offer him a free phone.
Predictive models: Example 1. A medical fraud detection model: an unsupervised model that detects anomalies in medical claims. Example 2. A credit card fraud detection model: a neural net that generates a fraud score for credit card transactions. Example 3. A customer acquisition model: a logistic regression model that predicts whether a new customer will pay her cell phone bill.

Owners
Business rules: Business analysts who have experience in particular fields such as bank risk management, healthcare, etc. In their communications, they use terms that are understood by business people. They believe that there are explainable cause-effect relationships in predicting fraud, etc. Their job is to identify those rules.
Predictive models: Statisticians/data miners, who see predictive models as solutions to a wide range of business problems. They believe that in reality the cause-effect relationship is too complex to be understood by human beings; a black-box, data-driven predictive model is the way to make the most accurate predictions.

Understandability
Business rules: High. Rules are normally intuitive and easy to understand.
Predictive models: Low. Most predictive models are black boxes.

Forms of existence
Business rules: SQL scripts, SAS scripts, Excel spreadsheets, Word documents, in people's heads and not documented, etc.
Predictive models: SQL scripts, SAS scripts, Excel spreadsheets, R objects, Oracle mining objects, etc.

Initial investment
Business rules: Low.
Predictive models: High. Need to build a data warehouse that archives historical data. May need to purchase data mining software such as SAS, Statistica, etc.

Effort to build
Business rules: Low. Analysts with domain knowledge can write rules without doing heavy data analysis work.
Predictive models: High. Data miners have to go through the process systematically to build predictive models.

Effort to deploy into production
Business rules: Low. Business rules are usually simple scripts such as SQL.
Predictive models: High. Predictive models, including the input variable calculations, can be very sophisticated. Deploying a model into production takes a significant amount of effort.

Quantity
Business rules: Many. For example, many banks or insurance companies have hundreds or thousands of business rules.
Predictive models: Usually very few. For example, even some large banks or insurance companies have only 2 or 3 predictive models to detect fraud.

Perception of productivity and effectiveness
Business rules: High. Because there are hundreds or thousands of understandable rules and new rules can be created on a daily basis, management gets the impression that the group that builds them is productive and effective.
Predictive models: Medium. It may take many days or weeks to build a predictive model before one can see any result. Because a predictive model is mostly a black box, it is hard to explain to management how it works. Thus, it is possible that management thinks the model building group is not productive.

Actual productivity and effectiveness
Business rules: Low to medium. See an earlier post. Since rules are usually created individually, there are redundancies among them, and the rules as a whole are not optimized.
Predictive models: Highly productive and effective. It is common to see a single predictive model outperform hundreds or thousands of intuitive business rules combined.

Tools
Business rules: Excel spreadsheets, SQL, special scripting languages, and others.
Predictive models: SAS, SPSS, R, Oracle Data Mining, and others.

Adaptability
Business rules: Low. Since rules are created manually by people based on their experience, they are harder to adapt to changing situations.
Predictive models: High. Predictive models are refreshed once new data become available, and the newly refreshed model reflects the changes automatically.

Maintenance
Business rules: Difficult. It is hard to maintain hundreds or thousands of rules. It is helpful to rank them based on their effectiveness.
Predictive models: Since the number of predictive models is small, it is easier to maintain them.

Longevity
Business rules: Low. Business rules usually lose their effectiveness more quickly than predictive models do after their creation.
Predictive models: High to medium. Predictive models maintain their effectiveness longer than business rules for two reasons: 1. they are built on large quantities of historical data; 2. they have the ability to generalize, meaning they do not simply memorize historical data but are structured to perform well on future data.

Friday, September 14, 2012

Data preparation for building logistic regression models

Logistic regression is a popular model that predicts a binary output, e.g., fraud or not, customer response or not, etc. When building a logistic regression model, there are two ways to represent the data. The first is the transactional format shown below. Each record is a single transaction and the target variable is a binary variable.
                Data in transactional format
      USER_ID CAT STATE_CD   CAMPAIGN_DATE CAMPAIGN_CD RESPONSE
      1001 A  MA       01-MAY-12 A          N
      1001 A  MA       08-MAY-12 A          N
      1001 A  MA       15-MAY-12 A          Y
      1001 A  MA       22-MAY-12 A          N
      1001 A  MA       29-MAY-12 A          N
      1001 A  MA       06-JUN-12 A          N
      1001 B  CT       06-JUN-12 A          N
      1002 B  CT       01-MAY-12 A          N
      1002 B  CT       08-MAY-12 A          N
      1002 B  CT       15-MAY-12 A          Y
      1002 B  CT       22-MAY-12 A          N
      1002 B  CT       29-MAY-12 A          Y

If all the independent variables are categorical, we can convert the data from the transactional format into a more compact one by summarizing it using a SQL script similar to the following. We count the numbers of responses and non-responses for each unique combination of independent variable values. For continuous variables, if we want, we can transform them into categorical ones using techniques like binning.

select cat, state_cd, campaign_cd,
sum(case when response='Y' then 1 else 0 end) num_response,
sum(case when response='N' then 1 else 0 end) num_no_response
from tbl_txn group by cat, state_cd, campaign_cd;
                 Data in the summary format
 CAT STATE_CD  CAMPAIGN_CD  NUM_RESPONSE  NUM_NO_RESPONSE
  A  MA       A        125       1025
  B  CT       C        75        2133
  ..........................

Summarizing the data first can greatly reduce the data size and save memory when building the model. This is particularly useful if we use memory-based modeling tools such as R.

If we use R to build the logistic regression model, the script for training data in the transactional format is similar to the following.
glm(formula=RESPONSE~CAT+STATE_CD+CAMPAIGN_CD,
  data=train.set1,family = binomial(link = "logit")) ->model1

The R script for building a logistic model based on the summary data is shown below.
glm(formula= cbind(NUM_RESPONSE,NUM_NO_RESPONSE) ~CAT+STATE_CD+CAMPAIGN_CD,
  data=train.set1,family = binomial(link = "logit")) ->model2

Relational database and data mining algorithms

Some algorithms may appear prohibitively expensive computationally. One such example is finding, for each data point in table A, the nearest K neighboring data points in table B. On the surface, it involves calculating the distances between 2 trillion pairs of data points (1 million times 2 million).

However, with database techniques such as B-tree indexes, we can make the above algorithm very efficient. Thus, it is helpful for algorithm people like mathematicians to understand some database technologies. With those techniques, we can implement algorithms that may otherwise be hard to do.
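The following one-dimensional sketch illustrates the idea. The tables tbl_a(id, x) and tbl_b(id, x), the window width and K are all assumptions; the point is that restricting candidate pairs to a narrow range lets an index on tbl_b(x) keep the join from exploding.

-- A minimal sketch: only pairs within a narrow window are compared, then ranked by distance.
select id_a, id_b, dist
from (
  select a.id id_a, b.id id_b, abs(a.x - b.x) dist,
         row_number() over(partition by a.id order by abs(a.x - b.x)) rn
  from tbl_a a, tbl_b b
  where b.x between a.x - 0.5 and a.x + 0.5   -- window width is an assumption
)
where rn <= 10;   -- K = 10 nearest neighbors per point in tbl_a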

Thursday, September 13, 2012

Fuzzy Match Data from Different Sources without Exact Keys

Merging data from different sources is probably one of the most challenging data preparation tasks. For example, debit card authorizations are captured in real time at the point of sale. However, the actual settlement of the transactions may take a few days. The settlement date, amount and merchant name could all be different from those in the authorization data. It is very challenging to match transactions in these two tables. The following two tables illustrate this issue.
Bank card authorizations.
CARD_NUMBER TXN_DATE       MERCHANT_NAME                   AMT
----------- -------------- ------------------------ ----------
     123457 20120304:11:00 WALMART                       85.35
     123457 20120304:17:30 ABC Restaurant                 56.2
     123457 20120305:08:25 ABC Restaurant                   23
     123457 20120305:08:36 WAL_MART                      135.2

Bank card settlements.
CARD_NUMBER TXN_DATE       MERCHANT_NAME                   AMT
----------- -------------- ------------------------ ----------
     123457 20120305:00:00 WAL**                         85.35
     123457 20120305:00:00 ABC                              70
     123457 20120306:00:00 WAL-M                         135.2
     123457 20120315:00:00 A** Restaurant                   29
To match authorizations and their settlements, simple table joining will obviously not work because there are no exact keys to link them. Probably the only way to do this is to use a fuzzy, iterative matching approach. We can measure three distances between an authorization and a settlement, i.e., merchant name distance, transaction amount difference and transaction date difference.
1. Merchant name distance can be measured using the following function.
100-utl_match.EDIT_DISTANCE_SIMILARITY(upper(a.merchant_name), upper( b.merchant_name))
For exact string matches, utl_match.EDIT_DISTANCE_SIMILARITY returns 100. For no matches, it returns zero.
2. Transaction amount difference is  abs(a.amt-b.amt) .
3. Transaction date difference is trunc(b.txn_date,'DDD')-a.txn_date
Once we calculate the above three distances for each pair of records in the authorization and settlement data sets, we can assign a weight to each of them and combine them into a single distance that measures the similarity between two records, i.e., record_distance = w1*merchant_name_distance + w2*amount_distance + w3*date_distance. We then identify the pairs having the smallest distances for each card. Those pairs are considered successful matches and are removed from the data. We repeat this matching process on the remaining data until we are happy with the results. This process can be fully automated. I have used this approach to merge about 50 million authorizations from 3 million debit cards. It only took about an hour.
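The combined distance can be computed in a single pass with a query similar to the sketch below. The table names tbl_card_auth and tbl_card_settle and the weights 1, 2 and 5 are assumptions for illustration only.

-- A minimal sketch, assuming hypothetical tables shaped like the examples above.
select a.card_number, a.txn_date auth_date, b.txn_date settle_date,
       1*(100 - utl_match.edit_distance_similarity(upper(a.merchant_name), upper(b.merchant_name)))
       + 2*abs(a.amt - b.amt)
       + 5*(trunc(b.txn_date,'DDD') - a.txn_date) record_distance
from tbl_card_auth a, tbl_card_settle b
where a.card_number = b.card_number;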

Recency, Frequency, Monetary (RFM) Analysis: Part 5. RFM Code

To calculate the monetary variable, we can use the analytic function sum(amount) over(). We can then combine the quintiles for recency, frequency and monetary to create an RFM code for each customer. This can be done using the following query. RFM code 555 is the best group of customers.


   USER_ID    CAT   STATE_ID CAMPAIGN_DATE CAMPAIGN_ID  RESPONSE     AMOUNT
---------- -- -------- --------- -------- -------- ----------
      1001 A  MA       29-MAY-12 A        N
      1001 A  MA       22-MAY-12 A        N
      1001 A  MA       15-MAY-12 A        Y                50
      1001 A  MA       08-MAY-12 A        N
      1001 A  MA       01-MAY-12 A        N
      1001 B  CT       06-JUN-12 A        N
      1001 A  MA       06-JUN-12 A        N
      1002 B  CT       08-MAY-12 A        N
      1002 B  CT       15-MAY-12 A        Y                50
      1002 B  CT       22-MAY-12 A        N
      1002 B  CT       29-MAY-12 A        Y                50
      1002 B  CT       01-MAY-12 A        N
with temp_tbl as
(
select 
user_id, 
min(sysdate-campaign_date) as recency,
count(1) frequency, 
sum(amount) as sum_amount
  from TBL_DIRECT_MAILING a where RESPONSE='Y'
group by user_id
),
temp_tbl2 as
(
select 
a.user_id,  
recency,
frequency, 
sum_amount from 
(select distinct user_id from TBL_DIRECT_MAILING) a,  
temp_tbl b
where a.user_id=b.user_id(+)
),
temp_tbl3 as
(
select user_id,
ceil(cume_dist() over(order by recency desc nulls first)*5) recency_quintile,
ceil(cume_dist() over(order by frequency nulls first)*5) frequency_quintile,
ceil(cume_dist() over(order by sum_amount  nulls first)*5) amt_quintile
from temp_tbl2)
select user_id, recency_quintile*100+frequency_quintile*10+amt_quintile as RFM_Code from
temp_tbl3 order by user_id;

The result may look like the following.

   USER_ID   RFM_CODE
---------- ----------
      1001        333
      1002        555
      1003        233

Recency, Frequency, Monetary (RFM) Analysis: Part 4

Just like we do for recency variable, we can divide customers into 5 or 10 equal size buckets, called quintiles or deciles, based on how frequently they respond. 
        USER_ID CAT STATE_ID       CAMPAIGN_DATE CAMPAIGN_ID RESPONSE
---------- -- -------- --------- -------- --------
      1001 A  MA       01-MAY-12 A        N
      1001 A  MA       08-MAY-12 A        N
      1001 A  MA       15-MAY-12 A        Y
      1001 A  MA       22-MAY-12 A        N
      1001 A  MA       29-MAY-12 A        N
      1001 A  MA       06-JUN-12 A        N
      1001 B  CT       06-JUN-12 A        N
      1002 B  CT       01-MAY-12 A        N
      1002 B  CT       08-MAY-12 A        N
      1002 B  CT       15-MAY-12 A        Y
      1002 B  CT       22-MAY-12 A        N
      1002 B  CT       29-MAY-12 A        Y
The following query calculates quintiles and deciles for each customer based on how frequently they responded to our offerings. The more frequently the responses, the higher the number for quintiles or deciles.
with temp_tbl as
(
select 
user_id, count(1) num_of_response  from TBL_DIRECT_MAILING a where RESPONSE='Y'
group by user_id
),
temp_tbl2 as
(
select a.user_id, num_of_response  from 
(select distinct user_id from TBL_DIRECT_MAILING) a,  
temp_tbl b
where a.user_id=b.user_id(+)
)
select user_id,
num_of_response,  
ceil(cume_dist() over(order by num_of_response nulls first)*5) frequency_quintile,
ceil(cume_dist() over(order by num_of_response nulls first)*10) frequency_decile
from temp_tbl2
order by user_id;

The output of the query looks like the following. 

   USER_ID NUM_OF_RESPONSE FREQUENCY_QUINTILE FREQUENCY_DECILE
--------- --------------- ------------------ ----------------
     1001               1                  3                5
     1002               2                  5                9
     1003               1                  3                5
     1004               2                  5                9

Recency, Frequency, Monetary (RFM) Analysis: Part 3

Frequency measures how many transactions happen within a certain period of time. Usually, the more frequently a customer responded to direct mailing offerings in the past, the more likely she will make additional purchases. In the case of credit card fraud, fraudsters tend to make more frequent purchases than normal consumers do. Frequency variables can be calculated using SQL analytic functions such as count() over().

Again we use the following credit card transactions as examples.
CARD_NUMBER     TXN_ID TXN_DATE   TXN_TYPE        TXN_AMT
----------- ---------- ---------- ------------ ----------
     123457          1 0304:11:00 Grocery           85.35
     123457          2 0304:17:30 Electronic      2100.75
     123457          3 0305:08:25 Gas Station           1
     123457          4 0305:08:36 Electronic       1435.2
     123457          5 0305:11:23 Grocery            55.3

The following query calculates the number of transactions in the past 6 hours (0.25 day) and 12 hours (0.5 day) before the current transaction.
select 
card_number, 
to_char(TXN_DATE,'YYYYMMDD:HH24:MI') txn_date, 
count(1) over(partition by CARD_NUMBER order by txn_date range between 0.25 preceding and 0.0000001 preceding) num_in_6h,
count(1) over(partition by CARD_NUMBER order by txn_date range between 0.5 preceding and 0.0000001 preceding) num_in_12h
from TBL_CARD_TXN   a
order by CARD_NUMBER, txn_date;


CARD_NUMBER TXN_DATE        NUM_IN_6H NUM_IN_12H
---------- -------------- ---------- ----------
    123457 20120304:11:00          0          0
    123457 20120304:17:30          0          1
    123457 20120305:08:25          0          0
    123457 20120305:08:36          1          1
    123457 20120305:11:23          2          2

Recency, Frequency, Monetary (RFM) Analysis: Part 2

Assume we have historical data about customers' responses to our direct mail offers, as shown below. Today we want to send them new letters. We believe that customers who responded to our offerings recently are more likely to respond to our future offers. A common practice is to divide customers into 5 or 10 equal-size buckets, called quintiles or deciles, based on how recently they responded.

        USER_ID CAT STATE_ID       CAMPAIGN_DATE CAMPAIGN_ID RESPONSE
---------- -- -------- --------- -------- --------
      1001 A  MA       01-MAY-12 A        N
      1001 A  MA       08-MAY-12 A        N
      1001 A  MA       15-MAY-12 A        Y
      1001 A  MA       22-MAY-12 A        N
      1001 A  MA       29-MAY-12 A        N
      1001 A  MA       06-JUN-12 A        N
      1001 B  CT       06-JUN-12 A        N
      1002 B  CT       01-MAY-12 A        N
      1002 B  CT       08-MAY-12 A        N
      1002 B  CT       15-MAY-12 A        Y
      1002 B  CT       22-MAY-12 A        N
      1002 B  CT       29-MAY-12 A        Y

The following query calculates quintiles and deciles for each customer based on how recently they responded to our offerings. The more recent the response, the higher the number for quintiles or deciles.

with temp_tbl as
(
select 
user_id, max(campaign_date) recent_response  from TBL_DIRECT_MAILING a where RESPONSE='Y'
group by user_id
),
temp_tbl2 as
(
select a.user_id, sysdate-recent_response as recency from 
(select distinct user_id from TBL_DIRECT_MAILING) a,  
temp_tbl b
where a.user_id=b.user_id(+)
)
select user_id,
round(recency) recency_days,
ceil(cume_dist() over(order by recency desc nulls first)*5) recency_quintile,
ceil(cume_dist() over(order by recency desc nulls first)*10) recency_decile
from temp_tbl2
order by user_id;

The output of the query looks like the following. Customers with higher quintile or decile numbers are more likely to respond to our offerings. Of course, the most accurate way to predict a customer's response is to build a predictive model that uses the recency as one of its input variables.

  USER_ID RECENCY_DAYS RECENCY_QUINTILE RECENCY_DECILE

--------- ------------ ---------------- --------------
     1001          151                3              5
     1002           30                5              9
     1003           72                2              4

Recency, Frequency, Monetary (RFM) Analysis: Part 1

RFM (Recency, Frequency, Monetary) analysis captures the context information about an account or customer. The result of RFM analysis is a set of variables that can be used for ad hoc queries, segmentation, or predictive modeling to identify valuable customers, detect fraud, improve direct mailing response rates, etc.

RFM analysis is usually performed on a "transaction" data set, where each record represents a transaction at a specific point in time. For example, the following is a sequence of credit card transactions.

We talk about the recency factor first. In the following table, transaction 4 is the most suspicious. Not only is it an expensive electronics purchase, but the transaction immediately before it is a $1 gas station purchase that happened 11 minutes earlier. Fraudsters often use gas stations to test credit cards and make sure they work before making a big purchase at a store.
CARD_NUMBER     TXN_ID TXN_DATE   TXN_TYPE        TXN_AMT
----------- ---------- ---------- ------------ ----------
     123457          1 0304:11:00 Grocery           85.35
     123457          2 0304:17:30 Electronic      2100.75
     123457          3 0305:08:25 Gas Station           1
     123457          4 0305:08:36 Electronic       1435.2
     123457          5 0305:11:23 Grocery            55.3
Thus, the recency is an important consideration. It provides the context information about a transaction. When we build predictive models or write ad hoc fraud detection rules, including recency factors can greatly improve detection accuracy.

Another example is predicting whether a prospect will respond to a direct mail offering. The following table shows a prospect's responses over 5 weeks to 5 products. The reason that she responds to the offering of product C in week 3 could be related to the offering of B one week before. For example, a person is more likely to accept a credit card with a 6.99% APR if she was offered a 17.99% APR earlier. Again, the recency factors are important.
User_ID  User_Category  State  Week  Product_ID  Response
1001     A              MA     1     A           No
1001     A              MA     2     B           No
1001     A              MA     3     C           Yes
1001     A              MA     4     D           No
1001     A              MA     5     E           No
One of the convenient ways to calculate recency factors is the SQL analytic function lag(). For example, the following query calculates three recency factors for credit card transactions, i.e., the time since the most recent transaction, the most recent transaction type and the most recent transaction amount. The "partition by card_number" clause means the calculation is done independently for each card number.
select 
a.*,
( txn_date-lag(txn_date ) over(partition by card_number  order by txn_date)) time_since_recent_txn,
lag(txn_type) over(partition by card_number  order by txn_date) last_recent_type ,
lag(txn_amt)  over(partition by card_number order by txn_date) last_recent_amt
from TBL_CARD_TXN a order by txn_date;

When there are time/date variables in the data, it is extremely important to take advantage of them by using RFM analysis to create derived variables. Those variables can then be used as inputs to predictive models. Please see another post Build Best Direct Marketing Predictive Models.


Wednesday, September 12, 2012

A random sampling method for medical claim audit

Random sampling processes can become very sophisticated. One such example is a stratified random sampling method developed by the Department of Social and Health Services (DSHS) to audit hospital Medicaid claims. This method is described in the paper "Analysis of the Regression Estimator and its Implementation by DSHS" (http://www.hca.wa.gov/documents/rfp/recovery_audit/exhibit_b.pdf).

Hospitals provide services to patients who are covered by Medicaid and then file claims with the state to get paid. State governments want to perform post-payment reviews to make sure that they do not overpay hospitals. Because the number of claims from a hospital can be very large, only a random sample of claims will be audited.

The 25 claims with the highest amounts are always reviewed. For the remaining claims, the process first divides them into many buckets, or strata, based on their claim amount. Random sampling is then performed within each stratum. The higher the variance in claim amount within a stratum, the more samples are drawn from it. Please notice that the sample size is NOT proportional to the population size of the stratum.

The most common ways to implement the above methodology are probably SAS or SQL. I have seen the SAS version of it. I also implemented two SQL versions. One SQL version strictly follows the steps in the SAS package. The other SQL version uses Oracle analytic functions heavily, which results in a 75% reduction in the volume of code. These analytic functions include row_number() over(), width_bucket(), sum() over(), count() over(), etc.
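As an illustration of the stratification step only (not the DSHS method itself), width_bucket and an analytic row_number make the SQL compact. The table, the bucket boundaries and the per-stratum sample size below are assumptions.

-- A minimal sketch of stratified random sampling, assuming a hypothetical table tbl_claims(claim_id, amt):
-- 4 amount-based strata over 0-10000 and a fixed sample of 100 claims per stratum.
with strata as (
  select claim_id, amt,
         width_bucket(amt, 0, 10000, 4) stratum,
         row_number() over(partition by width_bucket(amt, 0, 10000, 4)
                           order by dbms_random.value) rn
  from tbl_claims
)
select claim_id, amt, stratum from strata where rn <= 100;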

Tuesday, September 04, 2012

Predictive Modeling vs Intuitive Business Rules

Many organizations have analysts who create business rules to identify the most profitable customers, detect credit card or medical claim fraud, etc. Analysts in a company may maintain hundreds or even thousands of rules and add new ones regularly. Those rules are usually derived from human intuition and experience. The following are two imaginary credit card fraud detection rules. (They are for illustration purposes only, not actual rules used by any bank.)
1. High frequency rule: if the number of transactions from a card in the past 7 days is above 75, then the current transaction is fraud.
2. High dollar amount rule: if the amount of transactions from a card in the past 7 days is above $95,000, then the current transaction is fraud.
We can plot credit card transactions on a plane as shown below. The x-axis is the number of transactions in the last 7 days. The y-axis is the total dollar amount of transactions in the last 7 days. Based on the two variables, every transaction is represented by a dot on the plane. In this plot, red dots represent fraudulent transactions and blue dots represent good ones. Fraudulent transactions normally have higher frequency and cumulative amount since fraudsters spend money more aggressively. Thus most of the red dots are located in the upper right corner (high frequency and high cumulative amount).
A high frequency rule detects 2 red dots (yellow rectangle) and a high dollar amount rule detects 2 red dots (blue rectangle). In total, we detect 4 red dots. If we want to detect more red dots, we have to lower the thresholds of those rules, which will lead to many normal transactions being mistakenly marked as fraud.
The most effective way to separate red and blue dots is the straight line that runs in a northwest-southeast direction, as shown in the figure. Unfortunately, the line cannot be described by intuitive rules. What a statistical predictive model does is find such a line by learning from the data. A statistical predictive model cannot be described intuitively, but it is far more accurate. It is common to see a single predictive model outperform hundreds or thousands of intuitive business rules combined. There is a newer post comparing predictive models and intuitive rules from 13 aspects.
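To make the contrast concrete, the separating line corresponds to a single weighted combination of the two variables rather than two independent thresholds. The sketch below is not an actual model; the table name, weights and threshold are assumptions for illustration only.

-- A minimal sketch: a linear score over the two 7-day variables plays the role of the line.
select txn_id,
       case when 0.02*num_txn_7d + 0.0001*amt_txn_7d > 3
            then 'FLAG' else 'OK' end fraud_flag
from tbl_txn_summary;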

How easy is it for a SQL developer/analyst to become a data miner?

An Oracle SQL developer/analyst can become a competent data miner by learning one PL/SQL procedure and one SQL function:
1. A PL/SQL procedure create_model in dbms_data_mining package.
2. A SQL function prediction_probability.
The good thing is that learning the above items stays within the framework of Oracle PL/SQL. There is no need to study another special data mining/statistical language such as SAS or R. There is no need to install another standalone data mining package. Everything happens within the database, using the PL/SQL or SQL that she is already familiar with. Thus it is possible for an Oracle SQL developer/analyst to become a very productive data miner. As the majority of enterprise data exists in databases, SQL programmers have an advantage over SAS or R programmers in that they are closer to the data. If we consider data mining as just more sophisticated SQL queries, data mining is simply a natural extension of a SQL programmer/analyst's daily job, i.e., querying the database.
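The sketch below puts the two items together. The settings table, model name, input tables and target column are hypothetical; see the post Build Predictive Models Using PL/SQL for a fuller example.

-- A minimal sketch. Step 1: train a classification model.
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'model_likely_to_pay',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'tbl_training_data',
    case_id_column_name => 'app_id',
    target_column_name  => 'paid_flag',
    settings_table_name => 'cls_settings');
END;
/
-- Step 2: score new records with the SQL function.
select app_id, prediction_probability(model_likely_to_pay, 1 using *) prob
from tbl_new_apps;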

3 most useful little SQL functions for data manipulation

I have found the following three little SQL functions useful for data manipulation. They have been used in almost all of my data mining projects over the years. They are not fancy, but very convenient.

1. Mod function.
Mod(m,n) in Oracle returns the remainder of m divided by n. For example, mod(13,7)=6 because 6 is the remainder of 13 divided by 7. If m and n are positive integers, mod(m,n) is always within the range 0 to n-1. We can use the mod function for random sampling. In SQL Server, the equivalent is the % operator (m%n).

select record_id from tbl_test2 order by record_id;
 RECORD_ID
----------
         1
         2
         3
         4
         5
         6
         7
The following query picks every other record id.
select record_id from tbl_test2 where mod(record_id,2)=1 order by record_id;
 RECORD_ID
----------
         1
         3
         5
         7
The following query picks every third record id.
select record_id from tbl_test2 where mod(record_id,3)=1 order by record_id;
 RECORD_ID
----------
         1
         4
         7
We can combine the mod and dbms_random.random functions to perform random sampling. See the previous post "Random Sampling using SQL".
2. Nvl function
The Oracle nvl(x,y) function replaces null values of x with y. The following query replaces null values with 'X':

select var, nvl(a.var, 'X')  from tbl_test3;
VAR                              NVL(A.VAR,'X')
-------------------------------- --------------------------------
A                                A
B                                B
                                 X
C                                C
3. Decode function
The decode function maps values of a variable to other values. In a number of projects, I have used the decode function to convert categories into weights (derived from logistic regression models). The following example assigns a weight of -.420916 to A, -.028324 to B and so on. A default weight of 0 is assigned to all other categories that are not listed.
decode(var,
'A ',-.420916,
'B',-.028324,
'C',-.186108,
'D',-1.165688,
'E',-.016754,
'F',-.153281,
'G',-.010159,
'H',-1.136179,
0)

Sunday, September 02, 2012

Dealing with missing values in data mining

We have found that when analyzing large amounts of data such as financial transactions, medical claims, cell phone calls, and credit information, missing values are common. Simply ignoring data points with missing values is not recommended in most cases. There are a number of ways to handle missing values for categorical and continuous variables.

The following examples illustrate how to replace missing values in categorical variables.
Original data:
Record_ID   Variable_X
 1                     A
 2                     B
 3                     A
........................
95                  Missing

Method 1. Replace missing value with a special value (e.g. X)
Record_ID   Variable_X
 1                     A
 2                     B
 3                     A
........................
95                   X
In Oracle, this can be done easily using the nvl() function. The following SQL script performs the replacement.
select nvl(Variable_X,'X')  as Variable_X, ..... from original_table;
Method 2. Replace missing value with the most frequent value.
Record_ID   Variable_X
 1                     A
 2                     B
 3                     A
........................
95                   A
The following SQL script performs the replacement.
select nvl(Variable_X,'A')  as Variable_X, ..... from original_table;

The following examples illustrate how to replace missing values in continuous variables.
Original data:
Record_ID   Variable_X
 1                     1.3
 2                     0.5
 3                     2.9
........................
95                  Missing

Method 3. Replace missing value with mean or median value.
Record_ID   Variable_X
 1                     1.3
 2                     0.5
 3                     2.9
........................
95                  1.56
The following SQL script performs the replacement.
select  nvl(Variable_X, 1.56) as Variable_X, ..... from original_table;

Method 4. We first convert the continuous variable into a categorical variable. This is also called binning. I will discuss binning in another post. We define missing as a special category.
The following SQL script implements the above logic.
case when Variable_X <=1.2 then 'A'
        when Variable_X <=1.8 then 'B'
        when Variable_X  > 1.8 then 'C'
       else 'Z' end  Variable_X_BIN

New Variable Variable_X_BIN.
Record_ID   Variable_X  --> Variable_X_BIN
 1                     1.3          --> B
 2                     0.5          --> A
 3                     2.9          --> C
........................
95                  missing     --> Z

There are more sophisticated methods that impute missing values using multivariate models, but they are rarely used in real-world applications.

Saturday, September 01, 2012

In-Database Data Mining

In previous post "Data Mining Components", we talk about four components or layers in a using data mining to solve business problems, e.g., fraud prevention, new customer life time value prediction, online ads click through rate, etc. Two abstract components are business problems and data mining algorithms. By abstract, we mean their existence is largely mental. They do not have physical forms and can not be purchased. Two physical components are data mining software and data management tools. Data mining software are the implementation of data mining algorithms. Data management tools could be relational databases or plain files.   It is important to realize that items within each layer are sometimes exchangeable. We can select the combination of data mining models, tools and databases that suit our needs.  In these years, we have found that performing data mining within database has its unique advantages.  In this in-database data mining scenario,  two physical components, i.e., data mining algorithm implementation and data management, are collapsed into a single component, Oracle database. For example, we have used Oracle Data Mining to build debit card and check deposit fraud  models. The in-database solution provides substantially increased security, productivity, manageability and scalability. This should not be a surprise because a database is designed to handle large data using SQL as a standard language in an enterprise environment. If we consider data mining as just more sophisticated SQL queries, nothing more, nothing less, a lot of difficulties in data mining practices will disappear naturally
       In-Database Data Mining 

Data Mining Components

We identify four components or layers in a data mining engagement as shown in the figure below. Two abstract layers, business problem and data mining algorithms, are on the top. And two physical layers, data mining tools and data management, are at the bottom.

The business problems that we want to solve form one of the most important abstract layers. They could be predicting fraud (bank card, check, medical claims), new customer lifetime value at the point of sale, online ad click rates, creditworthiness, or customer segmentation. We can address these business problems using various predictive models such as logistic regression, neural nets, support vector machines and K-means clustering.

The data mining tools layer contains commercial or open source software such as SAS, S-Plus, R, Weka, SPSS, StatSoft, and Oracle Data Mining. Common data mining algorithms can be found in almost all of the software mentioned above. The data management/storage layer consists of relational databases such as Oracle, SQL Server and MySQL, or simply files such as SAS files or text files.

We can predict whether a new customer will pay his car loan using a logistic regression model implemented in SAS and store the data in Oracle. Or we can solve the same problem using decision tree models implemented in R and store the data in SQL Server. It is important to realize that items within each layer are sometimes interchangeable. We can solve any business problem with a variety of data mining algorithms implemented by commercial or open source tools and store the data in any database. Thus it is a misconception that neural nets are best at predicting credit card fraud. An experienced data miner can build a decision tree model to predict credit card fraud that performs as well as a neural net does. We can select the combination of data mining models, tools and databases that suits our needs.
Four Components or Layers in Data Mining


Friday, August 31, 2012

Random Sampling in Oracle using Sample Clause

In an earlier post about "Random Sampling using SQL", we use functions mode and dbms_random.random to randomly sample 30% of records.  This can also be done easily with oracle sample clause as the following:
select * from TBL_TEST sample(30);
The sample clause is also a great way to test the performance of a query before running it on large tables.

Thursday, August 30, 2012

Select top percent in Oracle

Oracle analytic function cume_dist() calculates the cumulative percentage for a variable.

The following query selects the top 5% records based on variable num.
with tbl as
(
select a.*, cume_dist() over(order by num desc) cumulative_pcnt from tbl_test a
)
select * from tbl where cumulative_pcnt <= 0.05;

The following query selects the bottom 5% records based on variable num.
with tbl as
(
select a.*, cume_dist() over(order by num ) cumulative_pcnt from tbl_test a
)
select * from tbl where cumulative_pcnt <= 0.05;

The following query selects the top 5% records based on variable num by group.
with tbl as
(
select a.*, cume_dist() over(partition by grp order by num desc) cumulative_pcnt from tbl_test a
)
select * from tbl where cumulative_pcnt <= 0.05;


Wednesday, August 29, 2012

Random Sampling using SQL

The following SQL scripts create a view by taking 30% random samples from a table.

Step 1 . Create random numbers within range of 0 and 99 (100 bins).
create table tbl_app_id_w_rnd as select app_id, 
mod(abs(dbms_random.random()),100) rnd from CELL_PHONE_SERVICE_APPS;

Step 2. Create a view that contains 30% of the records.
create view v_data_testing
as select a.* from CELL_PHONE_SERVICE_APPS a,
tbl_app_id_w_rnd b where a.app_id=b.app_id and  rnd < 30;

Please see a newer post. dbms_random.value is a better function than dbms_random.random.

Build K-Means Clustering Models Using PL/SQL

As mentioned in an earlier post, we can build predictive models within Oracle using PL/SQL. One such model is K-Means clustering, which does not require a target variable. K-Means clustering iteratively assigns each record to the nearest cluster center and then recalculates the centers until the assignments stabilize.

Step 1 . Define K-Means clustering parameters
  CREATE TABLE km_settings (
   setting_name  VARCHAR2(30),   setting_value VARCHAR2(30));
BEGIN      
   INSERT INTO km_settings (setting_name, setting_value) VALUES
   (dbms_data_mining.kmns_distance, dbms_data_mining.kmns_euclidean);
   INSERT INTO km_settings (setting_name, setting_value) VALUES
   (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
  INSERT INTO km_settings (setting_name, setting_value) VALUES  (dbms_data_mining.clus_num_clusters, '7');
END;
/
Step 2. Build a K-Means Clustering model
define m_name='km_0425a'
define input_tbl='v_data_training_4_cls'
define rec_id='APP_ID'
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => '&m_name',
    mining_function     => dbms_data_mining.clustering,
    data_table_name     => '&input_tbl',
    case_id_column_name => '&rec_id',
    settings_table_name => 'km_settings');
END;
/
Again, once the model is built, it is a mining object. We can use the cluster_id() function to assign cluster membership to new data.
select app_id, cluster_id(km_0425a using *) cls from CELL_PHONE_SERVICE_APPS_NEW;