Saturday, September 29, 2012

Find the smallest or largest value in SQL

In Oracle we can use the least function to find the smallest value among many; the number of input arguments is flexible. Similarly, to find the largest value among many, we use the greatest function. The following are examples.
select least(0,1,2,3,4,4,5,6,7,9) from dual;

LEAST(0,1,2,3,4,4,5,6,7,9)
--------------------------
                         0

select greatest(0,1,2,3,4,4,5,6,7,9) from dual;

GREATEST(0,1,2,3,4,4,5,6,7,9)
-----------------------------
                            9
The following example shows that the least/greatest functions take multiple columns as inputs and return a result for each row. The aggregate functions min/max, in contrast, take only one column as a parameter and return a single result over all rows.
select p0, p1,p2, least(p0,p1,p2), greatest(p0,p1,p2) from TBL_TEST_DATA where rownum < 10;

        P0         P1         P2 LEAST(P0,P1,P2) GREATEST(P0,P1,P2)
---------- ---------- ---------- --------------- ------------------
   1086.62    1089.03    1084.38         1084.38            1089.03
   1214.19    1213.58    1200.94         1200.94            1214.19
   1268.42    1263.25    1257.68         1257.68            1268.42
   1505.11    1511.73    1515.61         1505.11            1515.61
    758.84     764.72     753.37          753.37             764.72
   1521.24    1532.53    1528.63         1521.24            1532.53
   1184.88    1181.19    1185.01         1181.19            1185.01
   1260.55    1266.51    1273.23         1260.55            1273.23
    1483.4    1484.26     1469.9          1469.9            1484.26
select min(p0), min(p1), min(p2), max(p0), max(p1), max(p2) from TBL_TEST_DATA where rownum < 10;

   MIN(P0)    MIN(P1)    MIN(P2)    MAX(P0)    MAX(P1)    MAX(P2)
---------- ---------- ---------- ---------- ---------- ----------
    758.84     764.72     753.37    1521.24    1532.53    1528.63
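
The two kinds of functions can also be nested. For example, to get the overall smallest and largest values across both the three columns and all rows (assuming p0, p1 and p2 are never NULL, since least and greatest return NULL if any argument is NULL):

-- least/greatest work per row; min/max then aggregate over rows
select min(least(p0, p1, p2))    as overall_min,
       max(greatest(p0, p1, p2)) as overall_max
  from TBL_TEST_DATA;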


Thursday, September 20, 2012

Comparison of Business Rules and Predictive Models

In many organizations, there are two ways to support decision making: intuitive business rules and predictive models. Business analysts build intuitive rules, while statisticians/data modelers use data-driven methods, mainly predictive models. The following is a comparison of the two from 13 different aspects. You may also want to check out an earlier post.
Examples
  Business rules: 1. A medical fraud rule: treatments for pregnancy discomfort for a male patient are fraud. 2. A direct sale rule: customers who purchase product A within 3 months will likely buy product B. 3. A mobile phone customer acquisition rule: if a new customer has a bankruptcy in the past, do not offer him a free phone.
  Predictive models: 1. A medical fraud detection model: an unsupervised model that detects anomalies in medical claims. 2. A credit card fraud detection model: a neural net that generates a fraud score for credit card transactions. 3. A customer acquisition model: a logistic regression model that predicts whether a new customer will pay her cell phone bill.

Owners
  Business rules: Business analysts who have experience in particular fields such as bank risk management, healthcare, etc. In their communications they use terms that are understood by business people. They believe there are explainable cause-effect relationships in predicting fraud, etc., and their job is to identify those rules.
  Predictive models: Statisticians/data miners see predictive models as solutions to a wide range of business problems. They believe that in reality the cause-effect relationship is too complex to be understood by human beings, and that a black-box, data-driven predictive model is the way to make the most accurate predictions.

Understandability
  Business rules: High. Rules are normally intuitive and easy to understand.
  Predictive models: Low. Most predictive models are black boxes.

Forms of Existence
  Business rules: SQL scripts, SAS scripts, Excel spreadsheets, Word documents, undocumented knowledge in people's heads, etc.
  Predictive models: SQL scripts, SAS scripts, Excel spreadsheets, R objects, Oracle mining objects, etc.

Initial Investment
  Business rules: Low.
  Predictive models: High. A data warehouse that archives historical data needs to be built, and data mining software such as SAS or Statistica may have to be purchased.

Effort to Build
  Business rules: Low. Analysts with domain knowledge can write rules without doing heavy data analysis work.
  Predictive models: High. Data miners have to go through the modeling process systematically to build predictive models.

Effort to Deploy into Production
  Business rules: Low. Business rules are usually simple scripts such as SQL.
  Predictive models: High. Predictive models, including the input variable calculations, can be very sophisticated; deploying a model into production takes a significant amount of effort.

Quantity
  Business rules: Many. For example, many banks or insurance companies have hundreds or thousands of business rules.
  Predictive models: Usually the number of predictive models is very small. For example, even some large banks or insurance companies have only 2 or 3 predictive models to detect fraud.

Perception of Productivity and Effectiveness
  Business rules: High, for two reasons: 1. there are hundreds or thousands of understandable rules; 2. new rules can be created on a daily basis. This gives management the impression that the group that builds them is productive and effective.
  Predictive models: Medium. It may take many days or weeks to build a predictive model before any result can be seen. Because a predictive model is mostly a black box, it is hard to explain to management how it works, so management may think the model-building group is not productive.

Actual Productivity and Effectiveness
  Business rules: Low to medium. See an earlier post. Since rules are usually created individually, there are redundancies among them, and the rules as a whole are not optimized.
  Predictive models: Highly productive and effective. It is common to see a single predictive model outperform hundreds or thousands of intuitive business rules combined.

Tools
  Business rules: Excel spreadsheets, SQL, special scripting languages, and others.
  Predictive models: SAS, SPSS, R, Oracle Data Mining, and others.

Adaptability
  Business rules: Low. Since rules are created manually by people based on their experience, they are harder to adapt to changing situations.
  Predictive models: High. Predictive models are refreshed once new data become available, and the refreshed model reflects the changes automatically.

Maintenance
  Business rules: Difficult. It is hard to maintain hundreds or thousands of rules. Ranking them based on their effectiveness helps.
  Predictive models: Since the number of predictive models is small, they are easier to maintain.

Longevity
  Business rules: Low. Business rules usually lose their effectiveness more quickly than predictive models do after their creation.
  Predictive models: High to medium. Predictive models maintain their effectiveness longer than business rules for two reasons: 1. they are built on large quantities of historical data; 2. they have the ability to generalize, meaning they do not simply memorize historical data but are structured to perform well on future data.

Friday, September 14, 2012

Data preparation for building logistic regression models

Logistic regression is a popular model for predicting a binary outcome, e.g., fraud or not, customer response or not, etc. When building such a model, there are two ways to represent the data. The first is the transactional format shown below, where each record is a single transaction and the target variable is binary.
                Data in transactional format
   USER_ID CAT STATE_CD CAMPAIGN_DATE CAMPAIGN_CD RESPONSE
---------- --- -------- ------------- ----------- --------
      1001 A   MA       01-MAY-12     A           N
      1001 A   MA       08-MAY-12     A           N
      1001 A   MA       15-MAY-12     A           Y
      1001 A   MA       22-MAY-12     A           N
      1001 A   MA       29-MAY-12     A           N
      1001 A   MA       06-JUN-12     A           N
      1001 B   CT       06-JUN-12     A           N
      1002 B   CT       01-MAY-12     A           N
      1002 B   CT       08-MAY-12     A           N
      1002 B   CT       15-MAY-12     A           Y
      1002 B   CT       22-MAY-12     A           N
      1002 B   CT       29-MAY-12     A           Y

If all the independent variables are categorical, we can convert the data from the transactional format into a more compact one by summarizing it with a SQL script similar to the following: we count the numbers of responses and non-responses for each unique combination of independent variable values. If we want, continuous variables can first be transformed into categorical ones using techniques like binning.

select cat, state_cd, campaign_cd,
sum(case when response='Y' then 1 else 0 end) num_response,
sum(case when response='N' then 1 else 0 end) num_no_response
from tbl_txn group by cat, state_cd, campaign_cd;
                 Data in the summary format
CAT STATE_CD CAMPAIGN_CD NUM_RESPONSE NUM_NO_RESPONSE
--- -------- ----------- ------------ ---------------
A   MA       A                    125            1025
B   CT       C                     75            2133
..........................

Summarizing the data first can greatly reduce its size and save memory when building the model. This is particularly useful if we use memory-based modeling tools such as R.

If we use R to build the logistic regression model, the script for training on data in the transactional format is similar to the following.
glm(formula = RESPONSE ~ CAT + STATE_CD + CAMPAIGN_CD,
    data = train.set1, family = binomial(link = "logit")) -> model1

The R script for building a logistic model from the data in the summary format is shown below; the two-column response cbind(NUM_RESPONSE, NUM_NO_RESPONSE) supplies the counts of successes and failures for each row (here train.set2 is assumed to hold the summary data).
glm(formula = cbind(NUM_RESPONSE, NUM_NO_RESPONSE) ~ CAT + STATE_CD + CAMPAIGN_CD,
    data = train.set2, family = binomial(link = "logit")) -> model2

Relational database and data mining algorithms

Some algorithms may appear prohibitively expensive computationally. One such example is finding, for each data point in table A, the nearest K neighboring data points in table B (shown in the figure below). On the surface, this involves calculating the distances between 2 trillion pairs of data points (1 million times 2 million).

However, with database techniques such as B-tree indexes, we can make the above computation very efficient. Thus it is helpful for algorithm people such as mathematicians to understand some database technologies. With these techniques, we can implement algorithms that would otherwise be hard to run at scale.
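
As an illustration of the idea, the following is a minimal sketch that pre-filters candidate pairs with an indexed range condition, so distances are computed only for points that are already close on one dimension. The tables table_a(id, x, y) and table_b(id, x, y), the index on table_b(x), and the search radius of 10 are all hypothetical.

-- Only table_b points whose x value falls within +/- 10 of the table_a
-- point are considered; the index on table_b(x) keeps this join from
-- scanning all 2 trillion pairs.
select a.id as a_id,
       b.id as b_id,
       sqrt(power(a.x - b.x, 2) + power(a.y - b.y, 2)) as dist
  from table_a a, table_b b
 where b.x between a.x - 10 and a.x + 10
   and sqrt(power(a.x - b.x, 2) + power(a.y - b.y, 2)) <= 10;

The K nearest neighbors of each point can then be taken from this much smaller candidate set, for example with row_number() over(partition by a_id order by dist).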

Thursday, September 13, 2012

Fuzzy Match Data from Different Sources without Exact Keys

Merging data from different sources is probably one of the most challenging data preparation tasks. For example, debit card authorizations are captured in real time at the point of sale, but the actual settlement of the transactions may take a few days, and the settlement date, amount and merchant name can all differ from those in the authorization data. Matching transactions between these two tables is therefore very challenging. The following two tables illustrate the issue.
Bank card authorizations.
CARD_NUMBER TXN_DATE       MERCHANT_NAME                   AMT
----------- -------------- ------------------------ ----------
     123457 20120304:11:00 WALMART                       85.35
     123457 20120304:17:30 ABC Restaurant                 56.2
     123457 20120305:08:25 ABC Restaurant                   23
     123457 20120305:08:36 WAL_MART                      135.2

Bank card settlements.
CARD_NUMBER TXN_DATE       MERCHANT_NAME                   AMT
----------- -------------- ------------------------ ----------
     123457 20120305:00:00 WAL**                         85.35
     123457 20120305:00:00 ABC                              70
     123457 20120306:00:00 WAL-M                         135.2
     123457 20120315:00:00 A** Restaurant                   29
To match authorizations with their settlements, a simple table join will obviously not work because there are no exact keys to link them. Probably the only way to do this is a fuzzy, iterative matching approach. We can measure three distances between an authorization and a settlement: merchant name distance, transaction amount difference and transaction date difference.
1. Merchant name distance can be measured using the following function.
100-utl_match.EDIT_DISTANCE_SIMILARITY(upper(a.merchant_name), upper( b.merchant_name))
For exact string matches, utl_match.EDIT_DISTANCE_SIMILARITY returns 100. For no matches, it returns zero.
2. Transaction amount difference is  abs(a.amt-b.amt) .
3. Transaction date difference is trunc(b.txn_date,'DDD')-a.txn_date
Once we have calculated the above three distances for each pair of records in the authorization and settlement data sets, we can assign a weight to each of them and combine them into a single distance that measures the similarity between two records, i.e., record_distance = w1*merchant_name_distance + w2*amount_distance + w3*date_distance. We then identify the pairs with the smallest distances for each card. Those pairs are considered successful matches and are removed from the data. We repeat this matching process on the remaining data until we are happy with the results. The process can be fully automated. I have used this approach to merge about 50 million authorizations from 3 million debit cards, and it took only about an hour.
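
The per-pair distance can be computed directly in SQL. The following is a minimal sketch of the idea; the table names tbl_auth and tbl_settle and the weights 1, 2 and 10 are hypothetical illustrations, and in practice the weights would be tuned to the data.

-- Candidate pairs: same card, settlement on or after the authorization day,
-- scored by a weighted combination of the three distances.
select a.card_number,
       a.txn_date as auth_date,
       s.txn_date as settle_date,
       (1  * (100 - utl_match.edit_distance_similarity(
                       upper(a.merchant_name), upper(s.merchant_name)))
      + 2  * abs(a.amt - s.amt)
      + 10 * (trunc(s.txn_date,'DDD') - a.txn_date)) as record_distance
  from tbl_auth a, tbl_settle s
 where a.card_number = s.card_number
   and s.txn_date >= trunc(a.txn_date,'DDD')
 order by a.card_number, record_distance;

The best match per authorization can then be picked with row_number() over(partition by a.card_number, a.txn_date order by record_distance), removed from both tables, and the query rerun on the remaining rows.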

Recency, Frequency, Monetary (RFM) Analysis: Part 5. RFM Code

To calculate the monetary variable, we sum up the amounts of the responded offers for each customer (the analytic function sum(amount) over() can also be used for this). We can then combine the quintiles for recency, frequency and monetary into an RFM code for each customer, which is done with the following query. RFM code 555 is the best group of customers.


   USER_ID CAT STATE_ID CAMPAIGN_DATE CAMPAIGN_ID RESPONSE     AMOUNT
---------- --- -------- ------------- ----------- -------- ----------
      1001 A   MA       29-MAY-12     A           N
      1001 A   MA       22-MAY-12     A           N
      1001 A   MA       15-MAY-12     A           Y                50
      1001 A   MA       08-MAY-12     A           N
      1001 A   MA       01-MAY-12     A           N
      1001 B   CT       06-JUN-12     A           N
      1001 A   MA       06-JUN-12     A           N
      1002 B   CT       08-MAY-12     A           N
      1002 B   CT       15-MAY-12     A           Y                50
      1002 B   CT       22-MAY-12     A           N
      1002 B   CT       29-MAY-12     A           Y                50
      1002 B   CT       01-MAY-12     A           N
with temp_tbl as
(
select 
user_id, 
min(sysdate-campaign_date) as recency,
count(1) frequency, 
sum(amount) as sum_amount
  from TBL_DIRECT_MAILING a where RESPONSE='Y'
group by user_id
),
temp_tbl2 as
(
select 
a.user_id,  
recency,
frequency, 
sum_amount from 
(select distinct user_id from TBL_DIRECT_MAILING) a,  
temp_tbl b
where a.user_id=b.user_id(+)
),
temp_tbl3 as
(
select user_id,
ceil(cume_dist() over(order by recency desc nulls first)*5) recency_quintile,
ceil(cume_dist() over(order by frequency nulls first)*5) frequency_quintile,
ceil(cume_dist() over(order by sum_amount  nulls first)*5) amt_quintile
from temp_tbl2)
select user_id, recency_quintile*100+frequency_quintile*10+amt_quintile as RFM_Code from
temp_tbl3 order by user_id;

The result may look like the following.

   USER_ID   RFM_CODE
---------- ----------
      1001        333
      1002        555
      1003        233

Recency, Frequency, Monetary (RFM) Analysis: Part 4

Just as we do for the recency variable, we can divide customers into 5 or 10 equal-size buckets, called quintiles or deciles, based on how frequently they respond.
   USER_ID CAT STATE_ID CAMPAIGN_DATE CAMPAIGN_ID RESPONSE
---------- --- -------- ------------- ----------- --------
      1001 A   MA       01-MAY-12     A           N
      1001 A   MA       08-MAY-12     A           N
      1001 A   MA       15-MAY-12     A           Y
      1001 A   MA       22-MAY-12     A           N
      1001 A   MA       29-MAY-12     A           N
      1001 A   MA       06-JUN-12     A           N
      1001 B   CT       06-JUN-12     A           N
      1002 B   CT       01-MAY-12     A           N
      1002 B   CT       08-MAY-12     A           N
      1002 B   CT       15-MAY-12     A           Y
      1002 B   CT       22-MAY-12     A           N
      1002 B   CT       29-MAY-12     A           Y
The following query calculates quintiles and deciles for each customer based on how frequently they responded to our offers. The more frequent the responses, the higher the quintile or decile number.
with temp_tbl as
(
select 
user_id, count(1) num_of_response  from TBL_DIRECT_MAILING a where RESPONSE='Y'
group by user_id
),
temp_tbl2 as
(
select a.user_id, num_of_response  from 
(select distinct user_id from TBL_DIRECT_MAILING) a,  
temp_tbl b
where a.user_id=b.user_id(+)
)
select user_id,
num_of_response,  
ceil(cume_dist() over(order by num_of_response nulls first)*5) frequency_quintile,
ceil(cume_dist() over(order by num_of_response nulls first)*10) frequency_decile
from temp_tbl2
order by user_id;

The output of the query looks like the following. 

   USER_ID NUM_OF_RESPONSE FREQUENCY_QUINTILE FREQUENCY_DECILE
--------- --------------- ------------------ ----------------
     1001               1                  3                5
     1002               2                  5                9
     1003               1                  3                5
     1004               2                  5                9

Recency, Frequency, Monetary (RFM) Analysis: Part 3

Frequency measures how many transactions happen within a certain period of time. Usually, the more frequently a customer has responded to direct mailing offers in the past, the more likely she is to make additional purchases. In the case of credit card fraud, fraudsters tend to make more frequent purchases than normal consumers do. Frequency variables can be calculated using SQL analytic functions such as count() over().

Again we use the following credit card transactions as examples.
CARD_NUMBER     TXN_ID TXN_DATE   TXN_TYPE        TXN_AMT
----------- ---------- ---------- ------------ ----------
     123457          1 0304:11:00 Grocery           85.35
     123457          2 0304:17:30 Electronic      2100.75
     123457          3 0305:08:25 Gas Station           1
     123457          4 0305:08:36 Electronic       1435.2
     123457          5 0305:11:23 Grocery            55.3

The following query calculates the numbers of transactions in the past 6 hours (0.25 day) and 12 hours (0.5 day) before the current transaction.
select 
card_number, 
to_char(TXN_DATE,'YYYYMMDD:HH24:MI') txn_date, 
count(1) over(partition by CARD_NUMBER order by txn_date range between 0.25 preceding and 0.0000001 preceding) num_in_6h,
count(1) over(partition by CARD_NUMBER order by txn_date range between 0.5 preceding and 0.0000001 preceding) num_in_12h
from TBL_CARD_TXN   a
order by CARD_NUMBER, txn_date;


CARD_NUMBER TXN_DATE        NUM_IN_6H NUM_IN_12H
---------- -------------- ---------- ----------
    123457 20120304:11:00          0          0
    123457 20120304:17:30          0          1
    123457 20120305:08:25          0          0
    123457 20120305:08:36          1          1
    123457 20120305:11:23          2          2
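
The same windows can be used to derive monetary context variables as well. For example, the following sketch computes the dollar amounts spent in the same 6- and 12-hour windows from the same TBL_CARD_TXN table.

select
card_number,
to_char(txn_date,'YYYYMMDD:HH24:MI') txn_date,
-- total dollars spent in the 6 and 12 hours before the current transaction
nvl(sum(txn_amt) over(partition by card_number order by txn_date
    range between 0.25 preceding and 0.0000001 preceding), 0) amt_in_6h,
nvl(sum(txn_amt) over(partition by card_number order by txn_date
    range between 0.5 preceding and 0.0000001 preceding), 0) amt_in_12h
from TBL_CARD_TXN a
order by card_number, txn_date;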

Recency, Frequency, Monetary (RFM) Analysis: Part 2

Assume we have historical data about customers' responses to our direct mail offers, as shown below. Today we want to send them new letters. We believe that customers who responded to our offers recently are more likely to respond to future offers. We can divide customers into 5 or 10 equal-size buckets, called quintiles or deciles, based on how recently they responded. This practice is common.

   USER_ID CAT STATE_ID CAMPAIGN_DATE CAMPAIGN_ID RESPONSE
---------- --- -------- ------------- ----------- --------
      1001 A   MA       01-MAY-12     A           N
      1001 A   MA       08-MAY-12     A           N
      1001 A   MA       15-MAY-12     A           Y
      1001 A   MA       22-MAY-12     A           N
      1001 A   MA       29-MAY-12     A           N
      1001 A   MA       06-JUN-12     A           N
      1001 B   CT       06-JUN-12     A           N
      1002 B   CT       01-MAY-12     A           N
      1002 B   CT       08-MAY-12     A           N
      1002 B   CT       15-MAY-12     A           Y
      1002 B   CT       22-MAY-12     A           N
      1002 B   CT       29-MAY-12     A           Y

The following query calculates quintiles and deciles for each customer based on how recently they responded to our offers. The more recent the response, the higher the quintile or decile number.

with temp_tbl as
(
select 
user_id, max(campaign_date) recent_response  from TBL_DIRECT_MAILING a where RESPONSE='Y'
group by user_id
),
temp_tbl2 as
(
select a.user_id, sysdate-recent_response as recency from 
(select distinct user_id from TBL_DIRECT_MAILING) a,  
temp_tbl b
where a.user_id=b.user_id(+)
)
select user_id,
round(recency) recency_days,
ceil(cume_dist() over(order by recency desc nulls first)*5) recency_quintile,
ceil(cume_dist() over(order by recency desc nulls first)*10) recency_decile
from temp_tbl2
order by user_id;

The output of the query looks like the following. Customers with higher quintile or decile numbers are more likely to respond to our offerings. Of course, the most accurate way to predict a customer's response is to build a predictive model that uses the recency as one of its input variables.

   USER_ID RECENCY_DAYS RECENCY_QUINTILE RECENCY_DECILE
---------- ------------ ---------------- --------------
     1001          151                3              5
     1002           30                5              9
     1003           72                2              4

Recency, Frequency, Monetary (RFM) Analysis: Part 1

RFM (Recency, Frequency, Monetary) analysis captures the context information about an account or customer. The result of RFM analysis is a set of variables that can be used for ad hoc queries, segmentation, or predictive modeling to identify valuable customers, detect fraud, improve direct mailing response rates, etc.

RFM analysis is usually performed on a "transaction" data set, where each record represents a transaction at a specific point in time. For example, the following is a sequence of credit card transactions.

We talk about the recency factor first. In the following table, transaction 4 is the most suspicious. Not only is it an expensive electronics purchase, but the transaction immediately before it is a $1 gas station purchase that happened only 11 minutes earlier. Fraudsters often use a gas station purchase to test a credit card and make sure it works before they make a big purchase at a store.
CARD_NUMBER     TXN_ID TXN_DATE   TXN_TYPE        TXN_AMT
----------- ---------- ---------- ------------ ----------
     123457          1 0304:11:00 Grocery           85.35
     123457          2 0304:17:30 Electronic      2100.75
     123457          3 0305:08:25 Gas Station           1
     123457          4 0305:08:36 Electronic       1435.2
     123457          5 0305:11:23 Grocery            55.3
Thus recency is an important consideration: it provides context information about a transaction. When we build predictive models or write ad hoc fraud detection rules, including recency factors can greatly improve detection accuracy.

Another example is predicting whether a prospect responds to a direct mail offer. The following table shows a prospect's responses to 5 products over 5 weeks. The reason she responds to the offer of product C in week 3 could be related to the offer of B one week earlier. For example, a person is more likely to accept a credit card with a 6.99% APR if she was offered a 17.99% APR earlier. Again, the recency factors are important.
User_ID User_Category State Week Product_ID Response
   1001 A             MA       1 A          No
   1001 A             MA       2 B          No
   1001 A             MA       3 C          Yes
   1001 A             MA       4 D          No
   1001 A             MA       5 E          No
One convenient way to calculate recency factors is the SQL analytic function lag(). For example, the following query calculates three recency factors for credit card transactions: the time since the most recent transaction, the most recent transaction type and the most recent transaction amount. The partition by card_number clause means the calculation is done independently for each card number.
select 
a.*,
( txn_date-lag(txn_date ) over(partition by card_number  order by txn_date)) time_since_recent_txn,
lag(txn_type) over(partition by card_number  order by txn_date) last_recent_type ,
lag(txn_amt)  over(partition by card_number order by txn_date) last_recent_amt
from TBL_CARD_TXN a order by txn_date;
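
These derived columns can feed a predictive model or a simple rule. For illustration only (the thresholds are hypothetical, not an actual bank rule), the following sketch flags the pattern described above: a large purchase shortly after a very small one on the same card.

-- Flag transactions over $1000 that occur within 30 minutes (30/1440 day)
-- of a previous transaction under $5 on the same card.
select t.*,
       case when txn_amt > 1000
             and last_recent_amt < 5
             and time_since_recent_txn < 30/1440
            then 'SUSPICIOUS' else 'OK' end as recency_flag
  from (select a.*,
               (txn_date - lag(txn_date) over(partition by card_number order by txn_date)) time_since_recent_txn,
               lag(txn_amt) over(partition by card_number order by txn_date) last_recent_amt
          from TBL_CARD_TXN a) t
 order by card_number, txn_date;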

When there are time/date variables in the data, it is extremely important to take advantage of them by using RFM analysis to create derived variables. Those variables can then be used as inputs to predictive models. Please see another post Build Best Direct Marketing Predictive Models.


Wednesday, September 12, 2012

A random sampling method for medical claim audit

Random sampling processes can become very sophisticated. One such example is a stratified random sampling method developed by the Department of Social and Health Services (DSHS) to audit hospital Medicaid claims. This method is described in the paper "Analysis of the Regression Estimator and its Implementation by DSHS" (http://www.hca.wa.gov/documents/rfp/recovery_audit/exhibit_b.pdf).

Hospitals provide services for patients who are covered by Medicaid and then file claims to the states to get paid. State governments want to perform post-payment reviews to make sure that they do not overpay hospitals. Because the number of claims from a hospital can be very large, only a random sample of claims is audited.

The 25 claims with the highest amounts are always reviewed. For the remaining claims, the process first divides them into many buckets, or strata, based on their claim amounts. Random sampling is then performed within each stratum: the higher the variance of claim amounts in a stratum, the more samples are drawn from it. Note that the sample size is NOT proportional to the population size of the stratum.

The most common ways to implement the above methodology are probably SAS or SQL. I have seen the SAS version of it, and I have implemented two SQL versions. One SQL version strictly follows the steps in the SAS package. The other uses Oracle analytic functions heavily, which results in a 75% reduction in the volume of code. These analytic functions include row_number() over(), width_bucket(), sum() over(), count() over(), etc.
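
To make the stratified sampling concrete, the following is a minimal sketch (not the DSHS implementation). The table tbl_claims(claim_id, claim_amt), the amount bounds of 0 to 100,000, the ten equal-width strata, and the fixed per-stratum sample size of 30 are all hypothetical; in the real method the per-stratum size depends on the stratum's variance.

with strata as (
  select claim_id,
         claim_amt,
         -- assign each claim to one of 10 equal-width amount buckets
         width_bucket(claim_amt, 0, 100000, 10) as stratum
    from tbl_claims
),
ranked as (
  select s.*,
         -- random order within each stratum
         row_number() over(partition by stratum order by dbms_random.value) rn
    from strata s
)
select claim_id, claim_amt, stratum
  from ranked
 where rn <= 30;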

Tuesday, September 04, 2012

Predictive Modeling vs Intuitive Business Rules

Many organizations have analysts who create business rules to identify the most profitable customers, detect credit card or medical claim fraud, etc. Analysts in a company may maintain hundreds or even thousands of rules and add new ones regularly. Those rules are usually derived from human intuition and experience. The following are two imaginary credit card fraud detection rules (they are for illustration purposes only, not actual rules used by any bank).
1. High frequency rule: If the number of transactions from a card in the past 7 days is above 75, then the current transaction is fraud.
2. High dollar amount rule: If the total dollar amount of transactions from a card in the past 7 days is above $95,000, then the current transaction is fraud.
We can plot credit card transactions on a plane as shown below. The x-axis is the number of transactions in the last 7 days and the y-axis is the total dollar amount of transactions in the last 7 days. Based on these two variables, every transaction is represented by a dot in the plane. Red dots represent fraud transactions and blue dots represent good ones. Fraud transactions normally have higher frequency and cumulative amount, since fraudsters spend money more aggressively, so most of the red dots are located in the upper right corner (high frequency and high cumulative amount).
The high frequency rule detects 2 red dots (yellow rectangle) and the high dollar amount rule detects 2 red dots (blue rectangle), so in total we detect 4 red dots. If we want to detect more red dots, we have to lower the thresholds of the rules, which will cause many normal transactions to be mistakenly marked as fraud.
The most effective way to separate the red and blue dots is the straight line running in the northwest-southeast direction shown in the figure. Unfortunately, that line cannot be described by intuitive rules. What a statistical predictive model does is find such a line by learning from the data. A statistical predictive model cannot be described intuitively, but it is far more accurate. It is common to see a single predictive model outperform hundreds or thousands of intuitive business rules combined. There is a newer post comparing predictive models and intuitive rules from 13 aspects.
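
To make the contrast concrete, the following SQL sketch compares the two rules with a simple linear score. The table tbl_txn_summary, the columns num_txn_7d and amt_txn_7d, and all weights and thresholds are hypothetical illustrations, not fitted values.

select txn_id,
       -- the two intuitive rules: each checks one variable against a threshold
       case when num_txn_7d > 75 or amt_txn_7d > 95000
            then 'FRAUD' else 'OK' end as rule_flag,
       -- a linear score combines both variables; its threshold corresponds
       -- to the diagonal separating line in the plot
       case when 0.02 * num_txn_7d + 0.00003 * amt_txn_7d > 2.5
            then 'FRAUD' else 'OK' end as model_flag
  from tbl_txn_summary;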

How easy is it for a SQL developer/analyst to become a data miner?

An Oracle SQL developer/analyst can become a satisfactory data miner by learning one PL/SQL procedure and one SQL function:
1. The PL/SQL procedure create_model in the dbms_data_mining package.
2. The SQL function prediction_probability.
A good thing is that learning these items stays within the framework of Oracle PL/SQL. There is no need to study another special data mining/statistical language such as SAS or R, and no need to install separate standalone data mining software. Everything happens within the database using the PL/SQL or SQL she is already familiar with, so an Oracle SQL developer/analyst can become a very productive data miner. Since the majority of enterprise data lives in databases, SQL programmers have the advantage over SAS or R programmers of being closer to the data. If we consider data mining as just more sophisticated SQL queries, it is simply a natural extension of a SQL programmer/analyst's daily job, i.e., querying the database.
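
As a rough sketch of what that workflow looks like (the table, column and model names below are hypothetical, and a real call typically also supplies a settings table that chooses the algorithm):

-- Train a classification model on a hypothetical table of customer records.
begin
  dbms_data_mining.create_model(
    model_name          => 'RESPONSE_MODEL',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'TBL_TRAINING_DATA',
    case_id_column_name => 'USER_ID',
    target_column_name  => 'RESPONSE');
end;
/

-- Score new records: the probability that RESPONSE = 'Y'.
select user_id,
       prediction_probability(RESPONSE_MODEL, 'Y' using *) as prob_response
  from TBL_NEW_DATA;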

3 most useful little SQL functions for data manipulation

I have found the following three little SQL functions useful for data manipulation. They have been used in almost all of my data mining projects over the years. They are not fancy, but they are very convenient.

1. Mod function.
Mod(m,n) in Oracle returns the remainder of m divided by n. For example, mod(13,7)=6 because 6 is the remainder of 13 divided by 7. If m and n are positive integers, mod(m,n) is always within the range 0 to n-1. We can use the mod function for sampling. In SQL Server, the equivalent is the operator m % n.

select record_id from tbl_test2 order by record_id;
 RECORD_ID
----------
         1
         2
         3
         4
         5
         6
         7
The following query picks every other record id.
select record_id from tbl_test2 where mod(record_id,2)=1 order by record_id;
 RECORD_ID
----------
         1
         3
         5
         7
The following query picks every third record id.
select record_id from tbl_test2 where mod(record_id,3)=1 order by record_id;
 RECORD_ID
----------
         1
         4
         7
We can combine the mod and dbms_random.random functions to perform random sampling. See the previous post "Random Sampling using SQL".
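For example, a minimal sketch of an approximately 10% random sample; dbms_random.random returns a signed integer, so we take its absolute value before applying mod:

-- Keep a record when its random integer falls into one of 10 equally likely
-- remainder classes; this selects roughly 10% of the rows.
select record_id
  from tbl_test2
 where mod(abs(dbms_random.random), 10) = 0;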
2. Nvl function
The Oracle nvl(x,y) function replaces a null value in x with y. The following query replaces null values with 'X'.

select var, nvl(a.var, 'X') from tbl_test3 a;
VAR                              NVL(A.VAR,'X')
-------------------------------- --------------------------------
A                                A
B                                B
                                 X
C                                C
3. Decode function
The decode function maps values of a variable to other values. In a number of projects, I used decode to convert categories into weights (derived from logistic regression models). The following example assigns a weight of -.420916 to A, -.028324 to B, and so on. A default weight of 0 is assigned to all other categories that are not listed.
decode(var,
'A',-.420916,
'B',-.028324,
'C',-.186108,
'D',-1.165688,
'E',-.016754,
'F',-.153281,
'G',-.010159,
'H',-1.136179,
0)
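
Weights like these can then be combined into a raw score and turned into a probability with the logistic function. The following is a minimal sketch, assuming a hypothetical table tbl_customers with the categorical column var and an illustrative intercept of -1.5.

select customer_id,
       -- logistic transform of (intercept + category weight)
       1 / (1 + exp(-( -1.5 + decode(var,
                                     'A', -.420916,
                                     'B', -.028324,
                                     'C', -.186108,
                                     0) ))) as predicted_probability
  from tbl_customers;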

Sunday, September 02, 2012

Dealing with missing values in data mining

We have found that when analyzing large amounts of data such as financial transactions, medical claims, cell phone calls, and credit information, missing values are common. Simply discarding data points with missing values is not recommended in most cases. There are a number of ways to handle missing values for categorical and continuous variables.

The following examples illustrate how to replace missing values in categorical variables.
Original data:
Record_ID   Variable_X
 1                     A
 2                     B
 3                     A
........................
95                  Missing

Method 1. Replace missing value with a special value (e.g. X)
Record_ID   Variable_X
 1                     A
 2                     B
 3                     A
........................
95                   X
In Oracle, this can be done easily using the nvl() function. The following SQL script performs the replacement.
select nvl(Variable_X,'X')  as Variable_X, ..... from original_table;
Method 2. Replace missing value with the most frequent value.
Record_ID   Variable_X
 1                     A
 2                     B
 3                     A
........................
95                   A
The following SQL script performs the replacement.
select nvl(Variable_X,'A')  as Variable_X, ..... from original_table;

The following examples illustrate how to replace missing values in continuous variables.
Original data:
Record_ID   Variable_X
 1                     1.3
 2                     0.5
 3                     2.9
........................
95                  Missing

Method 3. Replace missing value with mean or median value.
Record_ID   Variable_X
 1                     1.3
 2                     0.5
 3                     2.9
........................
95                  1.56
The following SQL script performs the replacement.
select  nvl(Variable_X, 1.56) as Variable_X, ..... from original_table;
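
Instead of hardcoding the mean (1.56 above), it can be computed on the fly with a scalar subquery against the same table:

-- Replace missing values with the column mean computed from the table itself.
select nvl(Variable_X,
           (select round(avg(Variable_X), 2) from original_table)) as Variable_X
  from original_table;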

Method 4. We first convert the continuous variable into a categorical one. This is also called binning, which I will discuss in another post. Missing is defined as its own special category.
The following SQL script performs the above logic.
case when Variable_X <=1.2 then 'A'
        when Variable_X <=1.8 then 'B'
        when Variable_X  > 1.8 then 'C'
       else 'Z' end  Variable_X_BIN

New Variable Variable_X_BIN.
Record_ID   Variable_X      Variable_X_BIN
 1          1.3         --> B
 2          0.5         --> A
 3          2.9         --> C
........................
95          Missing     --> Z

There are more sophisticated methods that impute missing values using multivariate models, but they are rarely used in real-world applications.

Saturday, September 01, 2012

In-Database Data Mining

In previous post "Data Mining Components", we talk about four components or layers in a using data mining to solve business problems, e.g., fraud prevention, new customer life time value prediction, online ads click through rate, etc. Two abstract components are business problems and data mining algorithms. By abstract, we mean their existence is largely mental. They do not have physical forms and can not be purchased. Two physical components are data mining software and data management tools. Data mining software are the implementation of data mining algorithms. Data management tools could be relational databases or plain files.   It is important to realize that items within each layer are sometimes exchangeable. We can select the combination of data mining models, tools and databases that suit our needs.  In these years, we have found that performing data mining within database has its unique advantages.  In this in-database data mining scenario,  two physical components, i.e., data mining algorithm implementation and data management, are collapsed into a single component, Oracle database. For example, we have used Oracle Data Mining to build debit card and check deposit fraud  models. The in-database solution provides substantially increased security, productivity, manageability and scalability. This should not be a surprise because a database is designed to handle large data using SQL as a standard language in an enterprise environment. If we consider data mining as just more sophisticated SQL queries, nothing more, nothing less, a lot of difficulties in data mining practices will disappear naturally
       In-Database Data Mining 

Data Mining Components

We identify four components, or layers, in a data mining engagement, as shown in the figure below. Two abstract layers, business problems and data mining algorithms, are on top, and two physical layers, data mining tools and data management, are at the bottom.

The business problem that we want to solve is one of the most important abstract layers. It could be predicting fraud (bank card, check, medical claims), new customer lifetime value at the point of sale, online ads click rate, or creditworthiness, or performing customer segmentation. We can address these business problems using various models such as logistic regression, neural nets, support vector machines, and K-means clustering.

The data mining tools layer contains commercial or open source software such as SAS, S-Plus, R, Weka, SPSS, Statsoft Statistica, and Oracle Data Mining. Common data mining algorithms can be found in almost all of the software mentioned above. The data management/storage layer consists of relational databases such as Oracle, SQL Server and MySQL, or simply files such as SAS files or text files.

We can predict whether a new customer will pay his car loan using a logistic regression model implemented in SAS and store the data in Oracle, or we can solve the same problem using decision tree models implemented in R and store the data in SQL Server. It is important to realize that items within each layer are sometimes interchangeable: we can solve any business problem with a variety of data mining algorithms implemented by commercial or open source tools, and store the data in any database. Thus it is a misconception that neural nets are the best at predicting credit card fraud; an experienced data miner can build a decision tree model for credit card fraud that performs as well as a neural net. We can select the combination of data mining models, tools and databases that suits our needs.
Four Components or Layers in Data Mining