Tuesday, January 15, 2013
How to build predictive models that win competitions. (continued)
To build predictive models that are accurate and robust, it is crucial to find input variables, or so-called feature variables, that are truly predictive of the target variable. This is even more important than the choice of model type. Good feature variables can greatly reduce the complexity of the learning task.
For example, suppose we want to build a model that predicts if a pair of vectors are parallel to each other. This is illustrated in the following chart.
We can represent the pair of vectors by the coordinates of their starting and ending points, i.e., (Xa1,Ya1,Xa2,Ya2,Xb1,Yb1,Xb2,Yb2).
Suppose we have a few hundred training samples (pairs of vectors) that are marked as parallel or not. The following are some examples.
Sample 1: variables (0,0,1,1,1,0,2,1) target: yes (parallel)
Sample 2: variables (0,0,2,2,2,0,4,2) target: yes (parallel)
Sample 3: variables (0,0,2,2,2,0,4,4) target: no (not parallel)
......
However, it would be extremely challenging to build predictive models using (Xa1,Ya1,Xa2,Ya2,Xb1,Yb1,Xb2,Yb2) directly as input variables. Most models would perform very poorly and be useless.
If we realize that we can compare the angles of the two vectors to decide whether they are parallel, we can build two new variables using the atan function as follows:
(atan((Ya2-Ya1)/(Xa2-Xa1)), atan((Yb2-Yb1)/(Xb2-Xb1)))
The atan function is the inverse tangent. Using the new variables, the above training samples become:
Sample 1: variables (0.785, 0.785) target: yes (parallel)
Sample 2: variables (0.785, 0.785) target: yes (parallel)
Sample 3: variables (0.785, 1.107) target: no (parallel)
......
This representation greatly simplifies the problem. When the two numbers are close, the two vectors are parallel.
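The angle features above can be sketched in a few lines of Python (an illustrative sketch; the function names are mine, and I use atan2 instead of atan(dy/dx) so that vertical vectors do not cause a division by zero):

```python
import math

def angle_features(xa1, ya1, xa2, ya2, xb1, yb1, xb2, yb2):
    """Reduce the eight raw coordinates to two angle features.

    atan2 handles vertical vectors, where atan((y2-y1)/(x2-x1))
    would divide by zero.
    """
    angle_a = math.atan2(ya2 - ya1, xa2 - xa1)
    angle_b = math.atan2(yb2 - yb1, xb2 - xb1)
    return angle_a, angle_b

def looks_parallel(features, tolerance=1e-6):
    """With the angle features, 'parallel' is just 'angles are close'."""
    a, b = features
    return abs(a - b) < tolerance

# Sample 1 from the post: angles 0.785 and 0.785
print(looks_parallel(angle_features(0, 0, 1, 1, 1, 0, 2, 1)))   # True
# Sample 3 from the post: angles 0.785 and 1.107
print(looks_parallel(angle_features(0, 0, 2, 2, 2, 0, 4, 4)))   # False
```

A model fed these two features barely needs to learn anything; the hard geometry has been done in the feature calculation.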
Thus, before we build models, a great deal of time should be spent on constructing feature variables that are truly predictive of the target variable. This is true for any application, such as fraud detection, credit risk, click-through rate prediction, etc.
Some models, such as neural nets and SVMs, are able to implicitly re-represent the input variables and automatically figure out the best feature variables during the learning process. For example, some researchers claim that a neural net can approximate the atan function in its hidden neurons during training. This will be the subject of another post. From a data mining practitioner's perspective, nothing prevents us from taking advantage of our domain knowledge and calculating the feature variables directly.
Sunday, January 06, 2013
Split data into multiple tables using multi-table insert
With Oracle's multi-table insert, we can conveniently store data into multiple tables using a single select statement.
For example, the following insert statement distributes the records in TBL_STOCK_QUOTES into three tables, T_A2K, T_L2S and T_T2Z, based on the first letter of the stock symbol.
insert first
  when substr(symbol,1,1) between 'A' and 'K' then
    into T_A2K (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
    values (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
  when substr(symbol,1,1) between 'L' and 'S' then
    into T_L2S (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
    values (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
  else
    into T_T2Z (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
    values (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
select SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME
from TBL_STOCK_QUOTES;
We can verify the result after the insert.
SQL> select min(symbol), max(symbol) from T_A2K;
MIN(SYMBOL) MAX(SYMBOL)
-------------------------------- --------------------------------
AAIT KYO
SQL> select min(symbol), max(symbol) from T_L2S;
MIN(SYMBOL) MAX(SYMBOL)
-------------------------------- --------------------------------
LANC SWI
SQL> select min(symbol), max(symbol) from T_T2Z;
MIN(SYMBOL) MAX(SYMBOL)
-------------------------------- --------------------------------
TAX ZN
Saturday, January 05, 2013
Watch out for invisible characters.
When text files (or CSV files, where the field delimiter is a comma) are loaded into a table, the columns sometimes contain weird, invisible characters. They cause unexpected results when we compare values or convert strings to numbers. The Oracle function utl_raw.CAST_TO_RAW is a great way to look inside text strings and uncover those invisible characters.
In one project, we used an ETL tool to move financial transactions into a data warehouse and found that, in a very small number of cases, the data showed discrepancies when we compared it before and after the move. However, the data looked identical visually. Using the utl_raw.CAST_TO_RAW function, we were able to pinpoint the discrepancies caused by invisible characters and fix the problem.
For example, there is a table containing text strings about stock prices. The following query shows one record containing Symbol, Date, Open, High, Low, Close, Volume, and Adj Close.
SQL> select quotes from TBL_SYMBOL_QUOTES_CUR a where rownum <2;
QUOTES
----------------------------------------------------------------
APEI,2013-01-02,36.89,37.34,36.59,36.96,198700,36.96
The above query result looks OK. But the text string actually contains an invisible newline character (hex value 0A, decimal value 10). We can use utl_raw.CAST_TO_RAW to uncover it.
SQL> select utl_raw.CAST_TO_RAW(quotes) quotes_raw from TBL_SYMBOL_QUOTES_CUR a where rownum <2;
QUOTES_RAW
------------------------------------------------------------------------------------------------------------------------------------
415045492C323031332D30312D30322C33362E38392C33372E33342C33362E35392C33362E39362C3139383730302C33362E39360A
The standard ASCII table gives the numerical representation of each character. We can remove the unwanted character using the Oracle replace function; chr(10) denotes the character with decimal value 10 (hex value 0A), i.e., the newline character.
SQL> select replace(quotes,chr(10)) from TBL_SYMBOL_QUOTES_CUR a where rownum <2;
The following query shows that the newline character has been removed.
SQL> select utl_raw.CAST_TO_RAW(replace(quotes,chr(10))) from TBL_SYMBOL_QUOTES_CUR a where rownum <2;
UTL_RAW.CAST_TO_RAW(REPLACE(QUOTES,CHR(10)))
------------------------------------------------------------------------------------------------------------------------------------
415045492C323031332D30312D30322C33362E38392C33372E33342C33362E35392C33362E39362C3139383730302C33362E3936
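Outside the database, the same inspection can be done in a few lines of Python (an illustrative sketch mimicking utl_raw.CAST_TO_RAW; the helper name and the shortened sample string are mine):

```python
def to_hex(s):
    """Mimic utl_raw.CAST_TO_RAW: dump a string as uppercase hex bytes."""
    return s.encode("ascii").hex().upper()

row = "APEI,36.96\n"            # a shortened quote row with a hidden newline
print(to_hex(row))               # the trailing 0A reveals the newline

cleaned = row.replace("\n", "")  # same idea as replace(quotes, chr(10))
print(to_hex(cleaned))           # no 0A at the end anymore
```

Comparing the two hex dumps makes the invisible character obvious, just as in the Oracle queries above.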
Thursday, January 03, 2013
The first principle of data analytics is to avoid making mistakes (continued)
The importance of avoiding mistakes in data mining (or in any work, in general) cannot be overemphasized. When mistakes happen, wrong (and sometimes ridiculous) conclusions are drawn and the credibility of the analysts is severely damaged.
One common source of mistakes is survivorship bias. The following are some examples.
1. To study the average 10-year stock return, we collect the price history of stocks currently on the market. However, because many companies went under during those 10 years and their data are not included in the analysis, the calculated average return will be higher than the actual return.
2. Suppose it is uncovered that, 12 months ago, a data breach affected some credit cards of a card issuer. To measure the impact of the breach, analysts take the currently active credit cards, find those affected by the breach, and measure their fraud rate. They may find that the fraud rate is surprisingly low. This is because the cards with fraudulent activity reported by cardholders have already been closed and purged from the active portfolio.
Stocks of companies that went under, accounts closed due to nonpayment, credit cards closed due to fraud, churned customers, etc., are data corpses. To avoid survivorship bias, it is important to collect the complete data, including those data corpses. This is easier said than done, as in reality data corpses are regularly removed and hard to collect.
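A tiny simulation illustrates the effect (a toy Python sketch with made-up failure rates and returns, not real market data):

```python
import random

random.seed(0)

# Hypothetical 10-year outcomes: a company either goes under (total loss,
# return -1.0, a "data corpse") or survives with some positive return.
population = []
for _ in range(10_000):
    if random.random() < 0.2:            # assume 20% of companies fail
        population.append(-1.0)
    else:
        population.append(random.uniform(0.0, 2.0))

# The survivor-only view: data corpses are missing from the data set.
survivors = [r for r in population if r > -1.0]

true_avg = sum(population) / len(population)
survivor_avg = sum(survivors) / len(survivors)
print(f"average over all companies:   {true_avg:.2f}")
print(f"average over survivors only:  {survivor_avg:.2f}")
```

The survivor-only average comes out noticeably higher than the true average, which is exactly the bias described in example 1.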
Tuesday, January 01, 2013
The first principle of data analytics is to avoid making mistakes (continued)
Mistakes can happen when we deploy models that work well in our "lab environment" into production. We need to anticipate unseen situations ahead of time and deal with them. For example, the following are some common challenges:
1. Unseen categorical variable values. For example, in the training data the state variable never takes the value RI; in production, RI appears in the data. One way to deal with this is to assign all unseen codes to the most frequent category in our model scripts. That way, the model will run as we expect.
2. Continuous variables out of bounds. For example, in the training data the credit card purchase amount falls between $0.01 and $22,000, but in production a purchase amount could be $56,000. If we do not handle the purchase amount properly, the model will generate extreme scores for such transactions. It is a good idea to winsorize the variable; "winsorize" is a fancy word for clipping the extreme values.
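Both defenses can be sketched in a few lines of Python (illustrative only; the helper names, the training range, and the "CA" default category are assumptions, not part of any real model):

```python
def winsorize(value, low, high):
    """Clip a value to the range seen in the training data."""
    return max(low, min(high, value))

# Purchase amounts seen in training ranged from $0.01 to $22,000.
TRAIN_MIN, TRAIN_MAX = 0.01, 22_000.0
print(winsorize(56_000.0, TRAIN_MIN, TRAIN_MAX))  # clipped to 22000.0
print(winsorize(150.0, TRAIN_MIN, TRAIN_MAX))     # unchanged: 150.0

# Categories seen in training; anything else maps to the most
# frequent training category (here assumed to be "CA").
KNOWN_STATES = {"MA", "NY", "CA"}

def map_state(state, default="CA"):
    """Route unseen codes to the most frequent training category."""
    return state if state in KNOWN_STATES else default

print(map_state("RI"))  # unseen in training -> "CA"
```

Putting these guards into the scoring script means production data outside the training envelope degrades gracefully instead of producing extreme scores or errors.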
Saturday, December 22, 2012
The first principle of data analytics is to avoid making mistakes (continued)
Mistakes may happen when data or processes flow from one system to another. Sometimes, mistakes are caused by flaws in the analytic methodologies themselves.
A common mistake is to think in terms of absolute numbers rather than ratios. For example, analysts see a few hundred fraudulent credit card transactions at a particular online retailer's website and conclude that the retailer is very risky. This may not be true: if the retailer has tens of millions of normal transactions within the same period, the ratio of fraudulent transactions is very low. We have seen on a number of occasions that people use the number of frauds, instead of the fraud rate, as the measure of risk. As a result, they create fraud detection rules that generate a lot of false alarms.
Friday, December 21, 2012
The first principle of data analytics is to avoid making mistakes
In the previous post How to build predictive models that win competitions, we showed examples of actual mistakes made by statisticians and analysts. It is obvious that delivering something wrong causes irrecoverable damage to one's credibility. Mistakes can happen particularly when data or processes flow from one system to another:
1. When data files are provided by clients. Customers make mistakes too: their database people may use the wrong scripts and fail to extract the data that we want.
2. When data are loaded from text files into databases or other systems. A lot of things can go wrong during data loading: issues like missing value representation, number formats, date formats, unrecognized characters, etc. will surface. Thus, if possible, using text files to store data should be avoided.
3. When reports are generated. Many analysts perform data analytics using tools like SAS, R, SPSS, S-PLUS, Oracle Data Mining, etc. They then summarize the results manually in an Excel spreadsheet and make the final PowerPoint presentation. Things can go wrong here: decimal points are placed in the wrong position, dates are not updated.
4. When models are deployed into production systems. There are a lot of moving parts involved, and it is a challenge to precisely replicate the offline data analytics process in a different production system.
Tuesday, December 18, 2012
How to build predictive models that win competitions.
The predictive models we produced have won head-to-head competitions and been chosen by clients. For example, our model predicting the risk that a new customer will not pay his/her phone bill was used by a top-3 cell phone company, and our debit card fraud detection model is being used by a top-15 bank. Our competitors included one of the three credit bureaus, which had hundreds of statisticians working for it.
We have found that if we follow a number of principles, we will be able to produce good predictive models and quality data analytics work in general.
The first principle is to avoid making mistakes. We have seen many cases where mistakes damaged the reputation and credibility of data analysts. For example, statisticians analyzed data and drew the conclusion that compromised credit cards were less likely to show fraudulent activities than normal cards. It was a mistake, of course. One of our competitors loaded a customer's data incorrectly and produced reports whose numbers did not make sense at all. Mistakes like these cause immediate rejection from clients and permanent damage to an analyst's reputation.
We need to realize that avoiding mistakes should be an inherent part of our process. In a large project spanning data gathering, loading, validation, summarization, model building, report generation and model deployment, we may have to take many steps and produce hundreds of data sets. To avoid making mistakes, we need to double-check our results. We have found that it is actually much harder to verify that the results are correct than to simply execute the steps that produce them.
Thus, we always perform two tasks: 1. produce the result; 2. verify the result. It is better to spend more time producing correct results than to quickly deliver something wrong that causes irrecoverable damage to our credibility. We will talk more about avoiding mistakes in the post The first principle of data analytics is to avoid making mistakes.
Friday, December 14, 2012
Oracle NTILE function
The Oracle NTILE function divides records into buckets containing the same number of rows (off by at most 1). Notice that records with the same variable value (ties) may be placed into different buckets. The following query result shows that the maximum value in one bucket may overlap with the minimum value in the next bucket.
SQL> with tbl as (select ntile(5) over(order by price) nt, price from STOCK_PRICE) select nt, count(1), min(price), max(price) from tbl group by nt order by nt;
NT COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
1 36387 9.58 13.68
2 36387 13.68 16.72
3 36386 16.72 20.87
4 36386 20.87 26.87
5 36386 26.87 89.53
We can also use the cume_dist function combined with the ceil function to divide records into buckets of similar size. Ties will be placed in the same bucket. The following query result shows that there is no overlap of values between buckets.
SQL> with tbl as (select ceil(cume_dist() over(order by price nulls first)*5) nt, price from MV_VIX_15S_AVG) select nt, count(1), min(price), max(price) from tbl group by nt order by nt;
NT COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
1 36358 9.58 13.675
2 36332 13.68 16.71
3 36413 16.72 20.86
4 36429 20.87 26.865
5 36400 26.87 89.53
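The ceil(cume_dist() * n) trick is easy to mimic outside the database; the following Python sketch (toy prices, hypothetical helper name) shows that tied values always share a bucket:

```python
import math

def cume_dist_buckets(values, n_buckets):
    """Assign each value the bucket ceil(cume_dist * n_buckets).

    cume_dist of v is the fraction of rows with value <= v,
    so equal values always get the same bucket.
    """
    size = len(values)
    buckets = {}
    for v in values:
        cd = sum(1 for x in values if x <= v) / size
        buckets.setdefault(math.ceil(cd * n_buckets), []).append(v)
    return buckets

prices = [9, 9, 10, 11, 12, 13, 14, 15, 16, 20]
for nt, vals in sorted(cume_dist_buckets(prices, 5).items()):
    print(nt, sorted(vals))   # the two 9s land together in bucket 1
```

Bucket sizes are only approximately equal (ties can inflate a bucket), which matches the slightly uneven counts in the query output above.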
Tuesday, December 11, 2012
More on Data Preparation for Logistic Regression Models
In the earlier post Data preparation for building logistic regression models, we talked about converting data into a more compact format so that we can use memory-intensive software like R to handle a large number of cases.
Another issue that we commonly face is how to deal with "unbalanced" data. The following are some examples:
1. Only 7 to 12 out of 10,000 card transactions are fraudulent.
2. Online ads get only 1 to 3 clicks per 10,000 impressions.
3. The mobile phone account non-payment rate is 12%.
4. It is estimated that 6% of medical claims are fraudulent.
To reduce the data volume without sacrificing model accuracy, it makes sense to reduce the cases in the majority classes (good card transactions, non-click impressions, good mobile phone accounts, normal medical claims) through random sampling; we can use the approach described in the earlier post More on random sampling in Oracle. For example, alongside 1,000 fraudulent bank card transactions gathered within a certain period, there may be 10 million good transactions from the same period. However, the final model training data set does NOT have to be perfectly balanced, i.e., fraud and good transactions do NOT have to be 50%/50%. It is fine for the training data set to contain 80% good transactions and 20% fraudulent ones. To see why, consider two logistic regression models:
Model A. training set: 50% good, 50% fraud
Model B. training set: 80% good, 20% fraud
Models A and B will produce different probabilities of fraud for an individual transaction (the probability produced by Model A will most likely be higher than that produced by Model B). However, the relative ranks of the transactions' fraud probabilities over a whole data set could be the same for both models; for example, the top 1% riskiest transactions identified by both models may be identical. In that sense, the two models are equivalent. In many applications, what matters is the relative rank (of being fraudulent, of non-payment, etc.) produced by the predictive model. The quality and usefulness of the relative ranks can be depicted by gain charts.
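The downsampling step can be sketched as follows (a toy Python sketch with made-up counts; it keeps every fraud and samples good transactions at a 4:1 ratio to get the 80/20 mix mentioned above):

```python
import random

random.seed(1)

# Hypothetical transactions: (id, is_fraud). 10,000 good, 20 fraudulent.
transactions = [(i, False) for i in range(10_000)] + \
               [(10_000 + i, True) for i in range(20)]

frauds = [t for t in transactions if t[1]]
goods = [t for t in transactions if not t[1]]

# Keep all frauds; sample goods at 4:1 -> 80% good / 20% fraud.
sampled_goods = random.sample(goods, 4 * len(frauds))
training_set = frauds + sampled_goods

fraud_rate = sum(t[1] for t in training_set) / len(training_set)
print(len(training_set), fraud_rate)   # 100 rows, 20% fraud
```

In practice the class mix is a tuning choice: a heavier downsample shrinks the training set, while the resulting probability shift can be corrected afterwards if calibrated probabilities, and not just ranks, are needed.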
Sunday, December 02, 2012
Build a Scorecard in Oracle Using Logistic Regression Model
One of the popular methods of assessing credit risk is a scorecard. To build a scorecard manually, analysts first identify a number of relevant variables, e.g., having bankruptcies in the past, number of open accounts, etc. Each variable is divided into meaningful segments based on its value, and a point value is assigned to each segment. The credit score is the sum of the points over all the variables. For example, a scorecard may look like the following:
Variable | Segment | Point
Having Bankruptcy | NO | 95
Having Bankruptcy | YES | 15
Number of Open Accounts | from 1 to 3 | 30
Number of Open Accounts | from 4 to 8 | 50
Number of Open Accounts | 9 or more | 40
...................................................
Assume a person's credit history shows a past bankruptcy, 4 open accounts and other information. His credit score will be calculated as 15+50.....
Determining the points manually is subjective. A much better way is to build a logistic regression model and use the model coefficients as the "points". A scorecard created this way is accurate and yet still provides the transparency of a scorecard. One top cell phone service provider actually used our model-derived scorecard to screen new customers.
In example 2 of the post Build Predictive Models Using PL/SQL, we show how to build a logistic regression model in Oracle. To extract the model coefficients, we simply use the following query:
select attribute_name, attribute_value, coefficient from table(select dbms_data_mining.get_model_details_glm('GLM_MODEL') from dual);
(Replace 'GLM_MODEL' with the real model name.)
We may want to apply some transformation to the coefficients so that the score falls within the desired range.
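Such a transformation can be sketched in Python (purely illustrative; the coefficients, attribute names, intercept handling and 0-100 rescaling below are made up, not output from a real model):

```python
# Hypothetical (attribute_name, attribute_value) -> coefficient pairs,
# in the shape returned by get_model_details_glm; numbers are invented.
COEFFICIENTS = {
    ("HAS_BANKRUPTCY", "NO"): 1.6,
    ("HAS_BANKRUPTCY", "YES"): -1.4,
    ("OPEN_ACCOUNTS", "1-3"): 0.3,
    ("OPEN_ACCOUNTS", "4-8"): 0.9,
    ("OPEN_ACCOUNTS", "9+"): 0.5,
}

def scorecard_points(coef, lo=0, hi=100):
    """Rescale raw coefficients linearly into a lo..hi point range."""
    cmin, cmax = min(coef.values()), max(coef.values())
    return {k: round((v - cmin) / (cmax - cmin) * (hi - lo) + lo)
            for k, v in coef.items()}

def score(applicant, points):
    """Scorecard score = sum of the points of the applicant's segments."""
    return sum(points[(attr, val)] for attr, val in applicant.items())

points = scorecard_points(COEFFICIENTS)
applicant = {"HAS_BANKRUPTCY": "YES", "OPEN_ACCOUNTS": "4-8"}
print(score(applicant, points))
```

The linear rescaling preserves the ordering of the coefficients, so the scorecard ranks applicants the same way the underlying logistic regression does.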
Thursday, November 22, 2012
A Real World Case Study: Business Rule vs Predictive Model
The following is a true story that complements the earlier posts Comparison of Business Rules and Predictive Models and Predictive Modeling vs Intuitive Business Rules.
A few years ago, we built a new customer acquisition model for a cell phone service provider based on its historical application and payment data. The model calculated a risk score for each cell phone service applicant using information found in his/her credit reports. The higher the score, the higher the risk that a customer will not pay his/her bill.
A few weeks after the model went live, we received an angry email from a manager at the client company. In the email, the manager gave a list of applicants who had several bankruptcies. According to the manager, they should be high-risk customers; however, our model gave them average risk scores. He questioned the validity of the model.
We explained that the model score was based on 20 or so variables, not bankruptcies alone. We also analyzed people with bankruptcies in the data that we used to build the model and found that they paid their bills on time. It might be that people with bankruptcies are more mobile and thus depend more on cell phones for communication. They may not be good candidates for a mortgage, but from a cell phone service provider's perspective, they are good customers.
The bottom line: data-driven predictive models are more trustworthy than intuition-driven business rules.
Wednesday, November 07, 2012
From Oracle SQL Developer to Data Miner
As mentioned in an earlier post How easy is it for a SQL developer/analyst to become a data miner?, if we consider data mining as just more sophisticated SQL queries, then data mining is simply a natural extension of a SQL programmer/analyst's daily job, i.e., querying the database. For example, a SQL programmer can write a query to answer a business question: find people living in Boston, MA with an annual income of at least $50K and a credit score of 750 or higher. One simply puts the three conditions in the "where" clause as follows:
select person_id, person_first_name, person_last_name
from tbl_data_about_people
where
home_city='Boston' and home_state='MA' and
annual_income>50000 and
credit_score >=750;
Similarly, we can put a predictive function in the "where" clause to answer a more useful question: find people living in Boston, MA who are at least 25% likely to purchase a new car within 2 months:
select person_id, person_first_name, person_last_name
from tbl_data_about_people
where
home_city='Boston' and home_state='MA' and
prediction_probability(model_likely_to_buy, 1 using *)>=0.25;
We can see how easy and powerful it is to use predictive models in SQL. A traditional Oracle SQL developer can greatly enhance his/her value by picking up these data mining functions. There are a number of good documents and code samples available on Oracle's site, e.g., The Data Mining Sample Programs and Oracle® Data Mining Concepts.
Tuesday, November 06, 2012
Major Data Mining Steps
Normally, we need to go through the following steps to build a predictive modeling solution.
- Data Gathering
- Data Validation
- Data Preparation
- Feature Variable Calculation (creating more salient variables that are more predictive of target)
- Predictive Model Building and Testing
- Model Deployment
In our opinion, the main challenges are data preparation, feature variable calculation and model deployment. Probably 90% of the time is spent in these three areas.
Saturday, October 27, 2012
Take Advantage of Materialized View
The materialized view is one of the most useful features for data processing in Oracle. We suggest that materialized views be used whenever possible.
To create a materialized view, we use "create materialized view ... as select ...", just as when we create a view or table. For example, the following SQL script creates a materialized view that calculates the following bank card summary information for each retailer terminal: the number of transactions, the amount of the transactions and the number of unique cards.
create materialized view mv_retailer_terminal_sts
as
select
  terminal_id,
  retailer_id,
  terminal_city,
  terminal_state,
  terminal_country,
  count(1) number_of_txn,
  sum(tran_amt) amount_of_transaction,
  count(distinct card_key) number_of_cards
from credit_card_txn_detail a
group by
  terminal_id,
  retailer_id,
  terminal_city,
  terminal_state,
  terminal_country;
The materialized view captures a snapshot of the "from" tables. It can be refreshed later, either manually on demand or on a defined schedule.
For example, we can refresh a materialized view manually using the following command.
execute DBMS_MVIEW.REFRESH('mv_retailer_terminal_sts','C');
('C') means complete refresh.
One advantage of using a materialized view is that the SQL query used to produce it can be retrieved with the following query:
select query from user_mviews where mview_name=upper('mv_retailer_terminal_sts');
The difference between a view and a materialized view is that a materialized view creates a physical table, so a materialized view is faster to query than a view. We therefore suggest using materialized views in place of plain tables when possible.
Sunday, October 21, 2012
Oracle Normal Distribution Function
The Oracle function dbms_random.normal generates standard normally distributed numbers with mean zero and standard deviation one.
SQL> select dbms_random.normal from dual;
NORMAL
----------
.781195382
The following query generates 6,000 random numbers using dbms_random.normal and then calculates their mean and standard deviation. We can see that the mean is close to zero and the standard deviation close to one.
SQL> select avg(v), stddev(v), count(1) from (select dbms_random.normal v from V_6K_OBS);
AVG(V) STDDEV(V) COUNT(1)
---------- ---------- ----------
-.00110804 1.00954138 6000
Wednesday, October 03, 2012
More on random sampling in Oracle
The earlier post Random Sampling using SQL uses dbms_random.random to generate random numbers. A better function is dbms_random.value, which generates a uniformly distributed random number greater than or equal to 0 and less than 1.
We can rewrite the random sampling scripts as follows.
Step 1. Create random numbers.
create table tbl_app_id_w_rnd as select app_id,
dbms_random.value rnd from CELL_PHONE_SERVICE_APPS;
Step 2. Create a view that contains 30% of the records.
create view v_data_testing
as select a.* from CELL_PHONE_SERVICE_APPS a,
tbl_app_id_w_rnd b where a.app_id=b.app_id and rnd <=0.3;
Saturday, September 29, 2012
Find the smallest or largest value in SQL
In Oracle, we can use the least function to find the smallest among many values; the number of input arguments is flexible. Similarly, to find the largest among many values, we use the greatest function. The following are examples.
select least(0,1,2,3,4,4,5,6,7,9) from dual;
--------------------------
0
select greatest(0,1,2,3,4,4,5,6,7,9) from dual;
-----------------------------
9
The following example shows that the least/greatest functions take multiple columns as inputs and return a result for each row. The min/max aggregate functions take only one column as a parameter and return a single result over all the rows.
select p0, p1,p2, least(p0,p1,p2), greatest(p0,p1,p2) from TBL_TEST_DATA where rownum < 10;
P0 P1 P2 LEAST(P0,P1,P2) GREATEST(P0,P1,P2)
---------- ---------- ---------- --------------- ------------------
1086.62 1089.03 1084.38 1084.38 1089.03
1214.19 1213.58 1200.94 1200.94 1214.19
1268.42 1263.25 1257.68 1257.68 1268.42
1505.11 1511.73 1515.61 1505.11 1515.61
758.84 764.72 753.37 753.37 764.72
1521.24 1532.53 1528.63 1521.24 1532.53
1184.88 1181.19 1185.01 1181.19 1185.01
1260.55 1266.51 1273.23 1260.55 1273.23
1483.4 1484.26 1469.9 1469.9 1484.26
select min(p0), min(p1), min(p2), max(p0), max(p1), max(p2) from MV_SP_VIX_CORR where rownum < 10;
MIN(P0) MIN(P1) MIN(P2) MAX(P0) MAX(P1) MAX(P2)
---------- ---------- ---------- ---------- ---------- ----------
758.84 764.72 753.37 1521.24 1532.53 1528.63
Thursday, September 20, 2012
Comparison of Business Rules and Predictive Models
In many organizations, there are two ways to support decision making: intuitive business rules and predictive models. Business analysts build intuitive rules; statisticians/data modelers use data-driven methods, mainly predictive models. The following is a comparison of the two from 13 different aspects. You may want to check out an earlier post.
| Consideration | Business Rules | Predictive Models |
|---|---|---|
| Examples | Example 1. A medical fraud rule: treatments for pregnancy discomfort for a male patient are fraudulent. Example 2. A direct sale rule: customers who purchase product A within 3 months will likely buy product B. Example 3. A mobile phone customer acquisition rule: if a new customer has a bankruptcy in the past, do not offer him a free phone. | Example 1. A medical fraud detection model: an unsupervised model that detects anomalies in medical claims. Example 2. A credit card fraud detection model: a neural net that generates a fraud score for credit card transactions. Example 3. A customer acquisition model: a logistic regression model that predicts whether a new customer will pay her cell phone bill. |
| Owners | Business analysts who have experience in particular fields such as bank risk management, healthcare, etc. In their communications, they use terms that are understood by business people. They believe that there are explainable cause-effect relationships in predicting fraud, etc., and that their job is to identify those rules. | Statisticians/data miners see predictive models as solutions to a wide range of business problems. They believe that in reality the cause-effect relationship is too complex to be understood by human beings, and that a black-box, data-driven predictive model is the way to make the most accurate predictions. |
| Understandability | High. Rules are normally intuitive and easy to understand. | Low. Most predictive models are black boxes. |
| Forms of Existence | SQL scripts, SAS scripts, Excel Spreadsheet, Word document, in people's head and not documented, etc. | SQL scripts, SAS scripts, Excel Spreadsheet, R objects, Oracle mining objects, etc. |
| Initial Investment | Low. | High. Need to build a data warehouse that archives historical data. May need to purchase data mining software such as SAS, Statistica, etc. |
| Effort to Build | Low. Analysts with domain knowledge can write rules without doing heavy data analysis work. | High. Data miners have to go through the modeling process systematically to build predictive models. |
| Effort to Deploy into Production | Low. Business rules are usually simple scripts such as SQL. | High. Predictive models, including the input variable calculations, can be very sophisticated. Deploying a model into production can take a significant amount of effort. |
| Quantity | Many. For example, many banks or insurance companies have hundreds or thousands of business rules. | Usually the number of predictive models is very small. For example, even some large banks or insurance companies have only 2 or 3 predictive models to detect fraud. |
| Perception of Productivity and Effectiveness | High, for two reasons: 1. there are hundreds or thousands of understandable rules; 2. new rules can be created on a daily basis. This gives the management the impression that the group that builds them is productive and effective. | Medium. It may take many days or weeks to build a predictive model before one can see any result. Because a predictive model is mostly a black box, it is hard to explain to the management how it works. Thus, the management may think the model building group is not productive. |
| Actual Productivity and Effectiveness | Low to medium. See an earlier post. Since rules are usually created individually, there are redundancies among them, and the rules as a whole are not optimized. | Highly productive and effective. It is common to see a single predictive model outperform hundreds or thousands of intuitive business rules combined. |
| Tools | Excel spreadsheet, SQL, special scripting language, and others | SAS, SPSS, R, Oracle data mining, and others |
| Adaptability | Low. Since rules are created manually by people based on their experience, they are hard to adapt to changing situations. | High. Predictive models can be refreshed once new data become available, and the refreshed model reflects the changes automatically. |
| Maintenance | Difficult. It is hard to maintain hundreds or thousands of rules. It is helpful to rank them based on their effectiveness. | Easier. Since the number of predictive models is small, they are easier to maintain. |
| Longevity | Low. Business rules usually lose their effectiveness more quickly after creation than predictive models do. | Medium to high. Predictive models maintain their effectiveness longer than business rules for two reasons: 1. they are built on large quantities of historical data; 2. predictive models have the ability to generalize, i.e., they do not simply memorize historical data but are structured to perform well on future data. |
Friday, September 14, 2012
Data preparation for building logistic regression models
Logistic regression is a popular model for predicting a binary outcome, e.g., fraud or not, customer response or not, etc. When building a logistic regression model, there are two ways to represent the data. The first is the transactional format shown below, where each record is a single transaction and the target variable is binary.
Data in transactional format
USER_ID CAT STATE_CD CAMPAIGN_DATE CAMPAIGN_CD RESPONSE
1001 A MA 01-MAY-12 A N
1001 A MA 08-MAY-12 A N
1001 A MA 15-MAY-12 A Y
1001 A MA 22-MAY-12 A N
1001 A MA 29-MAY-12 A N
1001 A MA 06-JUN-12 A N
1001 B CT 06-JUN-12 A N
1002 B CT 01-MAY-12 A N
1002 B CT 08-MAY-12 A N
1002 B CT 15-MAY-12 A Y
1002 B CT 22-MAY-12 A N
1002 B CT 29-MAY-12 A Y
If all the independent variables are categorical, we can convert the data from the transactional format into a more compact one by summarizing it with a SQL script similar to the following. We count the numbers of responses and non-responses for each unique combination of independent variable values. If we want, continuous variables can first be converted into categorical ones using techniques such as binning.
select cat, state_cd, campaign_cd,
sum(case when response='Y' then 1 else 0 end) num_response,
sum(case when response='N' then 1 else 0 end) num_no_response
from tbl_txn group by cat, state_cd, campaign_cd;
Data in the summary format
CAT STATE_CD CAMPAIGN_CD NUM_RESPONSE NUM_NO_RESPONSE
A MA A 125 1025
B CT C 75 2133
..........................
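The same group-by counting can be done outside the database. A minimal Python sketch of the summarization step, using a few made-up transactions shaped like the table above:

```python
from collections import defaultdict

# Hypothetical transactional records: (cat, state_cd, campaign_cd, response).
txns = [
    ("A", "MA", "A", "N"),
    ("A", "MA", "A", "Y"),
    ("A", "MA", "A", "N"),
    ("B", "CT", "A", "Y"),
    ("B", "CT", "A", "N"),
]

# Count responses and non-responses per unique combination of the
# categorical independent variables, like the SQL GROUP BY above.
summary = defaultdict(lambda: [0, 0])  # key -> [num_response, num_no_response]
for cat, state, campaign, resp in txns:
    key = (cat, state, campaign)
    if resp == "Y":
        summary[key][0] += 1
    else:
        summary[key][1] += 1
```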
Summarizing the data first can greatly reduce its size and save memory when building the model. This is particularly useful if we use memory-based modeling tools such as R.
If we use R to build the logistic regression model on training data in the transactional format, the script is similar to the following.
glm(formula=RESPONSE~CAT+STATE_CD+CAMPAIGN_CD,
data=train.set1,family = binomial(link = "logit")) ->model1
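Under the hood, glm() dummy-codes each categorical predictor into indicator columns (treatment coding, with the first level as the reference). A minimal Python sketch of that encoding, for illustration only:

```python
def dummy_code(values):
    """Treatment-code a categorical variable: one 0/1 indicator column
    per non-reference level, with the first (sorted) level as reference."""
    levels = sorted(set(values))
    return [[1.0 if v == lvl else 0.0 for lvl in levels[1:]] for v in values]

# E.g., the CAT column from the sample data above:
encoded = dummy_code(["A", "A", "B", "B"])
# "A" (the reference level) maps to [0.0]; "B" maps to [1.0].
```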
The R script for building a logistic model based on the summary data is shown below.
glm(formula=
cbind(NUM_RESPONSE,NUM_NO_RESPONSE) ~CAT+STATE_CD+CAMPAIGN_CD,
data=train.set1,family = binomial(link = "logit")) ->model2
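The two fits are equivalent because the sum of Bernoulli log-likelihoods over the individual transactions in a cell equals the count-weighted form used with cbind(NUM_RESPONSE, NUM_NO_RESPONSE), up to a constant that does not affect the fitted coefficients. A small Python sketch of that identity (the probability 0.4 and the outcome vector are made up for illustration):

```python
import math

def loglik_transactional(p, outcomes):
    """Sum of Bernoulli log-likelihoods over individual 0/1 outcomes."""
    return sum(math.log(p) if y else math.log(1 - p) for y in outcomes)

def loglik_summary(p, num_yes, num_no):
    """Count-weighted log-likelihood, as used with the summary format."""
    return num_yes * math.log(p) + num_no * math.log(1 - p)

outcomes = [1, 0, 0, 1, 0]  # hypothetical responses in one summary cell
a = loglik_transactional(0.4, outcomes)
b = loglik_summary(0.4, num_yes=2, num_no=3)
# a and b are equal, so both data layouts yield the same fitted model.
```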