Tuesday, January 15, 2013

How to build predictive models that win competitions. (continued)

To build predictive models that are accurate and robust, it is crucial to find the input variables, or so called feature variables, that are truly predicative of the target variable. This is even more important than choosing types of predictive models. Good feature variables can greatly reduce the complexity of the modeling learning.

For example, suppose we want to build a model that predicts if a pair of vectors are parallel to each other. This is illustrated in the following chart.


We can represent the pair of vectors by all of their coordinates for the starting and ending points, i.e.,(Xa1,Ya1,Xa2,Ya2,Xb1,Yb1,Xb2,Yb2)

Suppose we have a few hundred of training samples (pairs of vectors) that are marked as parallel or not. The following are some examples.

Sample 1: variables (0,0,1,1,1,0,2,1) target: yes (parallel)
Sample 2: variables (0,0,2,2,2,0,4,2) target: yes (parallel)
Sample 3: variables (0,0,2,2,2,0,4,4) target: no (parallel)
......

However, it would be extremely challenging if we use (Xa1,Ya1,Xa2,Ya2,Xb1,Yb1,Xb2,Yb2) as input variables to build predictive models. Probably, most of the models will perform very poorly and are useless.

If we realize that we can compare the angles of the two vectors to decide if they are parallel, we can build two new variables using atan functions as the following.
(atan((Ya2-Ya1)/(Xa2-Xa1)), atan((Yb2-Yb1)/(Xb2-Xb1))

The atan function is the inverse tangent. Using the new variables, the above training samples will be:

Sample 1: variables (0.785, 0.785) target: yes (parallel)
Sample 2: variables (0.785, 0.785) target: yes (parallel)
Sample 3: variables (0.785, 1.107) target: no (parallel)
......

This representation greatly simplifies the problem. When the two numbers are close, the two vectors are parallel.

Thus, before we build models, a great deal of time should be spent on building feature variables that are truly predicative of the target variables. This is true for any applications such as fraud detection, credit risk, click through rate prediction, etc.

Some models, such as neural nets and SVM,are able to implicitly re-represent the input variables and automatically figure out the best feature variables during the learning process. For example, some researchers claim that a neural net is able to approximate the atan functions in its hidden neurons during training. This will be a subject of another post. From data mining practitioners' perspective, nothing prevents us from taking advantage of our domain knowledge and calculate the feature variables directly.

Sunday, January 06, 2013

Split data into multiple tables using multi-table insert

With Oracle multi-table insert, we can conveniently store data into multiple tables using a single select statment.

For example, The following insert statement puts records in TBL_STOCK_QUOTES into three tables based on the first letter of stock symbol, T_A2K,T_L2S and T_T2Z.

insert first
when substr(symbol,1,1) between 'A' and 'K' then
into T_A2K (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
values (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
when substr(symbol,1,1) between 'L' and 'S' then
into T_L2S(SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
else
into T_T2Z(SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
values(SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
select
SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME
from
TBL_STOCK_QUOTES;


We can verify the result after the inserting.
SQL> select min(symbol), max(symbol) from T_A2K;

MIN(SYMBOL) MAX(SYMBOL)
-------------------------------- --------------------------------
AAIT KYO

SQL> select min(symbol), max(symbol) from T_L2S;
MIN(SYMBOL) MAX(SYMBOL)
-------------------------------- --------------------------------
LANC SWI

SQL> select min(symbol), max(symbol) from T_T2Z;
MIN(SYMBOL) MAX(SYMBOL)
-------------------------------- --------------------------------
TAX ZN

Saturday, January 05, 2013

Watch out invisible characters.

When text files (or csv files when the filed delimiter is a comma) are loaded into a table, sometimes the columns contain weird and invisible characters. They cause unexpected results when we compare values or convert them from strings to numbers. Oracle function utl_raw.CAST_TO_RAW is a great way to look into the text strings and uncover those invisible characters.

In a project, we used ETL tool to move financial transactions into a data warehouse and found that in a very small number of cases the data showed discrepancies when we compared the data before and after the movement. However, the data looked the same visually. By using utl_raw.CAST_TO_RAW function, we were able to pinpoint those discrepancies caused by invisible characters and fixed the problem.

For example, there is a table containing text string about stock prices. The following query shows one record containing symbol ,Date,Open,High,Low,Close,Volume, and Adj Close.
SQL> select quotes from TBL_SYMBOL_QUOTES_CUR a where rownum <2; 

QUOTES
---------------------------------------------------------------- 
APEI,2013-01-02,36.89,37.34,36.59,36.96,198700,36.96 
The above query result looks OK. But the text string actually contains an invisible character new line (hex value 0A or decimal value 10). We can use utl_raw.CAST_TO_RAW to uncover this.
SQL> select utl_raw.CAST_TO_RAW(quotes) quotes_raw from TBL_SYMBOL_QUOTES_CUR a where rownum <2; 

QUOTES_RAW 
------------------------------------------------------------------------------------------------------------------------------------ 
415045492C323031332D30312D30322C33362E38392C33372E33342C33362E35392C33362E39362C3139383730302C33362E39360A 
Here are the complete ASCII tables showing the numerical representation of characters. We can remove the unwanted characters using Oracle replace function. chr(10) means the character with decimal value 10 or hex value 0A or new line character.
<
SQL> select replace(quotes,chr(10)) from TBL_SYMBOL_QUOTES_CUR a where rownum <2;
The following query shows new line character is removed.
SQL> select utl_raw.CAST_TO_RAW(replace(quotes,chr(10))) from TBL_SYMBOL_QUOTES_CUR a where rownum <2; 

UTL_RAW.CAST_TO_RAW(REPLACE(QUOTES,CHR(10))) 
------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------- 
415045492C323031332D30312D30322C33362E38392C33372E33342C33362E35392C33362E39362C3139383730302C33362E3936 

Thursday, January 03, 2013

the First principle of data analytics is to avoid making mistakes (continued)

The importance of avoiding mistakes in data mining (or in any work in general) could never be overemphasized. When mistakes happen, wrong (and sometime ridiculous) conclusions are drawn and the credibility of analysts is severely damaged.

One of the common sources causing mistakes is survivor bias. The following are some examples.

1. When we study average 10 year stock return, we collect the price history of stocks that are on the market. However, because many companies went under within the 10 years and their data are not included in the analysis, the calculated average return would be higher than the actual return.

2. It was uncovered that 12 months ago a data breach affected some credit cards of a card issuer. To measure the impact of this data breach, analysts take the current active credit cards, find out those that are affected by the data breach, and measure their fraud rate. They may find that the fraud rate is surprisingly low. This is because those cards had fraudulent activities as reported by the cardholders are already closed and these cards are purged from the active portfolio.

Stocks for companies going under, closed accounts due to nonpayment, closed credit cards due to fraud, churned customers, etc., are data corpses. To avoid data survivor bias, it is important to collect the complete data, including those data corpses. This is easier said than done as in reality data corpses are regularly removed and hard to collect.

Tuesday, January 01, 2013

the First principle of data analytics is to avoid making mistakes (continued)

Mistakes can happen when we deploy the models that work well in our "lab environment" into production. We need to anticipate unseen situations ahead of time and deal with them. For examples, the following are some of the common challenges:

1. Unseen categorical variable values. For example, in the training data, state variable does not contain value RI. However, in the production, RI appears in the data. One way to deal with this is to assign all unseen codes to the most frequent category in our model scripts. That way, the model will run as we expect.

2. Continuous variables out of boundaries. For example, in the training data, the credit card purchase amount is within a range of $0.01 and $22,000. However, in production, the purchase amount could be $56,000. If we do not handle purchase amount properly, extreme scores will be generated by the model for transactions with purchase amount of $56,000. It is a good idea to winsorize the variable. "Winsorize" is a fancy word for clipping the extreme values.