Tuesday, January 15, 2013

How to build predictive models that win competitions. (continued)

To build predictive models that are accurate and robust, it is crucial to find input variables, or so-called feature variables, that are truly predictive of the target variable. This is even more important than choosing the type of predictive model. Good feature variables can greatly reduce the complexity of the learning task.

For example, suppose we want to build a model that predicts whether a pair of vectors is parallel, as illustrated in the following chart.


We can represent the pair of vectors by the coordinates of their start and end points, i.e., (Xa1, Ya1, Xa2, Ya2, Xb1, Yb1, Xb2, Yb2).

Suppose we have a few hundred training samples (pairs of vectors) that are labeled as parallel or not. The following are some examples.

Sample 1: variables (0,0,1,1,1,0,2,1) target: yes (parallel)
Sample 2: variables (0,0,2,2,2,0,4,2) target: yes (parallel)
Sample 3: variables (0,0,2,2,2,0,4,4) target: no (not parallel)
......

However, it would be extremely challenging to build predictive models using (Xa1,Ya1,Xa2,Ya2,Xb1,Yb1,Xb2,Yb2) directly as the input variables. Most models would perform very poorly and be practically useless.

If we realize that we can compare the angles of the two vectors to decide whether they are parallel, we can build two new variables using the atan function as follows:
(atan((Ya2-Ya1)/(Xa2-Xa1)), atan((Yb2-Yb1)/(Xb2-Xb1)))

The atan function is the inverse tangent. Using the new variables, the above training samples become:

Sample 1: variables (0.785, 0.785) target: yes (parallel)
Sample 2: variables (0.785, 0.785) target: yes (parallel)
Sample 3: variables (0.785, 1.107) target: no (not parallel)
......
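
To make this concrete, here is a minimal Python sketch (my own illustration, not part of the original post) that maps the raw eight-coordinate representation to the two angle variables:

import math

def angle_features(sample):
    # Unpack the raw representation: start and end points of vectors a and b.
    xa1, ya1, xa2, ya2, xb1, yb1, xb2, yb2 = sample
    # Angle of each vector, via the inverse tangent of its slope.
    # Caveat: this divides by zero for vertical vectors; math.atan2(dy, dx)
    # avoids that, though its angles then distinguish direction, so opposite
    # but parallel vectors would need to be compared modulo pi.
    theta_a = math.atan((ya2 - ya1) / (xa2 - xa1))
    theta_b = math.atan((yb2 - yb1) / (xb2 - xb1))
    return (theta_a, theta_b)

# The three training samples from above, in the raw representation.
for s in [(0, 0, 1, 1, 1, 0, 2, 1),
          (0, 0, 2, 2, 2, 0, 4, 2),
          (0, 0, 2, 2, 2, 0, 4, 4)]:
    print(angle_features(s))
# (0.785..., 0.785...), (0.785..., 0.785...), (0.785..., 1.107...)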

This representation greatly simplifies the problem: when the two angle values are (nearly) equal, the two vectors are parallel.
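
With these features, the decision rule itself collapses to a single comparison; a minimal sketch, with the tolerance picked arbitrarily for illustration:

def is_parallel(theta_a, theta_b, tol=1e-6):
    # Two vectors are parallel when their angles (nearly) coincide.
    return abs(theta_a - theta_b) < tol

print(is_parallel(0.785, 0.785))  # True
print(is_parallel(0.785, 1.107))  # False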

Thus, before we build models, a great deal of time should be spent constructing feature variables that are truly predictive of the target variable. This holds across applications such as fraud detection, credit risk scoring, and click-through rate prediction.

Some models, such as neural nets and SVMs, are able to implicitly re-represent the input variables and automatically discover good feature variables during the learning process. For example, some researchers claim that a neural net can approximate the atan function in its hidden neurons during training. This will be the subject of another post. From a data mining practitioner's perspective, however, nothing prevents us from taking advantage of our domain knowledge and computing the feature variables directly.
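
For a rough feel of the gap, the following sketch, assuming scikit-learn is available, trains the same linear model on synthetic pairs twice: once on the raw coordinates and once on an engineered feature. The generator make_pair, the choice of the absolute angle difference as a single feature (a small extension of the two-angle representation above, since a linear model needs the difference spelled out), and all parameters are my own illustrations, not from the post:

import math
import random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

random.seed(0)

def make_pair(parallel):
    # One labeled pair of vectors in the raw 8-coordinate representation.
    xa1, ya1, xb1, yb1 = (random.uniform(-5, 5) for _ in range(4))
    ta = random.uniform(-1.5, 1.5)
    tb = ta if parallel else random.uniform(-1.5, 1.5)
    la, lb = random.uniform(0.5, 5), random.uniform(0.5, 5)
    return (xa1, ya1, xa1 + la * math.cos(ta), ya1 + la * math.sin(ta),
            xb1, yb1, xb1 + lb * math.cos(tb), yb1 + lb * math.sin(tb))

labels = [i % 2 == 0 for i in range(400)]
raw = [make_pair(y) for y in labels]

def angle_diff(s):
    # Engineered feature: absolute difference of the two vector angles.
    xa1, ya1, xa2, ya2, xb1, yb1, xb2, yb2 = s
    return [abs(math.atan2(ya2 - ya1, xa2 - xa1) -
                math.atan2(yb2 - yb1, xb2 - xb1))]

# Parallelism is a quadratic relation in the raw coordinates, so a linear
# model sees little signal there; on the engineered feature the classes
# are (almost) trivially separable.
for name, X in [("raw coordinates", raw),
                ("angle difference", [angle_diff(s) for s in raw])]:
    Xtr, Xte, ytr, yte = train_test_split(X, labels, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print(name, model.score(Xte, yte))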
