Tuesday, January 15, 2013

How to build predictive models that win competitions. (continued)

To build predictive models that are accurate and robust, it is crucial to find input variables, or so-called feature variables, that are truly predictive of the target variable. This is even more important than choosing the type of predictive model. Good feature variables can greatly reduce the complexity of the learning task.

For example, suppose we want to build a model that predicts whether a pair of vectors is parallel, as illustrated in the following chart.


We can represent the pair of vectors by the coordinates of their start and end points, i.e., (Xa1, Ya1, Xa2, Ya2, Xb1, Yb1, Xb2, Yb2).

Suppose we have a few hundred training samples (pairs of vectors) that are labeled as parallel or not. The following are some examples.

Sample 1: variables (0,0,1,1,1,0,2,1) target: yes (parallel)
Sample 2: variables (0,0,2,2,2,0,4,2) target: yes (parallel)
Sample 3: variables (0,0,2,2,2,0,4,4) target: no (not parallel)
......

However, it would be extremely challenging to build predictive models using (Xa1,Ya1,Xa2,Ya2,Xb1,Yb1,Xb2,Yb2) directly as the input variables. Most models would perform very poorly and be practically useless.

If we realize that we can compare the angles of the two vectors to decide whether they are parallel, we can build two new variables using the atan function as follows:
(atan((Ya2-Ya1)/(Xa2-Xa1)), atan((Yb2-Yb1)/(Xb2-Xb1)))

The atan function is the inverse tangent. Using the new variables, the above training samples become:

Sample 1: variables (0.785, 0.785) target: yes (parallel)
Sample 2: variables (0.785, 0.785) target: yes (parallel)
Sample 3: variables (0.785, 1.107) target: no (not parallel)
......
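
To make this concrete, here is a minimal Python sketch (my own illustration, not part of the original post) that maps the raw eight-coordinate representation to the two angle variables:

import math

def angle_features(sample):
    # Unpack the raw representation: start and end points of vectors a and b.
    xa1, ya1, xa2, ya2, xb1, yb1, xb2, yb2 = sample
    # Angle of each vector, via the inverse tangent of its slope.
    # Caveat: this divides by zero for vertical vectors; math.atan2(dy, dx)
    # avoids that, though its angles then distinguish direction, so opposite
    # but parallel vectors would need to be compared modulo pi.
    theta_a = math.atan((ya2 - ya1) / (xa2 - xa1))
    theta_b = math.atan((yb2 - yb1) / (xb2 - xb1))
    return (theta_a, theta_b)

# The three training samples from above, in the raw representation.
for s in [(0, 0, 1, 1, 1, 0, 2, 1),
          (0, 0, 2, 2, 2, 0, 4, 2),
          (0, 0, 2, 2, 2, 0, 4, 4)]:
    print(angle_features(s))
# (0.785..., 0.785...), (0.785..., 0.785...), (0.785..., 1.107...)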

This representation greatly simplifies the problem: when the two angle values are (nearly) equal, the two vectors are parallel.
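
With these features, the decision rule itself collapses to a single comparison; a minimal sketch, with the tolerance picked arbitrarily for illustration:

def is_parallel(theta_a, theta_b, tol=1e-6):
    # Two vectors are parallel when their angles (nearly) coincide.
    return abs(theta_a - theta_b) < tol

print(is_parallel(0.785, 0.785))  # True
print(is_parallel(0.785, 1.107))  # False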

Thus, before we build models, a great deal of time should be spent constructing feature variables that are truly predictive of the target variable. This holds across applications such as fraud detection, credit risk scoring, and click-through rate prediction.

Some models, such as neural nets and SVMs, are able to implicitly re-represent the input variables and automatically discover good feature variables during the learning process. For example, some researchers claim that a neural net can approximate the atan function in its hidden neurons during training. This will be the subject of another post. From a data mining practitioner's perspective, however, nothing prevents us from taking advantage of our domain knowledge and computing the feature variables directly.
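
For a rough feel of the gap, the following sketch, assuming scikit-learn is available, trains the same linear model on synthetic pairs twice: once on the raw coordinates and once on an engineered feature. The generator make_pair, the choice of the absolute angle difference as a single feature (a small extension of the two-angle representation above, since a linear model needs the difference spelled out), and all parameters are my own illustrations, not from the post:

import math
import random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

random.seed(0)

def make_pair(parallel):
    # One labeled pair of vectors in the raw 8-coordinate representation.
    xa1, ya1, xb1, yb1 = (random.uniform(-5, 5) for _ in range(4))
    ta = random.uniform(-1.5, 1.5)
    tb = ta if parallel else random.uniform(-1.5, 1.5)
    la, lb = random.uniform(0.5, 5), random.uniform(0.5, 5)
    return (xa1, ya1, xa1 + la * math.cos(ta), ya1 + la * math.sin(ta),
            xb1, yb1, xb1 + lb * math.cos(tb), yb1 + lb * math.sin(tb))

labels = [i % 2 == 0 for i in range(400)]
raw = [make_pair(y) for y in labels]

def angle_diff(s):
    # Engineered feature: absolute difference of the two vector angles.
    xa1, ya1, xa2, ya2, xb1, yb1, xb2, yb2 = s
    return [abs(math.atan2(ya2 - ya1, xa2 - xa1) -
                math.atan2(yb2 - yb1, xb2 - xb1))]

# Parallelism is a quadratic relation in the raw coordinates, so a linear
# model sees little signal there; on the engineered feature the classes
# are (almost) trivially separable.
for name, X in [("raw coordinates", raw),
                ("angle difference", [angle_diff(s) for s in raw])]:
    Xtr, Xte, ytr, yte = train_test_split(X, labels, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print(name, model.score(Xte, yte))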
