Saturday, February 16, 2013

The Comparison of Different Models

In previous posts, we mentioned that a great deal of time should be spent on understanding the data and building feature variables that are truly relevant to the target variable. The next question is: which predictive model should we use? There are many types of models to choose from. For example, for classification problems we can use CART, logistic regression, SVM, neural nets, K-nearest neighbors, Bayesian classifiers, ensemble models, etc. Most of my PhD dissertation is dedicated to the empirical comparison of different models. In the commercial world, I have sometimes applied different models to the same problem. The following lift charts show the actual results for models that predict direct mail response. The models I tested include gradient boosting trees, CART, logistic regression, and a simple cell (or cube) model. The cell (or cube) model divides the training data into many cubes and calculates the response rate for each cube; the predicted response rate for a new data point is that of the cube in which it falls.
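To make the cell (or cube) model concrete, here is a minimal sketch of the idea described above. The class name `CellModel`, the quantile-based binning, and the fallback to the overall response rate for cubes unseen in training are my own illustrative choices, not details from the original work:

```python
import numpy as np

class CellModel:
    """A simple cell (cube) model: discretize each feature into
    quantile bins, then score each cube by its training response rate."""

    def __init__(self, n_bins=3):
        self.n_bins = n_bins

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y, float)
        # Interior quantile cut points per feature define the cubes.
        qs = np.linspace(0, 1, self.n_bins + 1)[1:-1]
        self.edges_ = [np.quantile(X[:, j], qs) for j in range(X.shape[1])]
        self.overall_ = y.mean()  # fallback rate for cubes with no training data
        sums, counts = {}, {}
        for row, resp in zip(X, y):
            key = tuple(int(np.searchsorted(e, v))
                        for e, v in zip(self.edges_, row))
            sums[key] = sums.get(key, 0.0) + resp
            counts[key] = counts.get(key, 0) + 1
        # Response rate observed in each cube.
        self.rates_ = {k: sums[k] / counts[k] for k in sums}
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        keys = (tuple(int(np.searchsorted(e, v))
                      for e, v in zip(self.edges_, row))
                for row in X)
        # A new point gets the response rate of its cube, or the
        # overall training rate if that cube was never seen.
        return np.array([self.rates_.get(k, self.overall_) for k in keys])
```

One design note: with many features or many bins, most cubes contain few or no training records, which is the curse of dimensionality that more structured models like CART avoid by splitting adaptively.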

As we can see from the above lift charts, gradient boosting trees performs best, while logistic regression and CART are almost the same. However, all the models, though they vary greatly in structure and sophistication, perform satisfactorily on the test data set.
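For readers unfamiliar with lift charts: the chart sorts test records by predicted score, takes cumulative slices from the top, and compares each slice's response rate to the overall (random-mailing) rate. The sketch below is my own illustration of that computation, not the code behind the charts above; the function name and decile granularity are assumptions:

```python
import numpy as np

def lift_table(scores, outcomes, n_slices=10):
    """Cumulative lift: rank records by descending model score, then
    for each cumulative slice report (fraction contacted, lift), where
    lift = slice response rate / overall response rate."""
    scores = np.asarray(scores, float)
    outcomes = np.asarray(outcomes, float)
    order = np.argsort(-scores)          # best prospects first
    sorted_resp = outcomes[order]
    n = len(scores)
    cuts = (np.arange(1, n_slices + 1) * n) // n_slices
    base_rate = outcomes.mean()          # rate under random mailing
    table = []
    for c in cuts:
        captured = sorted_resp[:c].mean()  # response rate in the top slice
        table.append((c / n, captured / base_rate))
    return table
```

A model with no predictive power gives lift near 1.0 everywhere; a good direct-mail model shows lift well above 1.0 in the first few deciles, which is exactly what separates the curves in charts like these.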

However, in reality the selection of a model should not be based solely on its predictive accuracy. Other important considerations include how easily the model can be deployed into a production system, its computational efficiency and memory usage, and whether it can give a reason for its predictions. It is completely acceptable to choose a simple model that performs reasonably well. I have seen too many cases where statisticians built great (and sophisticated) models in their lab environments that could not be deployed into the production system. In those cases, the benefits of predictive modeling were never realized.
