I am compiling a list of leading data providers. This is the current list. If your favorite data providers are not included here and you feel they should be, please contact me at firstname.lastname@example.org. Thanks. Dr. Zhou.
Saturday, February 16, 2013
In previous posts, we mentioned that a great deal of the time should be spent on understanding data and building feature variables that are truly relevant to the target variable. The next question is: which predictive models should we use? There are many types of models to choose from. For example, for classification problems, the models we can use include CART, logistic regression, SVM, neural nets, k-nearest neighbors, Bayesian classification models, ensemble models, etc. Most of my PhD dissertation is dedicated to the empirical comparison of different models. In the commercial world, I have sometimes applied different models to solve the same problem. The following lift charts are the actual results for models that predict direct mail response. The models I tested include gradient boosting trees, CART, logistic regression, and a simple cell (or cube) model. The cell (or cube) model here divides the training data into many cubes and calculates the response rate for each cube. The predicted response rate for a new data point is that of the cube where it is located.
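The cell (or cube) model described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used for the lift charts: the class name `CellModel`, the cut points, and the toy (age, income) data are all hypothetical, and falling back to the overall response rate for cubes that contain no training data is an assumption the post does not specify.

```python
from bisect import bisect_right
from collections import defaultdict


class CellModel:
    """Cell (cube) model: one response rate per cube of binned features."""

    def __init__(self, bin_edges):
        # bin_edges[j] is a sorted list of cut points for feature j
        # (hypothetical input; the post does not say how cubes are defined).
        self.bin_edges = bin_edges
        self.cell_rates = {}
        self.overall_rate = 0.0

    def _cell(self, x):
        # Map a data point to the tuple of bin indices it falls into.
        return tuple(bisect_right(edges, v) for edges, v in zip(self.bin_edges, x))

    def fit(self, X, y):
        counts = defaultdict(int)
        responses = defaultdict(int)
        for x, label in zip(X, y):
            cell = self._cell(x)
            counts[cell] += 1
            responses[cell] += label
        # Response rate of each cube that contains training data.
        self.cell_rates = {c: responses[c] / counts[c] for c in counts}
        self.overall_rate = sum(y) / len(y)
        return self

    def predict(self, x):
        # Assumption: a cube with no training data gets the overall rate.
        return self.cell_rates.get(self._cell(x), self.overall_rate)


# Toy data: (age, income in thousands) and a 0/1 mail response flag.
X = [(25, 30), (28, 35), (22, 20), (35, 45),
     (45, 80), (50, 90), (47, 85), (30, 95)]
y = [0, 1, 0, 0, 1, 1, 0, 0]

# One cut point per feature gives a 2 x 2 grid of cubes.
model = CellModel(bin_edges=[[40], [50]]).fit(X, y)

rate_seen_cube = model.predict((26, 40))   # cube with 4 training points
rate_empty_cube = model.predict((60, 20))  # cube with no training points
```

In practice the number of cubes grows quickly with the number of features and bins, so many cubes end up with few or no training points; that is why some fallback rule, such as the one assumed here, is needed.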
Wednesday, February 13, 2013
Abraham Lincoln said, "If I had six hours to chop down a tree, I'd spend the first four hours sharpening the axe." The same principle can be applied to a predictive modeling project. From the discussion in previous posts, we can see that most of the time is not spent on building the model itself. Thus we may say, "If I had ten days to build a predictive model, I'd spend the first seven days understanding data and building feature variables that are truly relevant to the target variable."