Sunday, March 06, 2016

The Danger of Predictive Model Overfitting

In the post "A Young Data Scientist - Kaggle Competition Top 5% Winner: Yuyu Zhou", Yuyu talks about the important role that feature engineering (i.e., finding good derived variables) and gradient boosting trees played in his success. He also shared a very interesting observation about Kaggle competition rankings.

"After the participating teams finish building their predictive models, they apply their models to two data sets to generate predictions: a smaller data set that contains the target variable and a larger data set from which the target variable has been removed. Each team's model is temporarily ranked based on its results on the smaller data set with the target variable. After the competition deadline passes, Kaggle calculates the final ranking of each team based on the model's predictions on the larger data set."
"It is interesting to see that the rankings of some top 1% models based on the smaller data set drop more than 20% on the larger data set. I figured out what might cause the huge discrepancies in their model performance. Those teams' models fit the smaller data set so well that they lose their capability to generalize. It is a typical overfitting problem."
It is important to avoid model overfitting. After all, a predictive model is only useful if it can generalize, i.e., if it can handle new data reasonably well.
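To make the ranking drop Yuyu describes more concrete, here is a minimal simulation sketch. It is not Kaggle's actual scoring code, and every name and number in it (`simulate_leaderboard`, 200 submissions, 200 public rows, 5000 private rows, a true accuracy of 0.70) is an assumption chosen for illustration. Every simulated submission has exactly the same real skill; we simply pick the one that happens to score best on the small data set and then look at how it does on the large one. This shows only one way a model can end up "fitting" the small set too well, namely by being selected or tuned against it.

```python
import random


def simulate_leaderboard(n_submissions=200, n_public=200, n_private=5000,
                         true_accuracy=0.70, seed=0):
    """Hypothetical sketch: all submissions share the same true accuracy.

    Returns (public_score, private_score) for the submission that ranks
    first on the small "public" data set.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_submissions):
        # Each row is predicted correctly with probability true_accuracy.
        public_hits = sum(rng.random() < true_accuracy for _ in range(n_public))
        private_hits = sum(rng.random() < true_accuracy for _ in range(n_private))
        results.append((public_hits / n_public, private_hits / n_private))

    # Rank by the small data set only, as the temporary leaderboard does.
    best_public, its_private = max(results, key=lambda r: r[0])
    return best_public, its_private


best_public, its_private = simulate_leaderboard()
print(f"Best score on the small data set:      {best_public:.3f}")
print(f"Same submission on the large data set: {its_private:.3f}")
```

Under these assumptions, the top score on the small data set is noticeably higher than the same submission's score on the large data set, even though no submission is truly better than any other. The more a model is chosen or tweaked to look good on a small sample, the more of its apparent edge is just noise that will not carry over to new data.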
