Yuyu Zhou is a graduate student in Analytics at the University of New Hampshire. His team placed in the top 3% and top 5% of two Kaggle prediction competitions. In an interview, I asked him how their predictive models performed so well. Yuyu said,
"One of the keys to our success is that we spend a tremendous amount of time building feature variables. Those variables are usually the result of combining several raw variables. For example, the ratio of body weight to height is a better predictor of a patient's health than body weight or height alone."
"My training in computer science is extremely helpful in these projects. I am able to write Java, Python, and SQL scripts to perform tasks such as data cleansing, merging, and transformation. As we know, more than 80% of the time in a project is typically spent on those tasks before we start building predictive models."
"We have tried many types of predictive models and found that gradient boosting trees consistently perform the best."
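The weight-to-height ratio Yuyu mentions is a simple case of a derived feature. A minimal sketch in Python follows; the patient records, the field names, and the helper `add_weight_height_ratio` are all made up for illustration and do not come from either competition.

```python
# Hypothetical patient records; field names and values are invented for illustration.
patients = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 95.0, "height_m": 1.75},
    {"weight_kg": 95.0, "height_m": 1.98},
]


def add_weight_height_ratio(record):
    """Return a copy of the record with a derived weight-to-height ratio."""
    enriched = dict(record)  # copy so the raw record is left unchanged
    enriched["weight_per_height"] = record["weight_kg"] / record["height_m"]
    return enriched


features = [add_weight_height_ratio(p) for p in patients]
for row in features:
    print(row)
```

The second and third patients have the same weight, but the ratio separates them because their heights differ. That is the kind of information a model cannot read directly from either raw variable on its own.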
The following is a summary of Yuyu's contributions to those two projects.
Kaggle Competition: Rossmann Store Sales Prediction (ranked top 5%) Oct 2015 – Dec 2015
- Built a predictive model of daily sales for Rossmann stores using a Python machine learning library.
- Conducted data cleaning and feature engineering to improve data quality.
- Designed the final prediction model by combining multiple gradient boosting tree models.
- Prediction accuracy ranked 163rd out of 3,303 teams.
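The idea behind gradient boosting trees can be illustrated with a small, dependency-free sketch. It assumes squared-error loss, a single input feature, and one-split trees (stumps). The data and the function names `fit_stump`, `fit_boosted_stumps`, and `predict` are invented for illustration; this is not the team's code, which relied on a Python machine learning library.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on x that best fits the residuals (least squares)."""
    best = None
    # Skip the largest x so that both sides of every split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Invented toy data: low values for small x, high values for large x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = fit_boosted_stumps(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict(model, x) for x in xs])
print("baseline MSE:", baseline_mse)
print("boosted MSE:", boosted_mse)
```

Each round corrects part of what the previous rounds got wrong, and the learning rate keeps any single stump from dominating. One plausible way to combine several such models, though the summary above does not say how the team did it, is to average their predictions.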
Kaggle Competition: Property Risk Level Prediction (ranked top 3%) July 2015 – Aug 2015
- Developed statistical models to predict the risk level of properties that Liberty Mutual Inc. is going to protect.
- Led the team and conducted cost and benefit analysis on new ideas.
- Implemented ideas using Python statistical packages.
- Prediction accuracy ranked 71st out of 2,236 teams.