For someone who is looking for a job in data analytics field, high rankings in Kaggle Competition will give him tremendous advantage. Employers see the competition winners having strong problem solving skills and hands-on expertise. Indeed, to be able to complete some projects for Kaggle competitions, participants have to put in a lot of effort to deal with data issues even before any predictive models can be built. This is very similar to that in the real world applications where 80% of time is spent on data cleanse and manipulation. It is not a surprise that some employers prefer Kaggle competition winners over PhD graduates whose skills are perceived as more theoretical. Taking my nephew, Yuyu Zhou, as an example. He got a master degree in data analytics. While he was in school, he spent a few weeks with other classmates to participate in Kaggle competitions. His team has achieved the top 3% and 5% in two Kaggle prediction competitions respectively (see my blog post a Young Data Scientist- Kaggle Competition Top 5% Winner: Yuyu Zhou. Once he graduated, he was quickly hired by a prestigious company and has been earning PhD level salary. Those few weeks he spent on Kaggle Competition was the best time investment of his life.
I am compiling a list of leading data providers. This is the current list. If your favorite data providers are not included here and you feel they should be, please contact me at firstname.lastname@example.org. Thanks. Dr. Zhou.
Wednesday, October 26, 2016
Thursday, October 20, 2016
Oracle ora_hash() is a very useful function. I have used it for different purposes such as generating random number. The following query generate 5 buckets from 0 to 4, each of them have the similar number of records. First, we create a table and populate it with 1,000 numbers.
create table t_n (n number); begin for i in 1..1000 loop insert into t_n values(i); end loop; commit; end;In the query below, the parameter 5 of ora_hash defines the number of buckets is 5. As we see, each bucket has simlilar number of records.
with tbl as ( select ora_hash(n, 5) bucket, n from t_n) select bucket, count(*), min(n), max(n) from tbl group by bucket order by 1; BBUCKET COUNT(*) MIN(N) MAX(N) 0 154 2 993 1 164 7 999 2 168 6 991 3 175 4 995 4 173 8 1000
Saturday, July 02, 2016
Listagg is a useful function to combine column values from different rows. In a sense, it is similar to sum() function that aggregates information across records. To show how listagg works, first we create a table that has record id, group id and words.
create table t(id number, gp number, word varchar2(128)); insert into t values(1,1,'Hello'); insert into t values(2,1,'World!'); insert into t values(3,2,'Good'); insert into t values(4,2,'Morning!');
select * from t; ID GP WORD 1 1 Hello 2 1 World! 3 2 Good 4 2 Morning!We can combine the words by group using listagg function.
select gp, listagg(word,' ') within group(order by id) from t group by gp order by gp; GP LISTAGG(WORD,'')WITHINGROUP(ORDERBYID) 1 Hello World! 2 Good Morning!
Saturday, June 11, 2016
I have found an excellent website to learn Oracle PL/SQL Oracle SQL live. From the website, we can run SQL or PL/SQL commands. There are also many scripts in Code Library that we can learn from. Here are screenshots.
Wednesday, May 11, 2016
- ZSales: 70% increase in response rate
- ZClaim: 40% reduction in insurance claim losses
- ZCheck: 70% reduction in charge off
- ZCredit: 50% reduction in delinquency rate
Sunday, March 06, 2016
In the post a Young Data Scientist- Kaggle Competition Top 5% Winner: Yuyu Zhou, Yuyu talks about the important role of feature engineering, i.e, finding good derived variables, and gradient boosting trees in their success. He also tells me a very interesting observations on Kaggle Competition ranking.
"After participating teams finished building their predictive models, they apply their models to two data sets to generate predictions: a smaller set containing target variable and a larger data set where the target variable is removed. Each participating team's model is temporarily ranked based on the result on the smaller data set with the target variable. After the competition's deadline is due, Kaggle will calculate the final ranking of each team based on a model's prediction on the larger data set. "
"It is interesting to see that the rankings of some top 1% models based on the smaller data set drop more than 20% on the larger data set. I figure out what might cause the huge discrepancies in their model performance. Those teams' models fit the smaller data set so well that they lose their capability to generalize. It is a typical overfitting problem."It is important to avoid model overfitting. After all, a predictive model is only useful if it can generalize, i.e., able to handle new data reasonably well.
Saturday, March 05, 2016
Yuyu Zhou is a graduate student in Analytics in University of New Hampshire. His team has achieved the top 3% and 5% in two Kaggle prediction competitions respectively. In an interview, I asked him how their predictive models performed so well. Yuyu said,
"One of the keys to the success is that we spend tremendous amount of time working on building feature variables. Those variables are usually the results of combining several raw variables. For example, the ratio between the body weight and height is a better variable in predicting a patient's health than using body weight or height alone."
"My training in computer science is extremely helpful in these projects. I am able to write Java, Python and SQL scripts to perform tasks such as data cleansing, data merge, and data transform, etc. As we know, more than 80% of time in a project is typically spent on those tasks before we start building predictive models."
"We have tried many type of predictive models and found that gradient boosting trees have consistently perform the best."
The following is a summary of Yuyu's contribution in those two projects.
Kaggle Competition: Rossmann Store Sales Prediction (ranked top 5%) Oct 2015 – Dec 2015
- Built the Predictive Model for daily sales for Rossmann Stores using Python Machine Learning library.
- Conducted data cleaning and feature engineering for increasing data quality.
- Designed final prediction model by combining the multiple gradient boosting trees algorithms
- Prediction accuracy was ranked at 163 out of 3303 teams
Kaggle Competition: Property Risk Level Prediction (ranked top 3%) July 2015 – Aug 2015
- Developed Statistics models to predict risk level of properties which Liberty Mutual Inc is going to protect.
- Led the team and conducted cost and benefit analysis on new ideas.
- Implemented ideas using statistical packages from Python.
- Prediction accuracy was ranked at 71 out of 2236 teams.
Friday, March 04, 2016
We may run the following query from SYS to find the tablespace names for all tables, including IOTs (index organized table) and partitioned tables (replacing 'DMUSER' with your user name).
select u.name owner, o.name table_name,t.name tablspace_name from obj$ o, ts$ t, sys_objects s, user$ u where o.obj#=s.object_id and s.ts_number=t.ts# and o.owner#=u.user# and o.type#=2 and u.name='DMUSER' order by 1,2; OWNER TABLE_NAME TABLSPACE_NAME -------------------------------- -------------------------------- DMUSER AI_EXPLAIN_OUTPUT TBS_1 DMUSER AR_SH_SAMPLE_SETTINGS TBS_1 DMUSER DM$P0AR_SH_SAMPLE TBS_1 DMUSER DM$P0AR_SH_SAMPLE_2COL TBS_1 DMUSER DM$P0EM_SH_CLUS_SAMPLE TBS_1 DMUSER DM$P0NB_SH_CLAS_SAMPLE TBS_1 DMUSER DM$P0OC_SH_CLUS_SAMPLE TBS_1 DMUSER DM$P1EM_SH_CLUS_SAMPLE TBS_1 DMUSER DM$P1NB_SH_CLAS_SAMPLE TBS_1 DMUSER DM$P1OC_SH_CLUS_SAMPLE TBS_1
Wednesday, March 02, 2016
Wednesday, February 17, 2016
I shared my analytics experience with a group of students and professors from University of New Hampshire Analytics on February 25. I really enjoyed the interaction with professors and students. The following is the news on UNH Graduate Analytics LinkedIn page.
Dr. Zhou highlighted the advantages and disadvantages of using statistical models vs. intuition to solve business problems throughout his 10+ years of experience. Among his many accomplishments, Dr. Zhou built a fraud detection model for large banks, which detects 70% more fraud than using business intuition rules alone.
Bernard Marr wrote a post Big Data: 33 Brilliant And Free Data Sources For 2016. He gives a list of free data sources provided by governments, social websites and companies including Google and Facebook. The data types include census, climate, financial, health, news, etc. I have found the list very interesting. Springboard "19 Free Public Data Sets For Your First Data Science Project" has good information about free data sets such as United States Census Data, FBI Crime Data and CDC Cause of Death etc.