Wednesday, October 26, 2016

Ranking High in Kaggle Competitions Is a Huge Advantage for Job Seekers

For someone looking for a job in the data analytics field, a high ranking in Kaggle competitions is a tremendous advantage. Employers see competition winners as having strong problem-solving skills and hands-on expertise. Indeed, to complete a Kaggle competition project, participants have to put a lot of effort into dealing with data issues before any predictive model can be built. This is very similar to real-world applications, where 80% of the time is spent on data cleansing and manipulation. It is no surprise that some employers prefer Kaggle competition winners over PhD graduates, whose skills are perceived as more theoretical.

Take my nephew, Yuyu Zhou, as an example. He earned a master's degree in data analytics. While in school, he spent a few weeks participating in Kaggle competitions with classmates. His team finished in the top 3% and top 5% of two Kaggle prediction competitions, respectively (see my blog post "A Young Data Scientist- Kaggle Competition Top 5% Winner: Yuyu Zhou"). After graduating, he was quickly hired by a prestigious company and has been earning a PhD-level salary. Those few weeks he spent on Kaggle competitions were the best time investment of his life.

Thursday, October 20, 2016

Oracle Ora_Hash function- Part 1 Random Sampling

Oracle ora_hash() is a very useful function. I have used it for different purposes, such as generating random numbers. The query below assigns records to buckets numbered from 0 to 5, each holding a similar number of records.
First, we create a table and populate it with 1,000 numbers.

create table t_n (n number);

-- Populate t_n with the integers 1 through 1000.
begin
  for i in 1..1000
  loop
    insert into t_n values (i);
  end loop;
  commit;
end;
/
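As an aside, the same 1,000 rows could also be loaded with a single SQL statement instead of a PL/SQL loop. The following is only a minimal sketch, assuming t_n has just been created and is still empty; it would be run in place of the block above, not in addition to it.

```sql
-- Alternative to the PL/SQL loop: generate 1..1000 with a row generator.
-- Assumes t_n exists and is empty.
insert into t_n
select level from dual connect by level <= 1000;

commit;
```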
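In the query below, the second argument of ora_hash, 5, is the maximum bucket number, so the function returns values from 0 to 5 (six buckets). To get exactly five buckets, numbered 0 to 4, pass 4 instead. As we see, each bucket has a similar number of records.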
with tbl as
(
select ora_hash(n, 5) bucket, n  from t_n)
select bucket, count(*), min(n), max(n) from tbl
group by bucket order by 1;
BUCKET COUNT(*) MIN(N) MAX(N)
0 154          2  993
1 164          7  999
2 168          6  991
3 175          4  995
4 173          8 1000