Thursday, November 28, 2013

More on Taking Randomly Sampled Records From the Table

Problem

Random sampling is a very common task in a data analytics project. Sometime we want to sample precise number of records such as described in the post Take Randomly Sampled Records From the Table. Sometime, we randomly split a data set into training and testing sets at 70/30 (or whatever ratio). How do we efficiently perform multiple random sampling tasks?

Solution

I have found it is convenient to create a permanent table that contains the unique record id (or account id, card number, etc.) and a uniformly distributed random number. The following is an example of creating such a table described in Take Randomly Sampled Records From the Table. dbms_random.value returns uniformly distributed random number ranging from 0 to 1.

SQL> create table tbl_emp_id_rnd as select EMPLOYEE_ID, 
dbms_random.value rnd from EMPLOYEES;

Table created.

SQL> select * from tbl_emp_id_rnd where rownum <=10;

EMPLOYEE_ID        RND
----------- ----------
        100 .466996031
        101 .325172718
        102 .643593904
        103 .822225992
        104 .657242181
        105 .244060518
        106 .446914037
        107 .423664122
        108 .033736378
        109 .405546964

10 rows selected.
We can perform many types of random sampling based on such a table whenever we need to. For example, we can split the records into 70% training and 30% testing set. And we can change the ratio of 70/30 easily.
SQL> create view v_taining_emp_id as select EMPLOYEE_ID 
from tbl_emp_id_rnd where rnd <=0.7;

View created.

SQL> create view v_testing_emp_id as select EMPLOYEE_ID 
from tbl_emp_id_rnd where rnd >0.7;

View created.
I have found such a table is extremely helpful in a project where randomly sampling is performed multiple times.

No comments: