Problem
Random sampling is a very common task in data analytics projects. Sometimes we want to sample a precise number of records, as described in the post Take Randomly Sampled Records From the Table. Other times we randomly split a data set into training and testing sets at a 70/30 (or other) ratio. How do we efficiently perform multiple random sampling tasks?
Solution
I have found it convenient to create a permanent table that contains the unique record id (or account id, card number, etc.) and a uniformly distributed random number. The following example, taken from Take Randomly Sampled Records From the Table, creates such a table; dbms_random.value returns a uniformly distributed random number ranging from 0 to 1.
SQL> create table tbl_emp_id_rnd as select EMPLOYEE_ID,
dbms_random.value rnd from EMPLOYEES;
Table created.
SQL> select * from tbl_emp_id_rnd where rownum <=10;
EMPLOYEE_ID RND
----------- ----------
100 .466996031
101 .325172718
102 .643593904
103 .822225992
104 .657242181
105 .244060518
106 .446914037
107 .423664122
108 .033736378
109 .405546964
10 rows selected.
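The stored rnd column can also drive exact-size samples: ordering by rnd and keeping the first N rows selects N records uniformly at random. A sketch, assuming the tbl_emp_id_rnd table above; the sample size of 20 is arbitrary:

SQL> select EMPLOYEE_ID from
       (select EMPLOYEE_ID from tbl_emp_id_rnd order by rnd)
     where rownum <= 20;

Because rnd is stored, this query returns the same 20 records every time it runs, which is useful for reproducibility; to draw a fresh sample, regenerate the rnd column.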
We can perform many types of random sampling based on such a table whenever we need to. For example, we can split the records into 70% training and 30% testing set. And we can change the ratio of 70/30 easily.
SQL> create view v_training_emp_id as select EMPLOYEE_ID
     from tbl_emp_id_rnd where rnd <= 0.7;
View created.
SQL> create view v_testing_emp_id as select EMPLOYEE_ID
     from tbl_emp_id_rnd where rnd > 0.7;
View created.
I have found such a table extremely helpful in projects where random sampling is performed multiple times.
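Beyond a two-way split, the same rnd column can assign each record to one of k roughly equal folds, for example for cross-validation. A sketch assuming k = 10; the view name v_emp_fold is illustrative, not from the original post:

SQL> create view v_emp_fold as
     select EMPLOYEE_ID, trunc(rnd * 10) + 1 fold
     from tbl_emp_id_rnd;

Since rnd is uniform on (0, 1), trunc(rnd * 10) + 1 yields fold numbers 1 through 10, each holding about one tenth of the records.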