Thursday, November 28, 2013

Take Randomly Sampled Records From the Table

Problem

How do we randomly sample, say 20 records, from the table below that contains 107 employees?
SQL> select EMPLOYEE_ID, FIRST_NAME, LAST_NAME from EMPLOYEES where rownum <=10;

EMPLOYEE_ID FIRST_NAME           LAST_NAME
----------- -------------------- -------------------------
        100 Steven               King
        101 Neena                Kochhar
        102 Lex                  De Haan
        103 Alexander            Hunold
        104 Bruce                Ernst
        105 David                Austin
        106 Valli                Pataballa
        107 Diana                Lorentz
        108 Nancy                Greenberg
        109 Daniel               Faviet

Solution

We use the method describe in Randomly Sample a Precise Number of Records and More on random sampling in Oracle. Let's follow the steps.
1. Verify that employee_id is indeed the unique identifier.

SQL> select count(*), count(distinct employee_id) from EMPLOYEES;

  COUNT(*) COUNT(DISTINCTEMPLOYEE_ID)
---------- --------------------------
       107                        107
2. Generate a random number for each employee_id using dbms_random.value. Function dbms_random.value generates uniformly distributed random number greater than or equal to 0 and less than 1.
SQL> create table tbl_emp_id_rnd as select EMPLOYEE_ID, dbms_random.value rnd 
from EMPLOYEES;

Table created.
The following is what tbl_emp_id_rnd looks like.
SQL> select * from tbl_emp_id_rnd where rownum <=10;

EMPLOYEE_ID        RND
----------- ----------
        100 .466996031
        101 .325172718
        102 .643593904
        103 .822225992
        104 .657242181
        105 .244060518
        106 .446914037
        107 .423664122
        108 .033736378
        109 .405546964

10 rows selected.
3. Generate unique rank based on the random number for each record.
SQL> create table tbl_emp_id_rnd_rnk as select a.*, 
row_number() over(order by rnd) rnk from tbl_emp_id_rnd a;

Table created.
The following is what the table with rank looks like. To get the randomly sampled 20 records, we simply select records with ranks less or equal 20. The trick is based on that the rank of random numbers are also random.
SQL> select * from tbl_emp_id_rnd_rnk where rnk <=20 order by rnk;

EMPLOYEE_ID        RND        RNK
----------- ---------- ----------
        177 .001160678          1
        121 .001293396          2
        151 .013037966          3
        126 .018864319          4
        108 .033736378          5
        158  .09466332          6
        168 .095874461          7
        165 .116139108          8
        166 .116236033          9
        149 .121777964         10
        136 .126158543         11
        116 .140726813         12
        193 .147478226         13
        143 .164627124         14
        194 .175347727         15
        145 .206125327         16
        176 .207418722         17
        160 .207918621         18
        201  .21827842         19
        183 .218636576         20

20 rows selected.

Conclusions

We use Oracle dbms_random.value to first generate a random number of each record in the table and then row_number functions to calculate the rank of the random number. We then use the rank to select the desired number of records. Please notice that it is not a good practice to use the employee_id in the table directly for sampling purpose because we are not sure if it is randomly generated.

No comments: