## Thursday, November 28, 2013

### Take Randomly Sampled Records From the Table

#### Problem

How do we randomly sample, say 20 records, from the table below that contains 107 employees?
SQL> select EMPLOYEE_ID, FIRST_NAME, LAST_NAME from EMPLOYEES where rownum <=10;

EMPLOYEE_ID FIRST_NAME           LAST_NAME
----------- -------------------- -------------------------
100 Steven               King
101 Neena                Kochhar
102 Lex                  De Haan
103 Alexander            Hunold
104 Bruce                Ernst
105 David                Austin
106 Valli                Pataballa
107 Diana                Lorentz
108 Nancy                Greenberg
109 Daniel               Faviet

#### Solution

We use the method describe in Randomly Sample a Precise Number of Records and More on random sampling in Oracle. Let's follow the steps.
1. Verify that employee_id is indeed the unique identifier.

SQL> select count(*), count(distinct employee_id) from EMPLOYEES;

COUNT(*) COUNT(DISTINCTEMPLOYEE_ID)
---------- --------------------------
107                        107
2. Generate a random number for each employee_id using dbms_random.value. Function dbms_random.value generates uniformly distributed random number greater than or equal to 0 and less than 1.
SQL> create table tbl_emp_id_rnd as select EMPLOYEE_ID, dbms_random.value rnd
from EMPLOYEES;

Table created.
The following is what tbl_emp_id_rnd looks like.
SQL> select * from tbl_emp_id_rnd where rownum <=10;

EMPLOYEE_ID        RND
----------- ----------
100 .466996031
101 .325172718
102 .643593904
103 .822225992
104 .657242181
105 .244060518
106 .446914037
107 .423664122
108 .033736378
109 .405546964

10 rows selected.
3. Generate unique rank based on the random number for each record.
SQL> create table tbl_emp_id_rnd_rnk as select a.*,
row_number() over(order by rnd) rnk from tbl_emp_id_rnd a;

Table created.
The following is what the table with rank looks like. To get the randomly sampled 20 records, we simply select records with ranks less or equal 20. The trick is based on that the rank of random numbers are also random.
SQL> select * from tbl_emp_id_rnd_rnk where rnk <=20 order by rnk;

EMPLOYEE_ID        RND        RNK
----------- ---------- ----------
177 .001160678          1
121 .001293396          2
151 .013037966          3
126 .018864319          4
108 .033736378          5
158  .09466332          6
168 .095874461          7
165 .116139108          8
166 .116236033          9
149 .121777964         10
136 .126158543         11
116 .140726813         12
193 .147478226         13
143 .164627124         14
194 .175347727         15
145 .206125327         16
176 .207418722         17
160 .207918621         18
201  .21827842         19
183 .218636576         20

20 rows selected.

#### Conclusions

We use Oracle dbms_random.value to first generate a random number of each record in the table and then row_number functions to calculate the rank of the random number. We then use the rank to select the desired number of records. Please notice that it is not a good practice to use the employee_id in the table directly for sampling purpose because we are not sure if it is randomly generated.