Problem
How do we randomly sample, say 20 records, from the table below that contains 107 employees?
SQL> select EMPLOYEE_ID, FIRST_NAME, LAST_NAME from EMPLOYEES where rownum <=10;
EMPLOYEE_ID FIRST_NAME LAST_NAME
----------- -------------------- -------------------------
100 Steven King
101 Neena Kochhar
102 Lex De Haan
103 Alexander Hunold
104 Bruce Ernst
105 David Austin
106 Valli Pataballa
107 Diana Lorentz
108 Nancy Greenberg
109 Daniel Faviet
Solution
We use the method describe in Randomly Sample a Precise Number of Records and More on random sampling in Oracle. Let's follow the steps.
1. Verify that employee_id is indeed the unique identifier.
SQL> select count(*), count(distinct employee_id) from EMPLOYEES;
COUNT(*) COUNT(DISTINCTEMPLOYEE_ID)
---------- --------------------------
107 107
2. Generate a random number for each employee_id using dbms_random.value. Function dbms_random.value generates uniformly distributed random number greater than or equal to 0 and less than 1.
SQL> create table tbl_emp_id_rnd as select EMPLOYEE_ID, dbms_random.value rnd from EMPLOYEES; Table created.The following is what tbl_emp_id_rnd looks like.
SQL> select * from tbl_emp_id_rnd where rownum <=10;
EMPLOYEE_ID RND
----------- ----------
100 .466996031
101 .325172718
102 .643593904
103 .822225992
104 .657242181
105 .244060518
106 .446914037
107 .423664122
108 .033736378
109 .405546964
10 rows selected.
3. Generate unique rank based on the random number for each record.
SQL> create table tbl_emp_id_rnd_rnk as select a.*, row_number() over(order by rnd) rnk from tbl_emp_id_rnd a; Table created.The following is what the table with rank looks like. To get the randomly sampled 20 records, we simply select records with ranks less or equal 20. The trick is based on that the rank of random numbers are also random.
SQL> select * from tbl_emp_id_rnd_rnk where rnk <=20 order by rnk;
EMPLOYEE_ID RND RNK
----------- ---------- ----------
177 .001160678 1
121 .001293396 2
151 .013037966 3
126 .018864319 4
108 .033736378 5
158 .09466332 6
168 .095874461 7
165 .116139108 8
166 .116236033 9
149 .121777964 10
136 .126158543 11
116 .140726813 12
193 .147478226 13
143 .164627124 14
194 .175347727 15
145 .206125327 16
176 .207418722 17
160 .207918621 18
201 .21827842 19
183 .218636576 20
20 rows selected.
Conclusions
We use Oracle dbms_random.value to first generate a random number of each record in the table and then row_number functions to calculate the rank of the random number. We then use the rank to select the desired number of records. Please notice that it is not a good practice to use the employee_id in the table directly for sampling purpose because we are not sure if it is randomly generated.
No comments:
Post a Comment