Deep Data Mining Blog: More on Taking Randomly Sampled Records From the Table

Thursday, November 28, 2013

More on Taking Randomly Sampled Records From the Table

Problem

Random sampling is a very common task in a data analytics project. Sometime we want to sample precise number of records such as described in the post Take Randomly Sampled Records From the Table. Sometime, we randomly split a data set into training and testing sets at 70/30 (or whatever ratio). How do we efficiently perform multiple random sampling tasks?

Solution

I have found it is convenient to create a permanent table that contains the unique record id (or account id, card number, etc.) and a uniformly distributed random number. The following is an example of creating such a table described in Take Randomly Sampled Records From the Table. dbms_random.value returns uniformly distributed random number ranging from 0 to 1.

SQL> create table tbl_emp_id_rnd as select EMPLOYEE_ID, 
dbms_random.value rnd from EMPLOYEES;

Table created.

SQL> select * from tbl_emp_id_rnd where rownum <=10;

EMPLOYEE_ID        RND
----------- ----------
        100 .466996031
        101 .325172718
        102 .643593904
        103 .822225992
        104 .657242181
        105 .244060518
        106 .446914037
        107 .423664122
        108 .033736378
        109 .405546964

10 rows selected.

We can perform many types of random sampling based on such a table whenever we need to. For example, we can split the records into 70% training and 30% testing set. And we can change the ratio of 70/30 easily.

SQL> create view v_taining_emp_id as select EMPLOYEE_ID 
from tbl_emp_id_rnd where rnd <=0.7;

View created.

SQL> create view v_testing_emp_id as select EMPLOYEE_ID 
from tbl_emp_id_rnd where rnd >0.7;

View created.

I have found such a table is extremely helpful in a project where randomly sampling is performed multiple times.

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics

Popular Topics

Thursday, November 28, 2013

More on Taking Randomly Sampled Records From the Table

Problem

Solution

No comments: