Deep Data Mining Blog: random sampling

Showing posts with label random sampling. Show all posts

Wednesday, September 22, 2021

Online Course: Oracle SQL for Random Sampling

Since many companies store their critical business data in Oracle databases, it is advantageous to perform random sampling within the same environment using SQL without data movement. For example, it is time-consuming to pull a large data set out of a database and do random sampling using Python on a laptop computer. In addition, the data are prone to various security issues once they are not protected by the database.

A Competition-winning data scientist and long-time Oracle SQL practitioner Dr. Jay Zhou creates an online course and shares his expertise in performing random sampling using Oracle SQL. Students will learn practical skills that can be applied immediately in their work. There were hundreds of people from 85 countries who took the course.

The course begins with a description of scenarios where random sampling is necessary. A number of useful Oracle SQL random functions are introduced. The course uses examples and presents SQL scripts to perform the following common tasks.

How to quickly view random samples of the data. There are multiple ways to do this task.
How to select a precise number of samples randomly.
How to split data randomly. This is a necessary task when we build a machine learning model and need to produce three data sets, i.e., training, testing, and validation sets.
How to select random samples by groups. For example, we want to randomly select 100 students, 50 of them female and 50 male, from a school.

Are there any course requirements or prerequisites?

Very basic Oracle SQL knowledge

Who this course is for?

SQL developers, data analysts, data scientists, statisticians

Please take the course here.

Thursday, October 20, 2016

Oracle Ora_Hash function- Part 1 Random Sampling

Oracle ora_hash() is a very useful function. I have used it for different purposes such as generating random number. The following query generate 5 buckets from 0 to 4, each of them have the similar number of records.
First, we create a table and populate it with 1,000 numbers.

create table t_n (n number);

begin
for i in 1..1000
 loop 
 insert into t_n values(i);
 end loop;
 commit;
end;

In the query below, the parameter 5 of ora_hash defines the number of buckets is 5. As we see, each bucket has simlilar number of records.

with tbl as
(
select ora_hash(n, 5) bucket, n  from t_n)
select bucket, count(*), min(n), max(n) from tbl
group by bucket order by 1;
BBUCKET COUNT(*) MIN(N) MAX(N)
0 154          2  993
1 164          7  999
2 168          6  991
3 175          4  995
4 173          8 1000

Tuesday, December 11, 2012

More on Data Preparation for Logistic Regression Models

In the earlier post Data preparation for building logistic regression models , we talked about converting data into more compact format so that we can use memory intensive software like R to handle large number of cases.

Another issue that we commonly face is how to deal with "unbalanced" data. The following are some examples:
1. Only 7 to 12 fraudulent ones out of 10,000 transactions.
2. Online ads has only 1 to 3 clicks per 10,000 impressions.
3. Mobile phone account non-payment rate is 12%.
4. It is assumed the 6% of medical claims are fraudulent.

For the sake of reducing data volume without sacrificing model accuracy, it makes sense to reduce the cases in the major classes (good card transactions, non-click impressions, good mobile phone accounts, normal medical claims) through random sampling. For example, to gather 1,000 fraudulent bank card transactions within certain period, there will be 10 million good transactions from the same period. We can use the approach described in an earlier post More on random sampling in Oracle. However, the final model training data sets do NOT have to be perfectly balanced, i.e., fraud and good transactions do NOT have to be 50% and 50%. For example, it is OK that the model training data set has 80% good transactions and 20% fraudulent ones. For example, we can build two logistic regression models:

Model A. training set: 50% good, 50% fraud
Model B. training set: 80% good, 20% fraud

Model A and B will produce different probabilities of being fraud for an individual transaction(Probability of being fraud produced by Model A will most likely be higher than that produced by Model B). However, the relative ranks of transactions' probabilities of being fraud for a whole data set given by both models could be the same. For example, the top 1% riskiest transactions identified by both models are the same. Thus both models are equivalent. As we can see in many applications what matters is the relative ranks (of being fraud, being non-payment, etc.) produced by the predictive models. The quality/usefulness of relative ranks can be depicted by gain charts.

Wednesday, October 03, 2012

More on random sampling in Oracle

The earlier post Random Sampling using SQL uses dbms_random.random to generate random numbers. A better function is dbms_random.value which generates uniformly distributed random number greater than or equal to 0 and less than 1.
We can rewrite random sampling scripts as the following.
Step 1 . Create random numbers.
create table tbl_app_id_w_rnd as select app_id, dbms_random.value rnd from CELL_PHONE_SERVICE_APPS;
Step 2. Create a view that contains 30% of the records.
create view v_data_testing as select a.* from CELL_PHONE_SERVICE_APPS a, tbl_app_id_w_rnd b where a.app_id=b.app_id and rnd <=0.3;

Wednesday, September 12, 2012

A random sampling method for medical claim audit

Random sampling processes could become very sophisticated. One of such examples is a stratified random sampling method developed by the Department of Social and Health Services (DSHS) to audit hospital Medicaid claims. This method is described in a paper "Analysis of the Regression Estimator
and its Implementation by DSHS"( http://www.hca.wa.gov/documents/rfp/recovery_audit/exhibit_b.pdf)

Hospitals provide services for patients who are covered by Medicaid and then file claims to the states to get the payment. State governments want to perform post-payment reviews to make sure that they do not overpay hospitals. Because the size of claims from a hospital could be very large, only a randomly sampled claims will be audited.

The claims of highest 25 amounts are always reviewed. For the remaining claims, the process first divides them into many buckets, or strata, based on their claim amount. Random sampling will be performed for each stratum. The higher the variance in claim amount in a stratum, the more samples will be drawn from it. Please notice that the sample size is NOT proportional to the population size in the stratum.

The most common ways to implement the above methodology are probably using SAS or SQL. I have seen the SAS version of it. I also implemented two SQL versions. One SQL version strictly follows the steps in the SAS package. Another SQL version uses Oracle analytic functions heavily that results in 75% reduction in the volume of code. These analytic functions include row_number() over(), width_bucket () , sum() over(), count() over(), etc.

Friday, August 31, 2012

Random Sampling in Oracle using Sample Clause

In an earlier post about "Random Sampling using SQL", we use functions mode and dbms_random.random to randomly sample 30% of records. This can also be done easily with oracle sample clause as the following:
select * from TBL_TEST sample(30);
Sample clause is also a great way to test the performance of a query before we run it on large tables.

Wednesday, August 29, 2012

Random Sampling using SQL

The following SQL scripts create a view by taking 30% random samples from a table.

Step 1 . Create random numbers within range of 0 and 99 (100 bins).
create table tbl_app_id_w_rnd as select app_id,
mod(abs(dbms_random.random()),100) rnd from CELL_PHONE_SERVICE_APPS;

Step 2. Create a view that contains 30% of the records.
create view v_data_testing
as select a.* from CELL_PHONE_SERVICE_APPS a,
tbl_app_id_w_rnd b where a.app_id=b.app_id and rnd < 30;

Please see a newer post. dbms_random.value is a better function than dbms_random.random.

Monday, May 28, 2012

Randomly Sample a Precise Number of Records

With Oracle dbms_random.random and row_number function, we can easily randomly select a certain number of records from the data.

The following scripts select 50 records randomly from each group (strata). First, a number number is generate for each record. Then a rank is generated by row_number function based on the random number for each strata. Finally, only records ranked top 50 in each strata are selected. The rank of random number is also random.

create table tbl_id_w_rnd as select id, strata, dbms_random.random() rnd from tbl_data;

create materialized view mv_precise_50_samples as
With tbl as
(
Select a.*, row_number() over(partition by strata order by rnd) rnk from tbl_id_w_rnd a
)
Select * from tbl where rnk<=50;
--test
select strata, count(1) from mv_precise_50_samples group by strata order by strata;

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics