Oracle Ntile function divides the records into the same number of rows (off at most by 1). Please notice that the records with same variable values (ties) may be placed into different buckets. The following query result shows that the maximum value in a bucket may overlap with the minimum value in the next bucket.
SQL> with tbl as (select ntile(5) over(order by price) nt, price from STOCK_PRICE) select nt, count(1), min(price), max(price) from tbl group by nt
order by nt;
NT COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
1 36387 9.58 13.68
2 36387 13.68 16.72
3 36386 16.72 20.87
4 36386 20.87 26.87
5 36386 26.87 89.53
We can also use function cume_dist combined with ceil function to divide the records into similar number of rows. Ties will be placed into the same buckets. The following query result shows that there is no overlap of values between different buckets.
SQL> with tbl as (select ceil(cume_dist() over(order by price nulls first)*5) nt, price from MV_VIX_15S_AVG) select nt, count(1), min(price), max(price) from tbl group by nt order by nt;
NT COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
1 36358 9.58 13.675
2 36332 13.68 16.71
3 36413 16.72 20.86
4 36429 20.87 26.865
5 36400 26.87 89.53
Popular Topics
Popular Topics
Showing posts with label Oracle analytic functions. Show all posts
Showing posts with label Oracle analytic functions. Show all posts
Friday, December 14, 2012
Wednesday, September 12, 2012
A random sampling method for medical claim audit
Random sampling processes could become very sophisticated. One of such examples is a stratified random sampling method developed by the Department of Social and Health Services (DSHS) to audit hospital Medicaid claims. This method is described in a paper "Analysis of the Regression Estimator
and its Implementation by DSHS"( http://www.hca.wa.gov/documents/rfp/recovery_audit/exhibit_b.pdf)
Hospitals provide services for patients who are covered by Medicaid and then file claims to the states to get the payment. State governments want to perform post-payment reviews to make sure that they do not overpay hospitals. Because the size of claims from a hospital could be very large, only a randomly sampled claims will be audited.
The claims of highest 25 amounts are always reviewed. For the remaining claims, the process first divides them into many buckets, or strata, based on their claim amount. Random sampling will be performed for each stratum. The higher the variance in claim amount in a stratum, the more samples will be drawn from it. Please notice that the sample size is NOT proportional to the population size in the stratum.
The most common ways to implement the above methodology are probably using SAS or SQL. I have seen the SAS version of it. I also implemented two SQL versions. One SQL version strictly follows the steps in the SAS package. Another SQL version uses Oracle analytic functions heavily that results in 75% reduction in the volume of code. These analytic functions include row_number() over(), width_bucket () , sum() over(), count() over(), etc.
and its Implementation by DSHS"( http://www.hca.wa.gov/documents/rfp/recovery_audit/exhibit_b.pdf)
Hospitals provide services for patients who are covered by Medicaid and then file claims to the states to get the payment. State governments want to perform post-payment reviews to make sure that they do not overpay hospitals. Because the size of claims from a hospital could be very large, only a randomly sampled claims will be audited.
The claims of highest 25 amounts are always reviewed. For the remaining claims, the process first divides them into many buckets, or strata, based on their claim amount. Random sampling will be performed for each stratum. The higher the variance in claim amount in a stratum, the more samples will be drawn from it. Please notice that the sample size is NOT proportional to the population size in the stratum.
The most common ways to implement the above methodology are probably using SAS or SQL. I have seen the SAS version of it. I also implemented two SQL versions. One SQL version strictly follows the steps in the SAS package. Another SQL version uses Oracle analytic functions heavily that results in 75% reduction in the volume of code. These analytic functions include row_number() over(), width_bucket () , sum() over(), count() over(), etc.
Sunday, May 27, 2012
Remove Duplicates From Data
It is quite common to have duplicates in the data. From example, an application is submitted or a claim is filed multiple times. With Oracle function row_number function, we can easily remove the duplicates.
The following SQL first calculates the rank of records by app_date for each app_id, then it only selects those records with ranks equals 1. The only earliest record for each app_id will be selected.
with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE ) rnk from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;
If we want to keep the latest record for each app_id, we simply generate rank for each app_id by the descending order of app_date as shown below.
with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE desc ) rnk from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;
Subscribe to:
Posts (Atom)