Showing posts with label Oracle analytic functions. Show all posts
Showing posts with label Oracle analytic functions. Show all posts

Friday, December 14, 2012

Oracle NTILE function

Oracle Ntile function divides the records into the same number of rows (off at most by 1). Please notice that the records with same variable values (ties) may be placed into different buckets. The following query result shows that the maximum value in a bucket may overlap with the minimum value in the next bucket.

SQL> with tbl as (select ntile(5) over(order by price) nt, price from STOCK_PRICE) select nt, count(1), min(price), max(price) from tbl group by nt
 order by nt;

        NT   COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
         1      36387       9.58      13.68
         2      36387      13.68      16.72
         3      36386      16.72      20.87
         4      36386      20.87      26.87
         5      36386      26.87      89.53

We can also use function cume_dist combined with ceil function to divide the records into  similar number of rows. Ties will be placed into the same buckets. The following query result shows that there is no overlap of values between different buckets.

SQL> with tbl as (select ceil(cume_dist() over(order by price nulls first)*5)  nt, price from MV_VIX_15S_AVG) select nt, count(1), min(price), max(price) from tbl group by nt order by nt;

        NT   COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
         1      36358       9.58     13.675
         2      36332      13.68      16.71
         3      36413      16.72      20.86
         4      36429      20.87     26.865
         5      36400      26.87      89.53

Wednesday, September 12, 2012

A random sampling method for medical claim audit

Random sampling processes could become very sophisticated. One of such examples is a stratified random sampling method developed by the Department of Social and Health Services (DSHS) to audit hospital Medicaid claims. This method is described in a paper "Analysis of the Regression Estimator
and its Implementation by DSHS"( http://www.hca.wa.gov/documents/rfp/recovery_audit/exhibit_b.pdf)

Hospitals provide services for patients who are covered by Medicaid and then file claims to the states to get the payment. State governments want to perform post-payment reviews to make sure that they do not overpay hospitals. Because the size of claims from a hospital could be very large, only a randomly sampled claims will be audited.

The claims of highest 25 amounts are always reviewed. For the remaining claims, the process first divides them into many buckets, or strata, based on their claim amount. Random sampling will be performed for each stratum. The higher the variance in claim amount in a stratum, the more samples will be drawn from it. Please notice that the sample size is NOT proportional to the population size in the stratum.

The most common ways to implement the above methodology are probably using SAS or SQL. I have seen the SAS version of it. I also implemented two SQL versions. One SQL version strictly follows the steps in the SAS package. Another SQL version uses Oracle analytic functions heavily that results in 75% reduction in the volume of code. These analytic functions include row_number() over(), width_bucket () , sum() over(), count() over(), etc.

Sunday, May 27, 2012

Remove Duplicates From Data


It is quite common to have duplicates in the data. From example, an application is submitted or a claim is filed  multiple times. With Oracle function row_number function, we can easily remove the duplicates.

The following SQL first calculates the rank of records by app_date for each app_id, then it only selects those records with ranks equals 1. The only earliest record for each app_id will be selected.

with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE ) rnk  from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;

If we want to keep the latest record for each app_id, we simply generate rank for each app_id by the descending order of app_date as shown below.


with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE desc ) rnk  from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;