Deep Data Mining Blog: Oracle analytic functions

Showing posts with label Oracle analytic functions. Show all posts

Friday, December 14, 2012

Oracle NTILE function

Oracle Ntile function divides the records into the same number of rows (off at most by 1). Please notice that the records with same variable values (ties) may be placed into different buckets. The following query result shows that the maximum value in a bucket may overlap with the minimum value in the next bucket.

SQL> with tbl as (select ntile(5) over(order by price) nt, price from STOCK_PRICE) select nt, count(1), min(price), max(price) from tbl group by nt
order by nt;

NT COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
1 36387 9.58 13.68
2 36387 13.68 16.72
3 36386 16.72 20.87
4 36386 20.87 26.87
5 36386 26.87 89.53

We can also use function cume_dist combined with ceil function to divide the records into similar number of rows. Ties will be placed into the same buckets. The following query result shows that there is no overlap of values between different buckets.

SQL> with tbl as (select ceil(cume_dist() over(order by price nulls first)*5) nt, price from MV_VIX_15S_AVG) select nt, count(1), min(price), max(price) from tbl group by nt order by nt;

NT COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
1 36358 9.58 13.675
2 36332 13.68 16.71
3 36413 16.72 20.86
4 36429 20.87 26.865
5 36400 26.87 89.53

Wednesday, September 12, 2012

A random sampling method for medical claim audit

Random sampling processes could become very sophisticated. One of such examples is a stratified random sampling method developed by the Department of Social and Health Services (DSHS) to audit hospital Medicaid claims. This method is described in a paper "Analysis of the Regression Estimator
and its Implementation by DSHS"( http://www.hca.wa.gov/documents/rfp/recovery_audit/exhibit_b.pdf)

Hospitals provide services for patients who are covered by Medicaid and then file claims to the states to get the payment. State governments want to perform post-payment reviews to make sure that they do not overpay hospitals. Because the size of claims from a hospital could be very large, only a randomly sampled claims will be audited.

The claims of highest 25 amounts are always reviewed. For the remaining claims, the process first divides them into many buckets, or strata, based on their claim amount. Random sampling will be performed for each stratum. The higher the variance in claim amount in a stratum, the more samples will be drawn from it. Please notice that the sample size is NOT proportional to the population size in the stratum.

The most common ways to implement the above methodology are probably using SAS or SQL. I have seen the SAS version of it. I also implemented two SQL versions. One SQL version strictly follows the steps in the SAS package. Another SQL version uses Oracle analytic functions heavily that results in 75% reduction in the volume of code. These analytic functions include row_number() over(), width_bucket () , sum() over(), count() over(), etc.

Sunday, May 27, 2012

Remove Duplicates From Data

It is quite common to have duplicates in the data. From example, an application is submitted or a claim is filed multiple times. With Oracle function row_number function, we can easily remove the duplicates.

The following SQL first calculates the rank of records by app_date for each app_id, then it only selects those records with ranks equals 1. The only earliest record for each app_id will be selected.

with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE ) rnk from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;

If we want to keep the latest record for each app_id, we simply generate rank for each app_id by the descending order of app_date as shown below.

with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE desc ) rnk from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics