Deep Data Mining Blog: row

Showing posts with label row_number. Show all posts

Wednesday, September 12, 2012

A random sampling method for medical claim audit

Random sampling processes could become very sophisticated. One of such examples is a stratified random sampling method developed by the Department of Social and Health Services (DSHS) to audit hospital Medicaid claims. This method is described in a paper "Analysis of the Regression Estimator
and its Implementation by DSHS"( http://www.hca.wa.gov/documents/rfp/recovery_audit/exhibit_b.pdf)

Hospitals provide services for patients who are covered by Medicaid and then file claims to the states to get the payment. State governments want to perform post-payment reviews to make sure that they do not overpay hospitals. Because the size of claims from a hospital could be very large, only a randomly sampled claims will be audited.

The claims of highest 25 amounts are always reviewed. For the remaining claims, the process first divides them into many buckets, or strata, based on their claim amount. Random sampling will be performed for each stratum. The higher the variance in claim amount in a stratum, the more samples will be drawn from it. Please notice that the sample size is NOT proportional to the population size in the stratum.

The most common ways to implement the above methodology are probably using SAS or SQL. I have seen the SAS version of it. I also implemented two SQL versions. One SQL version strictly follows the steps in the SAS package. Another SQL version uses Oracle analytic functions heavily that results in 75% reduction in the volume of code. These analytic functions include row_number() over(), width_bucket () , sum() over(), count() over(), etc.

Wednesday, August 15, 2012

Rank in Oracle: rank() and dense_rank()

Unlike row_number() function, rank() and dense_rank() may produce duplicate ranks.

1. The data
select num from TBL_TEST order by num;
NUM
----------
1
1
2
2
2
3
4
4
5
2. rank() may have gaps.
select num, rank() over(order by num) as rnk from TBL_TEST order by num;
NUM RNK
---------- ----------
1 1
1 1
2 3
2 3
2 3
3 6
4 7
4 7
5 9
3. dense_ran() does not have gap in ranks.
select num, dense_rank() over(order by num) as rnk from TBL_TEST order by num;

NUM RNK

---------- ----------
1 1
1 1
2 2
2 2
2 2
3 3
4 4
4 4
5 5

Thursday, August 02, 2012

Rank in Oracle

There are several ways to rank records based on certain variables. One of the is to use row_number

select * from TBL_TEST order by grp, num;
GRP NUM
---------- ----------
A 1
A 2
A 2
A 3
A 4
B 1
B 2
B 4
B 5

Row_number() generates unique sequece numbers without gap. Ties are given different numbers. I use row_number() to generate unique id.

Example 1. generate rank separately for GRP
with tbl as(
select a.grp, a.num, row_number() over(partition by GRP order by num) rnk from TBL_TEST a)
select * from tbl order by grp, rnk ;
GRP NUM RNK
---------- ---------- ----------
A 1 1
A 2 2
A 2 3
A 3 4
A 4 5
B 1 1
B 2 2
B 4 3
B 5 4

Example 2. Generate rank without partition by GRP
with tbl as(
select a.grp, a.num, row_number() over(order by grp, num) rnk from TBL_TEST a)
select * from tbl order by grp, rnk ;
GRP NUM RNK
---------- ---------- ----------
A 1 1
A 2 2
A 2 3
A 3 4
A 4 5
B 1 6
B 2 7
B 4 8
B 5 9

I will discuss rank functions including dense_rank(), ran() in another post.

Tuesday, June 05, 2012

SAS proc rank and its Oracle SQL equivalent

The following SAS rank and Oracle SQL produce the same results. In SAS, when there are ties, the average rank is used. To do the same in Oracle, we calculate the unique row numbers and then its average based on group id and variable (amt).

SAS rank.
proc rank data=tbl_in out=tbl_out ties=mean descending;
by grp;
var amt;
ranks amt_rnk;
run;

Oracle SQL rank.
create table tbl_out
as with tbl as
(
select a.*,
row_number() over(partition by grp order by amt desc) amt_rnk0 from tbl_in a
) select a.*, (avg( amt_rnk0 ) over(partition by grp, amt )) amt_rnk
from tbl a;

Saturday, June 02, 2012

Five Ways of Creating Unique Record Identifier For Oracle Tables

I have used the following five ways to create a unique record identifier for an Oracle table using SQL.
1. Simply use Oracle pseudocolumn rowid
Oracle pseudocolumn rowid comes with every table. It looks like the following.
SQL> select rowid from table_test where rownum <3;
ROWID
------------------
AAAYLPAAVAAEvLTAAA
AAAYLPAAVAAEvLTAAB
2. Create a unique record id using rownum
create table table_test_new as select rownum as uniq_id, a.* from table_test a;
3. Create a unique record id using row_number()
create table table_test_new2 as select row_number() over(order by LOG_DATE) uniq_id, a.* from table_test a;
4. Create a unique record id using sequence.
create sequence rec_id_seq;
create table table_test_new3 as select rec_id_seq.nextval uniq_id, a.* from table_test a;
5. Create a unique record id using sys_guid()
According to Oracle document, SYS_GUID generates and returns a globally unique identifier (RAW value) made up of 16 bytes.
create table table_test_new4 as select sys_guid() uniq_id, a.* from table_test a;
What are other ways to create a unique record id that I am missing?

Sunday, May 27, 2012

Calculate Gain Chart Using SQL in Oracle

The following script calculates the gain chart based on score and payment_ind (target variable, payment or non payment). The mod function is to reduce the number of data points in the gain chart

with tbl2 as
(select a.*, row_number() over(order by score desc ) risk_rank from tbl_testing_data a),
tbl3 as
( select a.*,
count(1) over(order by risk_rank) total_num,
sum(case when payment_ind='Y' then 1 else 0 end ) over(order by risk_rank) total_bad,
count(1) over(order by risk_rank)/count(1) over() pcnt_tot,
sum(case when payment_ind='Y' then 1 else 0 end ) over(order by risk_rank)/ sum(case when payment_ind='Y' then 1 else 0 end ) over()
pcnt_bad
from tbl2 a
)
select * from tbl3 where mod(risk_rank,100)=1 order by risk_rank;

Remove Duplicates From Data

It is quite common to have duplicates in the data. From example, an application is submitted or a claim is filed multiple times. With Oracle function row_number function, we can easily remove the duplicates.

The following SQL first calculates the rank of records by app_date for each app_id, then it only selects those records with ranks equals 1. The only earliest record for each app_id will be selected.

with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE ) rnk from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;

If we want to keep the latest record for each app_id, we simply generate rank for each app_id by the descending order of app_date as shown below.

with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE desc ) rnk from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics