Deep Data Mining Blog: August 2013

Thursday, August 29, 2013

Generate HTML Web Pages from SQL Query Outputs

Sometime we want to put our SQL query results on web pages. With sqlplus option "set markup html on", this can be done easily. The following query returns plain text output.

Now we turn on the html option and run the same query. The output of the select query becomes a html document.

I copy the html document into this post, change width to 40% and get the following table as you see.

ITEM	NAME
Apple	John
Banana	John
Banana	John
Apple	John
Apple	John
Pear	John
Pear	John
Peach	John
Peach	John
Peach	John

We can turn off the option by running "set markup html off",

Thursday, August 08, 2013

Calculate Statistical Mode- the Most Frequent Value

Statistical mode is the value that happens most often (as shown in the chart below). For discrete variables, mode can be calculated using the same approach describe in post Find the most frequent items using SQL. We simply select the top 1 most frequent item.

For continuous variables, we can use the function width_bucket() to divide the data into segments and calculate the frequency in each segment. We can then pick the average value of the data points from the most frequent segment as the mode. We use the 1,000 random numbers described in the post Generate 1000 Normally Distributed Numbers. The following query calculate the frequency and average value for each of the segment defined by width_bucket(variable, min_value, max_value, number_of_buckets). Buckets 0 and 12 are for values lower than -3 and higher than 3, respectively.

SQL> select s, count(1), avg(num) from (select width_bucket(num, -3,3, 11) s, num from TBL_1K_RND) group by s order by s;

S COUNT(1) AVG(NUM)

---------- ---------- ----------

0 2 -3.0538402

1 5 -2.6387896

2 29 -2.1748723

3 58 -1.6073189

4 115 -1.0457551

5 185 -.53460548

6 208 .010066623

7 194 .525553927

8 119 1.05966651

9 47 1.5985678

10 32 2.14458861

11 4 2.67165765

12 2 3.06640121

Again, we use row_number() function to generate rank based on frequency and calculate the mode as follows.

with tbl_1 as
(
select s, count(1) cnt, avg(num) num from (select width_bucket(num, -3,3, 11) s, num from TBL_1K_RND) group by s order by s),
tbl_2 as
(
select a.*, row_number() over(order by cnt desc) rnk from tbl_1 a
)
select num from tbl_2 where rnk=1;

NUM
----------

.010066623

Generate 1000 Normally Distributed Numbers

In the earlier post Oracle Normal Distribution Function, we show that Oracle function dbms_random.normal generate normally distributed numbers with zero mean and 1 standard deviation. The following PL/SQL scripts generate 1,000 random number using the function.

create table tbl_1k_rnd (num number);

begin
for i in 1..1000
loop
insert into tbl_1k_rnd select dbms_random.normal from dual;
end loop;
commit;
end;

/

The following query confirms that their mean and standard deviation are indeed 0 and 1, respectively.

SQL> select avg(num), stddev(num), count(*) from TBL_1K_RND;

AVG(NUM) STDDEV(NUM) COUNT(*)
---------- ----------- ----------
-.00403068 1.02280961 1000

Wednesday, August 07, 2013

Find the most frequent items using SQL (continued)

In the earlier post Find the most frequent items using SQL, we use "group by" query to calculate the frequency of items and then functions row_number() or dense_rank() to generate rank based on the frequency to pick the top N items. The difference between row_number() or dense_rank() is that row_number() generate unique rank for each row even if there are ties.

Now, we want to solve a more sophisticate problem. We want to select the top 2 most frequent items in table tbl_items_by_person (shown below) for each person, Emily and John (instead of global top 2 most frequent items).

Table tbl_items_by_person

The first step is to calculate the item frequency "group by" name and item.

SQL> select name, item, count(*) as num from TBL_ITEMS_BY_PERSON group by name, item order by name, count(*) desc;

NAME ITEM NUM

-------------------------------- -------------------------------- ----------

Emily Peach 8

Emily Apple 4

Emily Banana 3

Emily Pear 2

John Orange 5

John Apple 3

John Peach 3

John Banana 2

John Pear 2

The second step is to use row_number() or dense_rank() to generate rank for each name separately. This is done by using the clause over(partition by name order by num desc).

SQL> select name, item, num, row_number() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from TBL_ITEMS_BY_

PERSON group by name, item ) order by name, num desc;

NAME ITEM NUM RNK

-------------------------------- -------------------------------- ---------- ----------

Emily Peach 8 1

Emily Apple 4 2

Emily Banana 3 3

Emily Pear 2 4

John Orange 5 1

John Apple 3 2

John Peach 3 3

John Banana 2 4

John Pear 2 5

We can pick the top 2 frequent items for Emily and John using the following query.

SQL> with tbl as (select name, item, num, row_number() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from

TBL_ITEMS_BY_PERSON group by name, item )) select * from tbl where rnk<=2 order by name, num desc;

NAME ITEM NUM RNK

-------------------------------- -------------------------------- ---------- ----------

Emily Peach 8 1

Emily Apple 4 2

John Orange 5 1

John Peach 3 2

Or, we can use dense_rank() to generate rank which will assign the same rank for ties.

SQL> select name, item, num, dense_rank() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from TBL_ITEMS_BY_

PERSON group by name, item ) order by name, num desc;

NAME ITEM NUM RNK

-------------------------------- -------------------------------- ---------- ----------

Emily Peach 8 1

Emily Apple 4 2

Emily Banana 3 3

Emily Pear 2 4

John Orange 5 1

John Apple 3 2

John Peach 3 2

John Banana 2 3

John Pear 2 3

For John, there are 3 peaches and Apples. Both of them are ranked as 2. Thus they are both selected as shown below.

SQL> with tbl as (select name, item, num, dense_rank() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from

TBL_ITEMS_BY_PERSON group by name, item )) select * from tbl where rnk<=2 order by name, num desc;

NAME ITEM NUM RNK

-------------------------------- -------------------------------- ---------- ----------

Emily Peach 8 1

Emily Apple 4 2

John Orange 5 1

John Peach 3 2

John Apple 3 2

Find the most frequent items using SQL

It is a common task to find the most frequent items. For example, we want the top 2 frequently purchased items from the following table.
SQL> select * from TBL_ITEMS_PURCHASED;

ITEM

Apple

Banana

Apple

Pear

Peach

Orange

We can run a "group by" query to calculate the frequency of items as shown below.
SQL> select item, count(*) as num from tbl_items_purchased group by item order by num;

ITEM	NUM
Orange	5
Apple	3
Peach	3
Banana	2
Pear	2

Now the tricky part is to pick the top 2 most frequent ones. There are two situations.

1. Pick precisely two items and ignore the tie. We can use row_number() function to generate rank based on the frequency and pick the top 2. See the following two queries.
SQL> select a.*, row_number() over(order by num desc) rnk from (select item, count(*) as num from tbl_items_purchased group by item ) a order by rnk;

ITEM	NUM	RANK
Orange	5	1
Apple	3	2
Peach	3	3
Banana	2	4
Pear	2	5

SQL> with tbl as (select a.*, row_number() over(order by num desc) rnk from (select item, count(*) as num from tbl_items_purchased group by item ) a ) select * from tbl where rnk <=2;

ITEM	NUM	RANK
Orange	5	1
Apple	3	2

2.Pick items with the top 2 frequencies, regardless how many different items. In this case, we use dense_rank() function to generate rank based on the frequency. See the following two queries.
SQL> select a.*, dense_rank() over(order by num desc) rnk from (select item, count(*) as num from tbl_items_purchased group by item ) a order by rnk;

ITEM	NUM	RANK
Orange	5	1
Apple	3	2
Peach	3	2
Banana	2	3
Pear	2	3

SQL> with tbl as (select a.*, dense_rank() over(order by num desc) rnk from (select item, count(*) as num from tbl_items_purchased group by item ) a ) select * from tbl where rnk<=2;

ITEM	NUM	RANK
Orange	5	1
Apple	3	2
Peach	3	2

Tuesday, August 06, 2013

Produce Analytic Reports Using Oracle Apex vs. Excel Spreadsheet

For those who know how to write SQL scripts, Oracle Apex is a great tool to present the analytic result online. There is no need to learn another web authoring language. Almost everything is done using SQL, reports, charts, Checkboxes, radioes, select lists. Once I spent a few days and built an internal employee fraud monitoring tool for a bank.

I have found that building report online using Apex is far more efficient than using Excel spreadsheet.

Creating Spreadsheet involves manual editing which is error prone. With Apex, reports or charts are simply SQL queries.
It is harder to distribute spreadsheet files. Typically, they are distributed through emails as attachments. To keep them up to date is a problem. Any changes in the report will require the spreadsheet files to be resent and it may cause a lot of confusions for the receiver.With Apex, all we need is a URL link to the report.
Most importantly, reports in the form of spreadsheet have to be "fixed" which means we have to create them in advance and they become unchanged once created. With Oracle apex, reports are generated on-the-fly based on the users' inputs. In database terms, reports are simply SQL queries that are executed in real time. Thus, the number of reports that can be produced by a Apex page is huge(for example, our reports are based on the users' selection of date ranges, States,product types, etc.). It is hard to produce and navigate a spreadsheet file that has more than, say 100, tabs.

The following screenshot is from one of the reports that I built using Oracle Apex.

Friday, August 02, 2013

My Check Fraud Detection Model Reduces Fraud by 60%

A project manager from a bank just told me that the model developed by me has resulted in 60% reduction in fraud loss. He said "everyone was surprised at how effective it was." It was a model that I developed two years ago for a top 15 bank.(Of course, I was not surprised at all since the model was tested on holdout data sets and showed similar performance).

The model was built and deployed on Oracle databases. The main reason for the success of the model was that I spent huge amount of effort building model variables that captures the fraud patterns. Luckily, I used Oracle analytic functions to build those variables easily. (Please see my posts "How to build predictive models that win competitions" and "Recency, Frequency, Monetary (RFM) Analysis: Part 1"

Another important advantage of using an uniform platform, i.e., Oracle databases, is that the deployment is easy. I simply deployed the model as a set of SQL query. See my posts "Build Predictive Models Using PL/SQL" and "Logistic Regression Model Implemented in SQL"

Thursday, August 01, 2013

Get the source code for Oracle database views and materialized view

Oracle user_views/user_mviews contains source code about the view/materialized views.
SQL> select text from user_views where view_name='V_R_SUM';
TEXT
--------------------------------------------------------------------------------

select to_char(dt_tm, 'YYYY') yy, to_char(DT_TM, 'MM') mm, to_char(DT_TM,'DD') dd, to_char(DT_TM, 'HH24') hh, to_char(DT_TM, 'MI') mi, to_char(DT_TM,'SS') ss, min(price) mi_p, avg(price) a_p, max(price) mx_p from MV_VIX3 group by to_char(dt_tm, 'YYYY'),to_char(DT_TM, 'MM'),to_char(DT_TM,'DD'),to_char(DT_TM, 'HH24'),to_char(DT_TM, 'MI'), to_char(DT_TM,'SS')

SQL> select query from user_mviews where mview_name='MV_ALL_FOR_SUM';
QUERY
--------------------------------------------------------------------------------
select
a.TXN_ROWID,
a.TRN_CUSTOMER_ACCOUNT_NUM,
a.TRN_TRANSACTION_DT daily_txn_dt,
a.TRN_TRANSACTION_AMT,
a.STAR_ROWID,
a.CUSTOMER_ID_NUMBER,
a.STAR_TRAN_DATETIME,
b.TRANSACTION_AMT frd_clm_amt, b.TRANSACTION_DT frd_clm_dt, b.clm_rowid,
c.fee, c.od_rowid
from
tbl_m_txn_star_final a,
tbl_m_txn_star_clm b,
mv_m_txn_star_od c
where a.txn_rowid=b.txn_rowid(+) and
a.txn_rowid=c.txn_rowid(+)

It is very helpful as we can always find out the logic that is used to created the view. This is one of the advantage of using views over tables.

Get source code for Oracle database objects: user_source view

We can query Oracle user_source to get the source code about functions, procedures, triggers, etc. User_source contains 4 columns, i.e., name, type, line and text.

On my schema, the following query shows that there are 55 lines of source code for functions, 50 lines of source code for procedures, etc.

SQL> select type, count(*) from user_source group by type order by type;
TYPE COUNT(*)
------------ ----------
FUNCTION 55
PACKAGE 31
PACKAGE BODY 469
PROCEDURE 50

The following query shows the definition of a function z_cap. Thus, user_source is a very handy way to retrieve the source code.

SQL> select line, text from user_source where name='Z_CAP';
LINE TEXT
---------- --------------------------------------------------------------------------------
1 FUNCTION z_cap (
2 p_data number,
3 low number,
4 hi number
5 )
6 RETURN NUMBER IS
7 v_number NUMBER;
8 BEGIN
9 if p_data 10 elsif p_data>hi then v_number:= hi;
11 else v_number:= p_data;
12 end if;
13 return v_number;
14 END z_cap;

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics

Popular Topics

Thursday, August 29, 2013

Generate HTML Web Pages from SQL Query Outputs

Thursday, August 08, 2013

Calculate Statistical Mode- the Most Frequent Value

Generate 1000 Normally Distributed Numbers

Wednesday, August 07, 2013

Find the most frequent items using SQL (continued)

Find the most frequent items using SQL

Tuesday, August 06, 2013

Produce Analytic Reports Using Oracle Apex vs. Excel Spreadsheet

Friday, August 02, 2013

My Check Fraud Detection Model Reduces Fraud by 60%

Thursday, August 01, 2013

Get the source code for Oracle database views and materialized view

Get source code for Oracle database objects: user_source view