Tuesday, October 01, 2013

A Quick Way to Import Spreadsheet Into a Relational Database

I just made a YouTube video to show readers how to do this. We often need to import data from an Excel spreadsheet, such as the one shown below, into a database.

A quick way that works for any database is to generate SQL insert statements for those rows using a formula similar to the following, as shown in column C of the picture below.

Formula: =CONCATENATE("insert into tbl_dataset values(",A2,",","'",B2,"');")

We create a destination table. Then we copy those "insert into" statements from the Excel spreadsheet and paste them into a SQL client tool such as SQL Developer or SQL*Plus to run them. Do not forget to commit the inserts. This approach has two advantages:

  1. It works for any relational database, as the insert statements are standard SQL (and if not, we can adjust the spreadsheet formula slightly).
  2. It does not require any data import tools. All we need are the Excel spreadsheet and a SQL client to run the create table and insert statements.
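The same statement generation can be scripted outside Excel. Below is a minimal Python sketch (not from the original post; table name and layout follow the tbl_dataset example) that builds the insert statements for a list of rows. Note that it doubles single quotes, so a value like O'Hara does not break the generated SQL, something the simple CONCATENATE formula does not handle.

```python
# Illustrative sketch: generate INSERT statements for tbl_dataset rows.
rows = [(1, "A"), (2, "B"), (3, "O'Hara")]

def make_insert(col1, col2):
    # Double any single quotes so the generated SQL stays valid.
    escaped = col2.replace("'", "''")
    return f"insert into tbl_dataset values({col1},'{escaped}');"

statements = [make_insert(a, b) for a, b in rows]
for s in statements:
    print(s)
```

The output can be pasted into a SQL client exactly like the spreadsheet-generated statements.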

SQL> create table tbl_dataset (col1 number, col2 varchar2(8));
Table created.

Run the following insert statements. If there are many lines, we can put them in a script file and run the script file.

insert into tbl_dataset values(1,'A');
insert into tbl_dataset values(2,'B');
insert into tbl_dataset values(3,'C');
insert into tbl_dataset values(4,'D');
insert into tbl_dataset values(5,'E');
insert into tbl_dataset values(6,'F');
insert into tbl_dataset values(7,'G');
insert into tbl_dataset values(8,'H');
insert into tbl_dataset values(9,'I');

SQL> insert into tbl_dataset values(1,'A');
1 row created.
SQL> insert into tbl_dataset values(2,'B');
1 row created.
SQL> insert into tbl_dataset values(3,'C');
1 row created.
SQL> insert into tbl_dataset values(4,'D');
1 row created.
SQL> insert into tbl_dataset values(5,'E');
1 row created.
SQL> insert into tbl_dataset values(6,'F');
1 row created.
SQL> insert into tbl_dataset values(7,'G');
1 row created.
SQL> insert into tbl_dataset values(8,'H');
1 row created.
SQL> insert into tbl_dataset values(9,'I');
1 row created.

Do not forget to commit the changes.

SQL> commit;
Commit complete.

The data are now imported into the database.

SQL> select * from tbl_dataset;

COL1 COL2
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I

9 rows selected.

Saturday, September 28, 2013

Generate SQL Create Table/View Queries for Existing Tables/Views

Sometimes, it is useful to keep a copy of the DDL statements (i.e., create table/view) for all or some of the tables/views so that we can recreate them. The function dbms_metadata.get_ddl() can be used here.

SQL> select dbms_metadata.get_ddl('VIEW','V_6K_OBS') from dual;

DBMS_METADATA.GET_DDL('VIEW','V_6K_OBS')
CREATE OR REPLACE FORCE VIEW "BDM"."V_6K_OBS" ("TABLE_NAME", "COLUMN_NAME", "DATA_TYPE", "DATA_TYPE_MOD", "DATA_TYPE_OWNER", "DATA_LENGTH", "DATA_PRECISION", "DATA_SCALE", "NULLABLE", "COLUMN_ID", "DEFAULT_LENGTH", "DATA_DEFAULT", "NUM_DISTINCT", "LOW_VALUE", "HIGH_VALUE", "DENSITY", "NUM_NULLS", "NUM_BUCKETS", "LAST_ANALYZED", "SAMPLE_SIZE", "CHARACTER_SET_NAME", "CHAR_COL_DECL_LENGTH", "GLOBAL_STATS", "USER_STATS", "AVG_COL_LEN", "CHAR_LENGTH", "CHAR_USED", "V80_FMT_IMAGE", "DATA_UPGRADED", "HISTOGRAM") AS select "TABLE_NAME","COLUMN_NAME","DATA_TYPE","DATA_TYPE_MOD","DATA_TYPE_OWNER","DATA_LENGTH","DATA_PRECISION","DATA_SCALE","NULLABLE","COLUMN_ID","DEFAULT_LENGTH","DATA_DEFAULT","NUM_DISTINCT","LOW_VALUE","HIGH_VALUE","DENSITY","NUM_NULLS","NUM_BUCKETS","LAST_ANALYZED","SAMPLE_SIZE","CHARACTER_SET_NAME","CHAR_COL_DECL_LENGTH","GLOBAL_STATS","USER_STATS","AVG_COL_LEN","CHAR_LENGTH","CHAR_USED","V80_FMT_IMAGE","DATA_UPGRADED","HISTOGRAM" from user_tab_columns where rownum <=6000

SQL> select dbms_metadata.get_ddl('TABLE','MV_UNIVAR_STS') from dual;

DBMS_METADATA.GET_DDL('TABLE','MV_UNIVAR_STS')
CREATE TABLE "BDM"."MV_UNIVAR_STS" ( "FILENAME" VARCHAR2(32), "C" NUMBER, "TOT" NUMBER, "TOT_DIS" NUMBER, "MI_VAL" VARCHAR2(512), "MX_VAL" VARCHAR2(512) ) SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "BDM"

Generate SQL Queries "Automatically"

We do not have to write every SQL query manually. It is very efficient to generate SQL statements "automatically" using queries. Of course, we can also use SQL queries to generate statements in other programming languages such as C/C++. This was what I did when I took a Computer Software Engineering course at university: in a number of projects, I used SQL queries to automatically generate a large quantity of neatly formatted C++ code for many object classes. I earned a good grade.

For example, the following query generates a set of queries that calculate the record count and average value for NUMBER columns whose names end in "AMT1", in tables whose names begin with "TBL".

SQL> select 'select '||''''||table_name||''''||','||''''||column_name||''''||', count(*), avg('||column_name||') avg_value from '|| table_name||';' from user_tab_columns where table_name like 'TBL%' and data_type = 'NUMBER' and column_name like '%AMT1';

select 'TBL_FRAUDDETAIL_HIST','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_FRAUDDETAIL_HIST;
select 'TBL_MATCHED_SO_FAR1124','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_MATCHED_SO_FAR1124;
select 'TBL_MATCHED_SO_FAR1201','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_MATCHED_SO_FAR1201;
select 'TBL_MATCHED_SO_FAR1226','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_MATCHED_SO_FAR1226;
select 'TBL_TXN_4_POC1','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_TXN_4_POC1;
select 'TBL_TXN_4_POC2','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_TXN_4_POC2;
select 'TBL_TXN_4_POC3','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_TXN_4_POC3;
select 'TBL_TXN_4_POC4','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_TXN_4_POC4;
select 'TBL_TXN_4_POC5','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_TXN_4_POC5;
select 'TBL_TXN_4_POC6','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_TXN_4_POC6;
select 'TBL_TXN_FOR_POC_EXT','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_TXN_FOR_POC_EXT;
select 'TBL_TXN_FOR_POC_EXT2','MD_TRAN_AMT1', count(*), avg(MD_TRAN_AMT1) avg_value from TBL_TXN_FOR_POC_EXT2;

12 rows selected.
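The same "SQL that writes SQL" trick works in any database that exposes a catalog. As an illustration, here is a Python/SQLite sketch (the table names, columns, and the tbl% filter are invented for this example) that generates count/average queries from SQLite's catalog, mirroring the user_tab_columns query above.

```python
import sqlite3

# Build a toy schema to read the catalog from.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table tbl_a (amt1 real, name text);
create table tbl_b (amt1 real);
insert into tbl_a values (10, 'x'), (20, 'y');
insert into tbl_b values (5);
""")

generated = []
for (tbl,) in conn.execute(
        "select name from sqlite_master where type='table' and name like 'tbl%'"):
    # pragma table_info lists one row per column; keep the numeric columns.
    for cid, name, ctype, *_ in conn.execute(f"pragma table_info({tbl})"):
        if ctype.upper() == "REAL":
            generated.append(
                f"select '{tbl}','{name}', count(*), avg({name}) avg_value from {tbl};")

for q in generated:
    print(q)
```

As in the Oracle example, the generated statements can then be run as a script.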

The following query generates queries that count the distinct values of all character-type columns in tables whose names start with "DEMO".

SQL> select 'select '||''''||table_name||''''||','||''''||column_name||''''||', count(distinct '||column_name||') from '||table_name||';' from user_tab_columns where table_name like 'DEMO%' and data_type like 'VAR%';

select 'DEMO_CUSTOMERS_LOCAL','CUST_FIRST_NAME', count(distinct CUST_FIRST_NAME) from DEMO_CUSTOMERS_LOCAL;
select 'DEMO_CUSTOMERS_LOCAL','CUST_LAST_NAME', count(distinct CUST_LAST_NAME) from DEMO_CUSTOMERS_LOCAL;
select 'DEMO_CUSTOMERS_LOCAL','CUST_STREET_ADDRESS1', count(distinct CUST_STREET_ADDRESS1) from DEMO_CUSTOMERS_LOCAL;
select 'DEMO_CUSTOMERS_LOCAL','CUST_STREET_ADDRESS2', count(distinct CUST_STREET_ADDRESS2) from DEMO_CUSTOMERS_LOCAL;
select 'DEMO_CUSTOMERS_LOCAL','CUST_CITY', count(distinct CUST_CITY) from DEMO_CUSTOMERS_LOCAL;
select 'DEMO_CUSTOMERS_LOCAL','CUST_STATE', count(distinct CUST_STATE) from DEMO_CUSTOMERS_LOCAL;
select 'DEMO_CUSTOMERS_LOCAL','CUST_POSTAL_CODE', count(distinct CUST_POSTAL_CODE) from DEMO_CUSTOMERS_LOCAL;
select 'DEMO_CUSTOMERS_LOCAL','PHONE_NUMBER1', count(distinct PHONE_NUMBER1) from DEMO_CUSTOMERS_LOCAL;
select 'DEMO_CUSTOMERS_LOCAL','PHONE_NUMBER2', count(distinct PHONE_NUMBER2) from DEMO_CUSTOMERS_LOCAL;
select 'DEMO_CUSTOMERS_LOCAL','CUST_EMAIL', count(distinct CUST_EMAIL) from DEMO_CUSTOMERS_LOCAL;

10 rows selected.

Create Database Link to DB on Amazon EC2 Instance

From a database, we can create a database link to another remote database, such as one on an Amazon EC2 virtual server, as shown below. Here, ec2-12-34-567-899.compute-1.amazonaws.com is the Amazon EC2 Linux instance's Public DNS.

SQL> create database link dl_aws_ec2 connect to prod_DB identified by PWDXXX using '(description=(address=(protocol= TCP)(host=ec2-12-34-567-899.compute-1.amazonaws.com)(port=1521)) (connect_data=(sid=XE)))';
Database link created.

SQL> select count(1) from user_tables@dl_aws_ec2;

COUNT(1)
15

Just as with any database, we can query through the link: list a few tables whose names start with "DEMO", count their records, and, if we want, make a local copy of a table.

SQL> select table_name from user_tables@dl_aws_ec2 where table_name like 'DEMO%' and rownum <5 order by table_name;

TABLE_NAME
DEMO_CUSTOMERS
DEMO_ORDERS
DEMO_ORDER_ITEMS
DEMO_PAGE_HIERARCHY

SQL> select count(*) from DEMO_CUSTOMERS@dl_aws_ec2;

COUNT(*)
7

SQL> create table DEMO_CUSTOMERS_LOCAL as select * from DEMO_CUSTOMERS@dl_aws_ec2;
Table created.

Sunday, September 22, 2013

Trim Function- Remove Leading and Trailing Blanks

Leading and trailing blanks can be removed by the Oracle trim function, as shown below.

If we look at the lengths of the original and trimmed strings (columns 4 and 5), we notice that the fourth record has 4 blanks in the original string. After trimming, however, the string becomes NULL (length zero). If we want to keep one blank for that record, we can use the NVL function to replace the NULL with a single blank.

In one project, I used a query similar to the following to fix a data field in debit card transactions.

nvl(substr( trim(SD_TERM_NAME_LOC),1,18),' ')

What I did was first remove the leading and trailing blanks, then extract the first 18 characters (in case the terminal name is too long). If the terminal name is all blanks or NULL, I replace it with a single blank. This data preparation step is necessary before we build a predictive model.
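For readers more comfortable outside SQL, the same cleanup rule can be sketched in Python (an illustration of the logic, with a made-up function name, not the production code):

```python
# Mirror of nvl(substr(trim(x),1,18),' '): strip blanks, truncate to 18
# characters, and fall back to a single blank for empty/NULL values.
def clean_terminal_name(raw):
    if raw is None:          # NULL in the database
        return " "
    trimmed = raw.strip()[:18]
    return trimmed if trimmed else " "
```

For example, an all-blank name becomes a single blank, and a long name is cut to 18 characters.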

Some Observations on NULL Value Handling in Oracle SQL

We need to be aware of how NULL/missing values are handled in SQL queries so that we are not surprised by query results that appear "wrong". This is described using the following simple table as an example. The fifth record has a NULL value.

SQL> select id, value from tbl_data order by id;

ID VALUE
1 -1
2 0
3 1
4 2
5  

If we calculate the total number of records in the table, the number of records with value >= 0, and the number with value < 0, they are 5, 3 and 1, respectively, as shown below. Notice that the count for value >= 0 (3) plus the count for value < 0 (1) is less than the total number of records (5). This is because a NULL value makes the where-clause predicate evaluate to unknown, which excludes the record from consideration.

SQL> select count(*) from tbl_data;

COUNT(*)
5

SQL> select count(*) from tbl_data where value>=0;

COUNT(*)
3

SQL> select count(*) from tbl_data where value<0;

COUNT(*)
1

A better way to calculate these kinds of statistics is to use "case when" instead of the where clause, as shown below.

SQL> select count(*) total, sum(case when value>=0 then 1 else 0 end) non_negative, sum(case when value<0 then 1 else 0 end) negative, sum(case when value is null then 1 else 0 end) n_missing from tbl_data;

TOTAL NON_NEGATIVE NEGATIVE N_MISSING
5 3 1 1

Or even better, we combine "case when" with "group by" to calculate the statistics. "Group by" is one of my favorites as it gives the complete picture (including NULL values) about the data.

SQL> select v, count(*) from (select case when value>=0 then 'non-negative' when value <0 then 'negative' else 'missing' end v from tbl_data) group by v;

V COUNT(*)
negative 1
non-negative 3
missing 1

In summary, if we are aware of how NULL values are handled in the database, we will not be surprised by query results that appear "wrong".
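The same behavior can be reproduced in any SQL engine. Here is a small Python/SQLite sketch of the example above: the WHERE-clause counts miss the NULL row, while the CASE WHEN expression accounts for it explicitly.

```python
import sqlite3

# Rebuild the five-row example table, with a NULL in the fifth record.
conn = sqlite3.connect(":memory:")
conn.execute("create table tbl_data (id integer, value integer)")
conn.executemany("insert into tbl_data values (?, ?)",
                 [(1, -1), (2, 0), (3, 1), (4, 2), (5, None)])

total = conn.execute("select count(*) from tbl_data").fetchone()[0]
non_neg = conn.execute("select count(*) from tbl_data where value >= 0").fetchone()[0]
neg = conn.execute("select count(*) from tbl_data where value < 0").fetchone()[0]
missing = conn.execute(
    "select sum(case when value is null then 1 else 0 end) from tbl_data").fetchone()[0]

# The NULL row is counted only when we ask for it explicitly.
print(total, non_neg, neg, missing)
```

Note that non_neg + neg is 4, one short of the total, exactly as in the Oracle session above.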

Thursday, August 29, 2013

Generate HTML Web Pages from SQL Query Outputs

Sometimes we want to put our SQL query results on web pages. With the sqlplus option "set markup html on", this can be done easily. The following query returns plain text output.
Now we turn on the html option and run the same query. The output of the select query becomes an HTML document.

I copied the HTML document into this post, changed the width to 40%, and got the following table.

ITEM NAME
Apple John
Banana John
Banana John
Apple John
Apple John
Pear John
Pear John
Peach John
Peach John
Peach John

We can turn off the option by running "set markup html off".
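Outside of SQL*Plus, the same idea is easy to script. The following Python sketch (the function name and attributes are my own, not a sqlplus feature) renders query rows as an HTML table roughly like what "set markup html on" produces:

```python
from html import escape

# Render (headers, rows) as a simple bordered HTML table.
def to_html_table(headers, rows, width="40%"):
    out = [f'<table border="1" width="{width}">']
    out.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in headers) + "</tr>")
    for row in rows:
        out.append("<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>")
    out.append("</table>")
    return "\n".join(out)
```

Escaping the cell values keeps characters like < and & from breaking the page.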

Thursday, August 08, 2013

Calculate Statistical Mode- the Most Frequent Value

The statistical mode is the value that occurs most often (as shown in the chart below). For discrete variables, the mode can be calculated using the same approach described in the post Find the most frequent items using SQL: we simply select the top 1 most frequent item.

For continuous variables, we can use the function width_bucket() to divide the data into segments and calculate the frequency in each segment. We then pick the average value of the data points from the most frequent segment as the mode. We use the 1,000 random numbers described in the post Generate 1000 Normally Distributed Numbers. The following query calculates the frequency and average value for each segment defined by width_bucket(variable, min_value, max_value, number_of_buckets). Buckets 0 and 12 are for values lower than -3 and higher than 3, respectively.

SQL> select s, count(1), avg(num) from (select width_bucket(num, -3,3, 11) s, num from TBL_1K_RND) group by s order by s;

         S   COUNT(1)   AVG(NUM)
---------- ---------- ----------
         0          2 -3.0538402
         1          5 -2.6387896
         2         29 -2.1748723
         3         58 -1.6073189
         4        115 -1.0457551
         5        185 -.53460548
         6        208 .010066623
         7        194 .525553927
         8        119 1.05966651
         9         47  1.5985678
        10         32 2.14458861
        11          4 2.67165765
        12          2 3.06640121

Again, we use the row_number() function to generate a rank based on frequency and calculate the mode as follows.

with tbl_1 as
(
select s, count(1) cnt, avg(num) num from (select width_bucket(num, -3,3, 11) s, num from TBL_1K_RND) group by s order by s),
tbl_2 as
(
select a.*, row_number() over(order by cnt desc) rnk from tbl_1 a
)
select num from tbl_2 where rnk=1;

       NUM
----------

.010066623
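The bucketing logic itself is simple enough to sketch in plain Python (an illustration of the width_bucket approach, with underflow and overflow buckets numbered the same way as in the query above):

```python
import random
from collections import defaultdict

# Split [lo, hi) into n_buckets segments, find the most populated one,
# and return the average of the values that fell into it.
def bucketed_mode(values, lo=-3.0, hi=3.0, n_buckets=11):
    width = (hi - lo) / n_buckets
    buckets = defaultdict(list)
    for v in values:
        if v < lo:
            b = 0                  # underflow bucket, like width_bucket
        elif v >= hi:
            b = n_buckets + 1      # overflow bucket
        else:
            b = int((v - lo) / width) + 1
        buckets[b].append(v)
    fullest = max(buckets, key=lambda b: len(buckets[b]))
    return sum(buckets[fullest]) / len(buckets[fullest])

random.seed(42)
sample = [random.gauss(0, 1) for _ in range(1000)]
print(bucketed_mode(sample))  # should land near 0 for a standard normal
```

With 1,000 standard-normal draws, the fullest bucket sits around zero, matching the SQL result.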

Generate 1000 Normally Distributed Numbers

In the earlier post Oracle Normal Distribution Function, we showed that the Oracle function dbms_random.normal generates normally distributed numbers with zero mean and standard deviation 1. The following PL/SQL script generates 1,000 random numbers using the function.

create table tbl_1k_rnd (num number);

begin
  for i in 1..1000 loop
    insert into tbl_1k_rnd select dbms_random.normal from dual;
  end loop;
  commit;
end;
/
The following query confirms that the mean and standard deviation are indeed close to 0 and 1, respectively.

SQL> select avg(num), stddev(num), count(*) from TBL_1K_RND;

  AVG(NUM) STDDEV(NUM)   COUNT(*)
---------- ----------- ----------
-.00403068  1.02280961       1000



Wednesday, August 07, 2013

Find the most frequent items using SQL (continued)

In the earlier post Find the most frequent items using SQL, we used a "group by" query to calculate the frequency of items and then the functions row_number() or dense_rank() to generate a rank based on the frequency to pick the top N items. The difference between row_number() and dense_rank() is that row_number() generates a unique rank for each row even if there are ties.

Now, we want to solve a more sophisticated problem: select the top 2 most frequent items in table tbl_items_by_person (shown below) for each person, Emily and John, instead of the global top 2 most frequent items.

Table tbl_items_by_person

The first step is to calculate the item frequency grouped by name and item.

SQL> select name, item, count(*) as num from TBL_ITEMS_BY_PERSON group by name, item order by name, count(*) desc;

NAME                             ITEM                                    NUM
-------------------------------- -------------------------------- ----------
Emily                            Peach                                     8
Emily                            Apple                                     4
Emily                            Banana                                    3
Emily                            Pear                                      2
John                             Orange                                    5
John                             Apple                                     3
John                             Peach                                     3
John                             Banana                                    2
John                             Pear                                      2

The second step is to use row_number() or dense_rank() to generate the rank for each name separately. This is done using the clause over(partition by name order by num desc).

SQL> select name, item, num, row_number() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from TBL_ITEMS_BY_PERSON group by name, item ) order by name, num desc;

NAME                             ITEM                                    NUM        RNK
-------------------------------- -------------------------------- ---------- ----------
Emily                            Peach                                     8          1
Emily                            Apple                                     4          2
Emily                            Banana                                    3          3
Emily                            Pear                                      2          4
John                             Orange                                    5          1
John                             Apple                                     3          2
John                             Peach                                     3          3
John                             Banana                                    2          4
John                             Pear                                      2          5

We can pick the top 2 frequent items for Emily and John using the following query.

SQL> with tbl as (select name, item, num, row_number() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from TBL_ITEMS_BY_PERSON group by name, item )) select * from tbl where rnk<=2 order by name, num desc;

NAME                             ITEM                                    NUM        RNK
-------------------------------- -------------------------------- ---------- ----------
Emily                            Peach                                     8          1
Emily                            Apple                                     4          2
John                             Orange                                    5          1
John                             Peach                                     3          2

Or, we can use dense_rank() to generate the rank, which assigns the same rank to ties.

SQL> select name, item, num, dense_rank() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from TBL_ITEMS_BY_PERSON group by name, item ) order by name, num desc;

NAME                             ITEM                                    NUM        RNK
-------------------------------- -------------------------------- ---------- ----------
Emily                            Peach                                     8          1
Emily                            Apple                                     4          2
Emily                            Banana                                    3          3
Emily                            Pear                                      2          4
John                             Orange                                    5          1
John                             Apple                                     3          2
John                             Peach                                     3          2
John                             Banana                                    2          3
John                             Pear                                      2          3

For John, there are 3 Peaches and 3 Apples. Both are ranked 2, so both are selected, as shown below.

SQL> with tbl as (select name, item, num, dense_rank() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from TBL_ITEMS_BY_PERSON group by name, item )) select * from tbl where rnk<=2 order by name, num desc;

NAME                             ITEM                                    NUM        RNK
-------------------------------- -------------------------------- ---------- ----------
Emily                            Peach                                     8          1
Emily                            Apple                                     4          2
John                             Orange                                    5          1
John                             Peach                                     3          2
John                             Apple                                     3          2
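The partition-by ranking can be mimicked in plain Python, which may help show exactly how row_number() and dense_rank() differ on ties. This is an illustrative sketch over the same per-person tallies as above (the counts are hard-coded, not read from the table):

```python
from itertools import groupby

# (name, item, count) tallies, as produced by the group-by query.
counts = [("Emily", "Peach", 8), ("Emily", "Apple", 4), ("Emily", "Banana", 3),
          ("Emily", "Pear", 2), ("John", "Orange", 5), ("John", "Apple", 3),
          ("John", "Peach", 3), ("John", "Banana", 2), ("John", "Pear", 2)]

def top_n(rows, n, dense=False):
    picked = []
    data = sorted(rows, key=lambda r: (r[0], -r[2]))   # partition, then order desc
    for name, grp in groupby(data, key=lambda r: r[0]):
        rank, prev = 0, None
        for i, (_, item, num) in enumerate(grp, start=1):
            if dense:
                if num != prev:       # dense_rank: ties share a rank
                    rank += 1
            else:
                rank = i              # row_number: unique rank even for ties
            prev = num
            if rank <= n:
                picked.append((name, item, num, rank))
    return picked
```

top_n(counts, 2) returns exactly two items per person, while top_n(counts, 2, dense=True) also keeps John's tied Peach.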

Find the most frequent items using SQL

It is a common task to find the most frequent items. For example, we want to find the top 2 most frequently purchased items in the following table.
SQL> select * from TBL_ITEMS_PURCHASED;
ITEM
Apple
Banana
Banana
Apple
Apple
Pear
Pear
Peach
Peach
Peach
Orange
Orange
Orange
Orange
Orange

We can run a "group by" query to calculate the frequency of items as shown below.
SQL> select item, count(*) as num from tbl_items_purchased group by item order by num desc;
ITEM    NUM
Orange    5
Apple     3
Peach     3
Banana    2
Pear      2

Now the tricky part is to pick the top 2 most frequent ones. There are two situations.

1. Pick exactly two items and ignore ties. We can use the row_number() function to generate a rank based on the frequency and pick the top 2. See the following two queries.
SQL> select a.*, row_number() over(order by num desc) rnk from (select item, count(*) as num from tbl_items_purchased group by item ) a order by rnk;
ITEM    NUM    RNK
Orange    5      1
Apple     3      2
Peach     3      3
Banana    2      4
Pear      2      5

SQL> with tbl as (select a.*, row_number() over(order by num desc) rnk from (select item, count(*) as num from tbl_items_purchased group by item ) a ) select * from tbl where rnk <=2;
ITEM    NUM    RNK
Orange    5      1
Apple     3      2

2. Pick items with the top 2 frequencies, regardless of how many items that includes. In this case, we use the dense_rank() function to generate a rank based on the frequency. See the following two queries.
SQL> select a.*, dense_rank() over(order by num desc) rnk from (select item, count(*) as num from tbl_items_purchased group by item ) a order by rnk;
ITEM    NUM    RNK
Orange    5      1
Apple     3      2
Peach     3      2
Banana    2      3
Pear      2      3
SQL> with tbl as (select a.*, dense_rank() over(order by num desc) rnk from (select item, count(*) as num from tbl_items_purchased group by item ) a ) select * from tbl where rnk<=2;
ITEM    NUM    RNK
Orange    5      1
Apple     3      2
Peach     3      2
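The two readings of "top 2" can also be illustrated in a few lines of Python (a sketch over the same purchase list, not part of the original post): Counter.most_common(2) picks exactly two items, like row_number(), while keeping every item whose count reaches the second-highest frequency mirrors dense_rank().

```python
from collections import Counter

items = ["Apple", "Banana", "Banana", "Apple", "Apple", "Pear", "Pear",
         "Peach", "Peach", "Peach", "Orange", "Orange", "Orange", "Orange", "Orange"]

freq = Counter(items)
exactly_two = freq.most_common(2)                      # row_number-style: ignore ties
cutoff = sorted(set(freq.values()), reverse=True)[1]   # second-highest frequency
with_ties = [(i, n) for i, n in freq.most_common() if n >= cutoff]
```

Here exactly_two drops one of the items tied at 3, while with_ties keeps both.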

Tuesday, August 06, 2013

Produce Analytic Reports Using Oracle Apex vs. Excel Spreadsheet

For those who know how to write SQL scripts, Oracle Apex is a great tool for presenting analytic results online. There is no need to learn another web authoring language. Almost everything, including reports, charts, checkboxes, radio buttons, and select lists, is done using SQL. Once, I spent a few days and built an internal employee fraud monitoring tool for a bank.

I have found that building reports online using Apex is far more efficient than using Excel spreadsheets.

  • Creating a spreadsheet involves manual editing, which is error prone. With Apex, reports and charts are simply SQL queries.
  • It is harder to distribute spreadsheet files. Typically, they are distributed through emails as attachments, and keeping them up to date is a problem: any change in the report requires the spreadsheet files to be resent, which can cause a lot of confusion for the receivers. With Apex, all we need is a URL link to the report.
  • Most importantly, reports in the form of spreadsheets are "fixed": we have to create them in advance, and they do not change once created. With Oracle Apex, reports are generated on the fly based on the users' inputs. In database terms, reports are simply SQL queries that are executed in real time. Thus, the number of reports that can be produced by an Apex page is huge (for example, our reports are based on the users' selections of date ranges, states, product types, etc.). It is hard to produce and navigate a spreadsheet file that has more than, say, 100 tabs.

The following screenshot is from one of the reports that I built using Oracle Apex.

Friday, August 02, 2013

My Check Fraud Detection Model Reduces Fraud by 60%

A project manager from a bank just told me that the model I developed has resulted in a 60% reduction in fraud loss. He said "everyone was surprised at how effective it was." It was a model that I developed two years ago for a top 15 bank. (Of course, I was not surprised at all, since the model was tested on holdout data sets and showed similar performance.)

The model was built and deployed on Oracle databases. The main reason for its success was that I spent a huge amount of effort building model variables that capture the fraud patterns. Luckily, I could use Oracle analytic functions to build those variables easily. (Please see my posts "How to build predictive models that win competitions" and "Recency, Frequency, Monetary (RFM) Analysis: Part 1".)

Another important advantage of using a uniform platform, i.e., Oracle databases, is that deployment is easy. I simply deployed the model as a set of SQL queries. See my posts "Build Predictive Models Using PL/SQL" and "Logistic Regression Model Implemented in SQL".

Thursday, August 01, 2013

Get the source code for Oracle database views and materialized views

Oracle user_views/user_mviews contain the source code for views/materialized views.
SQL> select text from user_views where view_name='V_R_SUM';
TEXT
--------------------------------------------------------------------------------

select to_char(dt_tm, 'YYYY') yy, to_char(DT_TM, 'MM') mm, to_char(DT_TM,'DD') dd, to_char(DT_TM, 'HH24') hh, to_char(DT_TM, 'MI') mi, to_char(DT_TM,'SS') ss, min(price) mi_p, avg(price) a_p, max(price) mx_p from MV_VIX3 group by to_char(dt_tm, 'YYYY'),to_char(DT_TM, 'MM'),to_char(DT_TM,'DD'),to_char(DT_TM, 'HH24'),to_char(DT_TM, 'MI'), to_char(DT_TM,'SS')

SQL> select query from user_mviews where mview_name='MV_ALL_FOR_SUM';
QUERY
--------------------------------------------------------------------------------
select
a.TXN_ROWID,
a.TRN_CUSTOMER_ACCOUNT_NUM,
a.TRN_TRANSACTION_DT daily_txn_dt,
a.TRN_TRANSACTION_AMT,
a.STAR_ROWID,
a.CUSTOMER_ID_NUMBER,
a.STAR_TRAN_DATETIME,
b.TRANSACTION_AMT frd_clm_amt, b.TRANSACTION_DT frd_clm_dt, b.clm_rowid,
c.fee, c.od_rowid
from
tbl_m_txn_star_final a,
tbl_m_txn_star_clm b,
mv_m_txn_star_od c
where a.txn_rowid=b.txn_rowid(+) and
a.txn_rowid=c.txn_rowid(+)

This is very helpful, as we can always find out the logic that was used to create the view. This is one of the advantages of using views over tables.

Get source code for Oracle database objects: user_source view

We can query Oracle user_source to get the source code of functions, procedures, triggers, etc. User_source contains 4 columns: name, type, line and text.

On my schema, the following query shows that there are 55 lines of source code for functions, 50 lines for procedures, etc.

SQL> select type, count(*) from user_source group by type order by type;
TYPE COUNT(*)
------------ ----------
FUNCTION 55
PACKAGE 31
PACKAGE BODY 469
PROCEDURE 50

The following query shows the definition of a function named z_cap. Thus, user_source is a very handy way to retrieve source code.

SQL> select line, text from user_source where name='Z_CAP';
LINE TEXT
---------- --------------------------------------------------------------------------------
1 FUNCTION z_cap (
2 p_data number,
3 low number,
4 hi number
5 )
6 RETURN NUMBER IS
7 v_number NUMBER;
8 BEGIN
9 if p_data < low then v_number := low;
10 elsif p_data>hi then v_number:= hi;
11 else v_number:= p_data;
12 end if;
13 return v_number;
14 END z_cap;
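For reference, the retrieved z_cap function is just a clamp. A one-to-one sketch of the same logic in Python:

```python
# Clamp p_data into the interval [low, hi], like the PL/SQL z_cap above.
def z_cap(p_data, low, hi):
    if p_data < low:
        return low
    elif p_data > hi:
        return hi
    return p_data
```

Capping extreme values like this is a common step when preparing model variables.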

Tuesday, July 30, 2013

Graphical User Interface Data Mining Tools

Many data miners are aware of graphical user interface data mining tools such as Enterprise Miner from SAS and SPSS Clementine. I have found that the GUI-based data mining tool in Oracle Advanced Analytics is great. With Oracle's data mining tool, everything is stored in the database, which gives me peace of mind knowing that my data and processes are secure. The following is a flow that I built last year to show how to train and deploy a cell phone customer acquisition model.

Sunday, June 30, 2013

Calculate correlation matrix using SQL (continued)

In the earlier post Calculate correlation matrix using SQL, we showed a simple trick for calculating the correlation matrix for an unlimited number of rows and columns. We need to convert the data table into a simple table with only three columns: rec_id (record id), var_id (column id), val (value). I was asked how I did the conversion.

If the original table is already in an Oracle database, we can write PL/SQL scripts to convert the original table into the simple table.

In my past experience, the original data were in text files. So I wrote awk scripts similar to the following to convert a text file into the simple format and then loaded the result into Oracle. In the script, I included the file name as the first column so that I could put data from multiple files into a single destination file.

Original data:
$ cat sample.csv
record_id, field1, field2, field3
1001,1,2,3
1002,4,5,6
1003,7,8,9
$ awk -F"," '{for (i=1;i<=NF;i++) print FILENAME","FNR-1","i","$i;}' sample.csv
sample.csv,0,1,record_id
sample.csv,0,2, field1
sample.csv,0,3, field2
sample.csv,0,4, field3
sample.csv,1,1,1001
sample.csv,1,2,1
sample.csv,1,3,2
sample.csv,1,4,3
sample.csv,2,1,1002
sample.csv,2,2,4
sample.csv,2,3,5
sample.csv,2,4,6
sample.csv,3,1,1003
sample.csv,3,2,7
sample.csv,3,3,8
sample.csv,3,4,9
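The same conversion can be sketched in Python (an illustration with the sample file inlined; unlike the awk one-liner, this version strips the padding that follows the commas):

```python
import csv
import io

# Flatten each cell into (filename, record_id, column_id, value), with the
# header row as record 0, matching the awk FNR-1 numbering.
def flatten(filename, text):
    rows = []
    for rec, fields in enumerate(csv.reader(io.StringIO(text))):
        for col, val in enumerate(fields, start=1):
            rows.append((filename, rec, col, val.strip()))
    return rows

sample = "record_id, field1, field2, field3\n1001,1,2,3\n1002,4,5,6\n1003,7,8,9\n"
for row in flatten("sample.csv", sample):
    print(",".join(str(x) for x in row))
```

The flattened rows can then be loaded into the three-column table (plus the file-name column) used for the correlation calculation.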

Tuesday, May 14, 2013

How Are 10 Most Influential People in Data Analytics Selected?

I was asked by Gregory at KDnuggets about how I came up with "the 10 Most Influential People in Data Analytics" in my last blog post. The following was my reply.

To select the 10 most influential people in data analytics, the following considerations were taken into account regarding an individual's contribution.

  1. The contribution is significant.
  2. The contribution is active/regular.
  3. A large number of people are impacted by the contribution.
  4. The focus is on the non-academic field.

I performed online research first to find qualified people. KDnuggets.com, other websites of social networking, data analytics conferences, consulting firms, and Amazon are just a few examples of good sources for information. I also took advantage of my own network. Being in the industry for 15 years, I have known many great data analytics professionals. They provided me with many names of qualified people. After I compiled a preliminary list, I sent it to a number of experts for their feedback. The final list was the result of several iterations.

Monday, May 13, 2013

10 Most Influential People in Data Analytics

We have identified the 10 most influential people whose significant contributions have greatly enriched the data analytics community. This is the result of months of research. The following is the list (in alphabetical order by last name).


Dean Abbott Michael Berry Tom Davenport John Elder Rayid Ghani
Anthony Goldbloom Vincent Granville Gregory Piatetsky-Shapiro Karl Rexer Eric Siegel



Dean Abbott is President of Abbott Analytics, Inc. in San Diego, California. Mr. Abbott is an internationally recognized data mining and predictive analytics expert with over two decades of experience applying advanced data mining algorithms, data preparation techniques, and data visualization methods to real-world problems, including fraud detection, risk modeling, text mining, personality assessment, response modeling, survey analysis, planned giving, and predictive toxicology. He is also Chief Scientist of SmarterRemarketer, a startup company focusing on behaviorally- and data-driven marketing attribution and web analytics.

Mr. Abbott is a highly regarded and popular speaker at Predictive Analytics and Data Mining conferences, including Predictive Analytics World, Predictive Analytics Summit, the Predictive Analytics Center of Excellence, SAS Institute, DM Radio, and INFORMS.

He has served on the program committees for the KDD Industrial Track and Data Mining Case Studies workshop and is on the Advisory Boards for the UC/Irvine Predictive Analytics Certificate and the UCSD Data Mining Certificate programs. Mr. Abbott has taught applied data mining and text mining courses using IBM SPSS Modeler, Statsoft Statistica, Salford Systems SPM, SAS Enterprise Miner, Tibco Spotfire Miner, IBM Affinium Model, Megaputer Polyanalyst, KNIME, and RapidMiner.


Michael Berry is a recognized authority on business applications of data mining. He is the author (with Gordon Linoff) of several well-regarded books in the field including Data Mining Techniques for Marketing, Sales, and Customer Relationship Management which is now in its third edition.

He is currently responsible for analytics and business intelligence for the business-to-business side of Tripadvisor (www.tripadvisor.com). Mr. Berry is a co-founder of Data Miners, Inc. (www.data-miners.com), a consultancy specializing in the analysis of large volumes of data for marketing and CRM purposes.


Tom Davenport is a Visiting Professor at Harvard Business School. He is also the President’s Distinguished Professor of Information Technology and Management at Babson College, the co-founder of the International Institute for Analytics, and a Senior Advisor to Deloitte Analytics. He has published on the topics of analytics in business, process management, information and knowledge management, and enterprise systems. He pioneered the concept of “competing on analytics” with his best-selling 2006 Harvard Business Review article (and his 2007 book by the same name). His most recent book is Keeping Up with the Quants: Your Guide to Understanding and Using Analytics, with Jinho Kim. He has written or edited fifteen other books and over 100 articles for Harvard Business Review, Sloan Management Review, the Financial Times, and many other publications. In 2003 he was named one of the world’s “Top 25 Consultants” by Consulting magazine. In 2005 Optimize magazine’s readers named him among the top 3 business/technology analysts in the world. In 2007 and 2008 he was named one of the 100 most influential people in the IT industry by Ziff-Davis magazines. In 2012 he was named one of the world’s top fifty business school professors by Fortune magazine.


John Elder founded and leads America’s largest and most experienced data mining consultancy. Founded in 1995, Elder Research (http://www.datamininglab.com) has offices in Charlottesville, Virginia, and Washington, DC, and has completed projects in a huge variety of areas involving the mining of data, text, and links. Dr. Elder co-authored three books (on practical data mining, ensembles, and text mining), two of which won PROSE awards for top book of the year in Mathematics or Computer Science. John has authored several data mining tools, was one of the discoverers of ensemble methods, has chaired international conferences, and is a frequent keynote speaker. He is probably best known for explaining complex analytic concepts with clarity, humor, and enthusiasm.

Dr. Elder has degrees in Engineering (Systems PhD, UVA + Electrical Masters & BS, Rice) and is an occasional Adjunct Professor at UVA. He was honored to be named by President Bush to serve 5 years on a panel to guide technology for national security. Lastly, John is grateful to be a follower of Christ and the father of five.


Rayid Ghani is currently at the Computation Institute and the Harris School of Public Policy at the University of Chicago. Rayid is also the co-founder of Edgeflip, an analytics startup building social media analytics products that allow non-profits and social good organizations to better use social networks to raise money, recruit, engage, and mobilize volunteers, and do targeted outreach and advocacy.

Rayid Ghani was the Chief Scientist at Obama for America 2012 campaign focusing on analytics, technology, and data. His work focused on improving different functions of the campaign including fundraising, volunteer, and voter mobilization using analytics, social media, and machine learning. Before joining the campaign, Rayid was a Senior Research Scientist and Director of Analytics research at Accenture Labs where he led a technology research team focused on applied R&D in analytics, machine learning, and data mining for large-scale & emerging business problems in various industries including healthcare, retail & CPG, manufacturing, intelligence, and financial services.

In addition, Rayid serves as an adviser to several analytics start-ups, is an active speaker, organizer of, and participant in academic and industry analytics conferences, and publishes regularly in machine learning and data mining conferences and journals.


Anthony Goldbloom is the founder and CEO of Kaggle. Before founding Kaggle, Anthony worked in the macroeconomic modeling areas of the Reserve Bank of Australia and before that the Australian Treasury.

He holds a first class honours degree in economics and econometrics from the University of Melbourne and has published in The Economist magazine and the Australian Economic Review.

In 2011, Forbes Magazine cited Anthony as one of the 30 under 30 in technology and Fast Company featured him as one of the innovative thinkers who are changing the future of business.


Dr. Vincent Granville is a visionary data scientist with 15 years of big data, predictive modeling, digital and business analytics experience. Vincent is widely recognized as the leading expert in scoring technology, fraud detection and web traffic optimization and growth. Over the last ten years, he has worked in real-time credit card fraud detection with Visa, advertising mix optimization with CNET, change point detection with Microsoft, online user experience with Wells Fargo, search intelligence with InfoSpace, automated bidding with eBay, click fraud detection with major search engines, ad networks and large advertising clients.

Most recently, Vincent launched Data Science Central, the leading social network for big data, business analytics and data science practitioners. Vincent is a former post-doctorate of Cambridge University and the National Institute of Statistical Sciences. He was among the finalists at the Wharton School Business Plan Competition and at the Belgian Mathematical Olympiads. Vincent has published 40 papers in statistical journals and is an invited speaker at international conferences. He also developed a new data mining technology known as hidden decision trees, owns multiple patents, published the first data science book, and raised $6MM in start-up funding. Vincent is among the top 20 big data influencers according to Forbes, was featured on CNN, and is #1 in Gil Press' A-List of data scientists.


Gregory Piatetsky-Shapiro, Ph.D. (@kdnuggets) is the Editor of KDnuggets.com, a leading site for Analytics, Big Data, Data Mining, and Data Science. He is also a well-known expert and an independent consultant in this field. Previously, he led data mining teams at GTE Laboratories and was Chief Scientist for two start-ups. He has extensive experience in applying analytic and data mining methods to many areas — including customer modeling, healthcare data analysis, fraud detection, bioinformatics and Web analytics — and worked for a number of leading banks, insurance companies, telcos, and pharmaceutical companies.

He coined the terms “KDD” and “Knowledge Discovery in Data” when he organized and chaired the first three KDD workshops. He later helped grow the workshops into ACM Conf. on Knowledge Discovery and Data Mining (kdd.org), the top research conference in the field. Dr. Piatetsky-Shapiro is also a co-founder of ACM SIGKDD, the leading professional organization for Knowledge Discovery and Data Mining and served as the Chair of SIGKDD (2005-2009). He received ACM SIGKDD and IEEE ICDM Distinguished Service Awards. He has over 60 publications with over 10,000 citations.


Karl Rexer, PhD is President of Rexer Analytics (www.RexerAnalytics.com). Founded in 2002, Rexer Analytics has delivered analytic solutions to dozens of companies. Solutions include fraud detection, customer attrition analysis and prediction, advertisement abandonment prediction, direct mail targeting, market basket analysis and survey research. Rexer Analytics also conducts and freely distributes the widely read Data Miner Survey. The survey has been written about and cited in over 12 languages. In the spring of 2013, over a thousand analytic professionals from around the world participated in the 6th Data Miner Survey.

Karl has served on the organizing and review committees of several international conferences (e.g., KDD), and is on the Board of Directors of Oracle's Business Intelligence, Warehousing, & Analytics (BIWA) Special Interest Group. He has served on IBM's Customer Advisory Board, is an Industry Advisor for Babson College's Business Analytics program, and is in the #1 position on LinkedIn's list of Top Predictive Analytics Professionals. He is frequently an invited speaker and moderator at conferences and universities. So far in 2013 he has conducted data mining trainings in California, China and London. Prior to founding Rexer Analytics, Karl held leadership and consulting positions at several consulting firms and two multi-national banks.


Eric Siegel, PhD, founder of Predictive Analytics World and Text Analytics World, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, and Executive Editor of the Predictive Analytics Times, makes the how and why of predictive analytics understandable and captivating. Eric is a former Columbia University professor who used to sing educational songs to his students, and a renowned speaker, educator and leader in the field.

Monday, April 22, 2013

Find the Most Important Variables In Predictive Models

A commonly used method for determining the most important variables is to examine how well each variable individually predicts the target variable. However, this approach has its limits. The first limit is that it may exclude variables that are actually good ones.

For example, to find out whether credit score is a good indicator of account default, we calculate the default rate for each credit class as shown below (or we may perform a statistical test such as a Chi-Square test). As we can see, the low credit score class has a default rate of 18% vs. 6% for the high credit score class. Thus, credit class is considered a good variable for building a default model.

Credit score and account default.

However, this approach has its limits because it does not take the relationships among variables into consideration. This can be illustrated using an imaginary example. We want to assess whether people's height and weight are indicative of getting a disease. We can calculate the following tables for height and weight, respectively. Since short and tall people have the same percentage of sick people (2%), we may conclude that height is not relevant to predicting the disease. Similarly, we may also think weight is not important.

Height and Disease

Weight and Disease

If we examine weight and height at the same time, we can develop the following matrix. There are four groups of people: tall/heavy (normal), short/light (normal), tall/light (abnormal), and short/heavy (abnormal). 11% of the short/heavy or tall/light people are sick (orange cells), while the percentage of sick people in the tall/heavy and short/light groups (green cells) is only 0.1%. Thus, height and weight together are very good variables to include in a predictive model.

Height/Weight and Disease


As we have seen, this approach may exclude variables that are actually good. When we determine the most important variables for building a predictive model, we should ideally consider a set of variables as a whole. More often than not, it is the relationships between variables that provide the best predictive power. How to find or generate the most useful variables for predictive models is so crucial that we will talk more about it in upcoming blog posts. I have written another post, More on How to Find the Most Important Variables for a Predictive Model, using Oracle's Attribute Importance function.
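The height/weight example can be reproduced with a quick simulation (a sketch with synthetic data; the group sizes and sickness rates are made up to mirror the matrix above):

```python
import random

random.seed(42)

# Synthetic population: disease is driven by the *mismatch* between height
# and weight, so each variable alone looks useless while the pair together
# is highly predictive.
people = []
for _ in range(20000):
    height = random.choice(["short", "tall"])
    weight = random.choice(["light", "heavy"])
    mismatched = (height == "short") == (weight == "heavy")  # short/heavy or tall/light
    p_sick = 0.11 if mismatched else 0.001
    people.append((height, weight, random.random() < p_sick))

def sick_rate(rows):
    return sum(sick for *_, sick in rows) / len(rows)

# Marginal rates are nearly identical -> each variable alone looks irrelevant
print("short:", sick_rate([p for p in people if p[0] == "short"]))
print("tall :", sick_rate([p for p in people if p[0] == "tall"]))

# Joint rates differ by two orders of magnitude -> the pair is predictive
mismatch = [p for p in people if (p[0] == "short") == (p[1] == "heavy")]
normal = [p for p in people if (p[0] == "short") != (p[1] == "heavy")]
print("short/heavy or tall/light:", sick_rate(mismatch))
print("tall/heavy or short/light:", sick_rate(normal))
```

Any per-variable screening (univariate rates, Chi-Square tests) would discard both variables here; only examining them jointly reveals the signal.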

Sunday, April 07, 2013

Logistic Regression Model Implemented in SQL

In a project, we needed to deploy a logistic regression model into a production system that only takes SQL scripts as its input. Two functions come in handy: decode() and nvl(). Decode() converts a categorical value into a weight, and nvl() conveniently replaces NULL with a desired value. The following SQL script is similar to what we delivered.



select transaction_id,
(1/
(1+
exp(-(
nvl(AMT1*.000019199,0)+
nvl(AMT3*(-.00002155),0)+
decode(nvl(substr((TXN_CODE1),1,18),' '),
'XX',-.070935,
'57',-.192319,
'1',-.053794,
'81',-.010813,
'NR',-.079628,
'PD',-.102987,
'P',-1.388433,
'Z6',-.106081,
'01',-1.1528,
'Z4',-.004237,
'T1',.697737,
'AK',-.490381,
'U2',.063712,
'NK',.054354,
'PR',.205336,
'51',-.286213,
'N',.075582,
' ',-.330585,
0)+
decode(nvl(substr( trim(TXN_CODE2),1,18),' '),
'U',-.11176,
0)+
decode(nvl(substr( trim(TXN_CODE3),1,18),' '),
'1',-.642605,
0)+
decode(nvl(substr( trim(TXN_CODE4),1,18),' '),
'00',-.084517,
'10',.057248,
0)
-6.8190776
)
)
)
) as score from tbl_data;
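To sanity-check such a deployment, the same scoring logic can be re-implemented outside the database. Below is a hypothetical Python sketch: the coefficients are copied from the SQL above (the decode() weight lists are abridged), dictionary lookups with a default of 0 play the role of decode(), and or-defaults play the role of nvl():

```python
import math

# Coefficients copied from the SQL script above (TXN_CODE1 list abridged)
TXN_CODE1_WEIGHTS = {"XX": -0.070935, "57": -0.192319, "P": -1.388433, " ": -0.330585}
TXN_CODE2_WEIGHTS = {"U": -0.11176}
INTERCEPT = -6.8190776

def score(amt1, amt3, txn_code1, txn_code2):
    """Logistic score: 1 / (1 + exp(-z)), where z is the linear combination."""
    z = INTERCEPT
    z += (amt1 or 0) * 0.000019199            # nvl(AMT1 * .000019199, 0)
    z += (amt3 or 0) * -0.00002155            # nvl(AMT3 * (-.00002155), 0)
    code1 = (txn_code1 or " ")[:18]           # nvl(substr(TXN_CODE1, 1, 18), ' ')
    z += TXN_CODE1_WEIGHTS.get(code1, 0)      # decode(..., default 0)
    code2 = ((txn_code2 or "").strip() or " ")[:18]  # trim() before substr()
    z += TXN_CODE2_WEIGHTS.get(code2, 0)
    return 1 / (1 + math.exp(-z))

print(score(1000, 500, "P", "U"))
```

Running a sample of transactions through both the SQL and this reference implementation is a quick way to verify that the deployed model matches the fitted one.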

Thursday, March 07, 2013

Watch out for NULL values when comparing data

It is a very common task to compare data values. For example, I was involved in a project where we upgraded the scoring engine. We wanted to make sure the old and new scoring engines produced the same outputs given the same inputs. I use the following table to illustrate the problem. We want to make sure value_old and value_new are the same. (The blanks are NULL values.)

        ID  VALUE_OLD  VALUE_NEW
---------- ---------- ----------
         1        234
         2                   567
         3        789        789

If we simply use the following query to count the number of discrepancies, the result will be zero. This is not what we expect.
select count(*) from tbl_data_a where VALUE_OLD <> VALUE_NEW;

  COUNT(*)
----------
         0
This is because rows where a NULL value appears in the comparison are ignored.

A better approach is to write a query that considers all of the following five situations.

In the following cases, VALUE_OLD and VALUE_NEW are the same:

1. VALUE_OLD is null, and VALUE_NEW is null.
2. VALUE_OLD is not null, VALUE_NEW is not null, and VALUE_OLD = VALUE_NEW.

In the following cases, VALUE_OLD and VALUE_NEW are different:

3. VALUE_OLD is null, and VALUE_NEW is not null.
4. VALUE_OLD is not null, and VALUE_NEW is null.
5. VALUE_OLD is not null, VALUE_NEW is not null, and VALUE_OLD <> VALUE_NEW.
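The discrepancy cases (3, 4, and 5) can be combined into a single WHERE clause. The following sketch demonstrates the idea using Python's built-in sqlite3 module (used here only for illustration; the same clause works in Oracle and other databases):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl_data_a (id integer, value_old integer, value_new integer)")
conn.executemany("insert into tbl_data_a values (?, ?, ?)",
                 [(1, 234, None), (2, None, 567), (3, 789, 789)])

# Naive comparison: rows with NULL on either side are silently ignored
naive = conn.execute(
    "select count(*) from tbl_data_a where value_old <> value_new").fetchone()[0]
print(naive)   # 0 -- misses both real discrepancies

# NULL-safe comparison covering the three discrepancy cases
null_safe = conn.execute("""
    select count(*) from tbl_data_a
    where (value_old is null and value_new is not null)
       or (value_old is not null and value_new is null)
       or (value_old is not null and value_new is not null
           and value_old <> value_new)""").fetchone()[0]
print(null_safe)   # 2 -- rows 1 and 2 are correctly flagged
```

The naive query reports zero discrepancies while the NULL-safe version correctly flags the two rows where one side is NULL.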

Monday, March 04, 2013

Custom model score vs credit bureau score

In a previous post, we compared the performance of different types of models applied to the same data set. In practice, a more frequently encountered issue is comparing the performance of credit bureau scores and custom model scores. By custom model scores, we mean scores from a model built on the client's own historical data. I have done numerous predictive modeling projects of this kind across industries. My conclusion is that custom model scores are almost always better than generic credit bureau scores, by large margins. The following gain charts are from a real project. For example, if we reject the worst 20% of customers based on their bureau scores, we can stop 26% of loan defaults. If we reject the worst 20% based on custom model scores, almost 40% of loan defaults can be stopped. This can easily translate into big savings for a company with a large number of customers.



This pattern is common. Thus, it is worthwhile to build a custom model, which will almost always outperform generic credit bureau scores as long as the client has enough historical data.
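For readers who want to see how such gain-chart points are computed, here is a hedged sketch with synthetic data (the scores and default rates below are made up; in a real project they would come from the bureau file or the custom model):

```python
import random

random.seed(0)

# Synthetic portfolio: a hypothetical risk score where higher means riskier.
customers = []
for _ in range(10000):
    score = random.random()
    defaulted = random.random() < 0.05 + 0.25 * score  # riskier scores default more
    customers.append((score, defaulted))

def gain_at(customers, reject_fraction):
    """Fraction of all defaults captured by rejecting the worst-scored customers."""
    ranked = sorted(customers, key=lambda c: c[0], reverse=True)
    cutoff = int(len(ranked) * reject_fraction)
    total_defaults = sum(d for _, d in customers)
    caught = sum(d for _, d in ranked[:cutoff])
    return caught / total_defaults

print(f"defaults stopped by rejecting worst 20%: {gain_at(customers, 0.20):.0%}")
```

Computing gain_at at each rejection depth for both the bureau score and the custom score, and plotting the two curves, yields a gain chart like the one described above.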