Problem
Random sampling is a very common task in data analytics projects. Sometimes we want to sample a precise number of records, as described in the post Take Randomly Sampled Records From the Table. Other times we randomly split a data set into training and testing sets at a 70/30 (or other) ratio. How do we efficiently perform multiple random sampling tasks?
Solution
I have found it convenient to create a permanent table that contains the unique record id (or account id, card number, etc.) and a uniformly distributed random number. The following example, taken from Take Randomly Sampled Records From the Table, creates such a table; dbms_random.value returns a uniformly distributed random number ranging from 0 to 1.
SQL> create table tbl_emp_id_rnd as select EMPLOYEE_ID,
dbms_random.value rnd from EMPLOYEES;
Table created.
SQL> select * from tbl_emp_id_rnd where rownum <=10;
EMPLOYEE_ID RND
----------- ----------
100 .466996031
101 .325172718
102 .643593904
103 .822225992
104 .657242181
105 .244060518
106 .446914037
107 .423664122
108 .033736378
109 .405546964
10 rows selected.
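The stored rnd column can also drive exact-size samples: ordering by rnd and keeping the first N rows selects N records uniformly at random. A sketch, assuming the tbl_emp_id_rnd table above; the sample size of 20 is arbitrary:

SQL> select EMPLOYEE_ID from
       (select EMPLOYEE_ID from tbl_emp_id_rnd order by rnd)
     where rownum <= 20;

Because rnd is stored, this query returns the same 20 records every time it runs, which is useful for reproducibility; to draw a fresh sample, regenerate the rnd column.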
We can perform many types of random sampling based on such a table whenever we need to. For example, we can split the records into 70% training and 30% testing set. And we can change the ratio of 70/30 easily.
SQL> create view v_training_emp_id as select EMPLOYEE_ID
     from tbl_emp_id_rnd where rnd <= 0.7;
View created.
SQL> create view v_testing_emp_id as select EMPLOYEE_ID
     from tbl_emp_id_rnd where rnd > 0.7;
View created.
I have found such a table extremely helpful in projects where random sampling is performed multiple times.
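Beyond a two-way split, the same rnd column can assign each record to one of k roughly equal folds, for example for cross-validation. A sketch assuming k = 10; the view name v_emp_fold is illustrative, not from the original post:

SQL> create view v_emp_fold as
     select EMPLOYEE_ID, trunc(rnd * 10) + 1 fold
     from tbl_emp_id_rnd;

Since rnd is uniform on (0, 1), trunc(rnd * 10) + 1 yields fold numbers 1 through 10, each holding about one tenth of the records.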