Wednesday, February 26, 2014

NULL Value and Order By

By default, when we order the records by columns, the null values are ranked last.

SQL> select * from TBL_TEST3 order by a, b;

        ID          A          B
---------- ---------- ----------
         1          1          2
         2          1
         4          2          2
         3
We can place rows with nulls first by specifying "nulls fist".
SQL> select * from TBL_TEST3 order by a nulls first, b;

        ID          A          B
---------- ---------- ----------
         3
         1          1          2
         2          1
         4          2          2

NULL Value and Logical Comparison

Logical comparisons in where clause of queries invloving NULL values will be ignored. We use the following table to illustrate this.

SQL> select * from tbl_test3;

        ID          A          B
---------- ---------- ----------
         1          1          2
         2          1
         3
         4          2          2

SQL> select * from tbl_test3 where a=b;

        ID          A          B
---------- ---------- ----------
         4          2          2

SQL> select * from tbl_test3 where a<>b;

        ID          A          B
---------- ---------- ----------
         1          1          2
It is interesting to notice that 1. NULL=NULL is not true; 2. NULL<>NULL is not true either.

NULL Value and Arithmetic Operations

Any arithmetic operations invlove NULL values will return NULL. We use the following table to illustrate this.

SQL> select * from tbl_test3;

        ID          A          B
---------- ---------- ----------
         1          1          2
         2          1
         3
         4          2          2

SQL> select t.*, a+b from tbl_test3 t;

        ID          A          B        A+B
---------- ---------- ---------- ----------
         1          1          2          3
         2          1
         3
         4          2          2          4

Tuesday, February 25, 2014

Four ways of Calculating Average Value

In the post Calculate Average Value - Watch Out NULL Values, we mentioned that we need to pay attention to how the NULL values are handled by the database. Using the following table as an example (that has two records with NULL values), we will show four ways of calculating average.

SQL> select * from tbl_test2;

        ID      VALUE
---------- ----------
         1          2
         2          3
         3          4
         4          5
         5
         6

6 rows selected.
Using sum(value)/count(*) will return the wrong result.
SQL> select sum(value)/count(*) from tbl_test2; (wrong!)

SUM(VALUE)/COUNT(*)
-------------------
         2.33333333
The following four queries are correct.
Method 1. Using avg() function.
SQL> select avg(value) from tbl_test2;

AVG(VALUE)
----------
       3.5
Method 2. Using sum(value)/count(value) function. Count(value) will ignore NULL values.
SQL> select sum(value)/count(value) from tbl_test2;

SUM(VALUE)/COUNT(VALUE)
-----------------------
                    3.5
Method 3. Using sum(value)/sum(case when value is null then 0 else 1 end).
SQL> select sum(value)/sum(case when value is null then 0 else 1 end) from tbl_test2;

SUM(VALUE)/SUM(CASEWHENVALUEISNULLTHEN0ELSE1END)
------------------------------------------------
                                             3.5
Method 4. Using sum(value)/count(*) and put "not null" condition in where clause.
SQL> select sum(value)/count(*) from tbl_test2 where value is not null;

SUM(VALUE)/COUNT(*)
-------------------
                3.5

Monday, February 24, 2014

Calculate Average Value - Watch Out NULL Values

Calculating the average or mean is one of the most common tasks. Average or mean of a variable is simply the sum of values divided by the number of data points. However if we do not understand how null values are handled by a database, mistakes may happen. Taking the following table as an example. It has 6 records and two of them have null in column value.

SQL> select * from tbl_test2;

        ID      VALUE
---------- ----------
         1          2
         2          3
         3          4
         4          5
         5
         6

6 rows selected.
We calculate the average value using avg() function.
SQL> select avg(value) from tbl_test2;

AVG(VALUE)
----------
       3.5
The result is 3.5. If we use the following query to calculate the total number of records (6), the sum of value(14) and the average as the sum of value divided by number of records (14/6), the average value is 2.33.
SQL> select  count(*) tot_num, sum(value) sum_value, sum(value)/count(*) as avg_2 
from tbl_test2;

   TOT_NUM  SUM_VALUE      AVG_2
---------- ---------- ----------
         6         14 2.33333333
We can see the average value calcuated by avg(value) and sum(value)/count(*) is very different. Why does this happen? The reason is that when Oracle calculates function avg(value), the records with null value are excluded from consideration. We can add a condition to the second query to specifiy only records with non-null value are taking into consideration as the following.
SQL> select  count(*) tot_num, sum(value) sum_value, 
            sum(value)/count(*) as avg_2  
from tbl_test2 where value is not null;

   TOT_NUM  SUM_VALUE      AVG_2
---------- ---------- ----------
         4         14        3.5
Now, the sum(value)/count(*) returns the same value as avg(value). As we can see, it is important to understand how NULL values are handled in a database.

Saturday, February 22, 2014

Display How Long to Run a Query

Sometimes, we want to measure how long it takes to run a query. In sqlplus, this can be done by command "set timing on". The time elapsed for runing a query will be displayed after "set timing on".

SQL> set timing on;
SQL> select count(1) from PLANET;

  COUNT(1)
----------
         9

Elapsed: 00:00:00.03
SQL> select count(1) from user_objects;

  COUNT(1)
----------
      1099

Elapsed: 00:00:00.09
SQL>

Friday, February 21, 2014

Create or Replace Oracle Table

Problem

We can use "create or replace view" to create a view if it does not exist or replace it if it exits. However, there is not such thing as "create or replace table". How do we create a table if it does not exist or replace the table if it exists?

Solution

The following PL/SQL does preisely that. We store the script as a file create_or_replace.sql. It first tries to drop the table. If it does not exist, an exception will be raised and nothing is done. Then the script continues to create the table. We put the exception inside the begin/end block so that the exception will only break the first block and will continue to execute the "create table" statement outside of the block.
begin
  begin
   execute immediate 'drop table tbl_test';
  exception when others then
   NULL;
  end;
  execute immediate 'create table tbl_test (id number, value number)';
end;
/
We run the scripts twice. In the first time, table tbl_test were created. In the second run, table tbl_test was dropped and then created.
SQL> @create_or_replace.sql

PL/SQL procedure successfully completed.

SQL> @create_or_replace.sql

PL/SQL procedure successfully completed.

Wednesday, February 19, 2014

More on Calculating Histogram Using Oracle Function

In the older post Calculate Histogram Using Oracle Function, we showed how to use functions width_bucket() and ratio_to_report(). To display histogram visually, function lpad() can be used.
SQL> with
  2   tbl as (
  3   select min(NUM) low, max(NUM) high
  4   from TBL_1K_RND),
  5   tbl2 as (
  6   select width_bucket(NUM, low, high, 10) s, NUM
  7   from TBL_1K_RND, tbl
  8   )
  9   select s,
 10   min(NUM) lower, max(NUM) upper,
 11   count(1) num,
 12   round((ratio_to_report(count(1)) over())*100,1) pcnt,
 13   lpad('*', round((ratio_to_report(count(1)) over())*100,0), '*') histogram
 14   from tbl2 group by s order by s;

         S      LOWER      UPPER        NUM       PCNT HISTOGRAM
---------- ---------- ---------- ---------- ---------- ------------------------------
         1 -3.0602851 -2.5082225          7         .7 *
         2 -2.4379864   -1.84617         35        3.5 ***
         3 -1.8150501 -1.2060425         79        7.9 ********
         4 -1.1991025 -.58413539        164       16.4 ****************
         5 -.58247189 .029088646        224       22.4 **********************
         6 .040683615  .65371762        242       24.2 ************************
         7 .672280301 1.27225368        151       15.1 ***************
         8 1.27500147 1.82004315         61        6.1 ******
         9 1.90959272 2.47850573         33        3.3 ***
        10 2.56404876 3.00088262          4         .4
        11  3.1319198  3.1319198          1         .1

11 rows selected.

Saturday, February 15, 2014

Lpad Function For Data Visualization- Making Bar Chart

Problem

Lpad() is a very cool function that can be used for data visualization. It takes three parameters as an example shown below. The first parameter is the string to display('hello'), the second one is the length of the resulting string (10), the third one if the character(*) that will be used to fill the spaces on the left.

SQL> select lpad('hello',10,'*') s from dual;

S
----------
*****hello
We can use lpad to display "bar chart" based on the data. For example, we have the following daily stock price data. How do we display the price variation visually?
SQL> select * from stock;

SYM DAT            PRICE
--- --------- ----------
XYZ 31-MAY-11         14
XYZ 01-JUN-11         19
XYZ 02-JUN-11         21
XYZ 03-JUN-11         23
XYZ 04-JUN-11         27
XYZ 05-JUN-11         14
XYZ 06-JUN-11         17
XYZ 07-JUN-11         22
XYZ 08-JUN-11         26
XYZ 09-JUN-11         27
XYZ 10-JUN-11         21
XYZ 11-JUN-11         17
XYZ 12-JUN-11         27
XYZ 13-JUN-11         27
XYZ 14-JUN-11         16
XYZ 15-JUN-11         14
XYZ 16-JUN-11         16
XYZ 17-JUN-11         26
XYZ 18-JUN-11         25
XYZ 19-JUN-11         24

20 rows selected.

Solution

We use the following query. Lpad('*', price, '*') means it will make a string "*" of the length of "price" with left spaces filled with '*'.

SQL> select dat, price, lpad('*', price, '*') price_bar 
from stock;

DAT            PRICE PRICE_BAR
--------- ---------- ----------------------------------------
31-MAY-11         14 **************
01-JUN-11         19 *******************
02-JUN-11         21 *********************
03-JUN-11         23 ***********************
04-JUN-11         27 ***************************
05-JUN-11         14 **************
06-JUN-11         17 *****************
07-JUN-11         22 **********************
08-JUN-11         26 **************************
09-JUN-11         27 ***************************
10-JUN-11         21 *********************
11-JUN-11         17 *****************
12-JUN-11         27 ***************************
13-JUN-11         27 ***************************
14-JUN-11         16 ****************
15-JUN-11         14 **************
16-JUN-11         16 ****************
17-JUN-11         26 **************************
18-JUN-11         25 *************************
19-JUN-11         24 ************************

20 rows selected.

Sunday, February 09, 2014

Calculate Confusion Matrix Using SQL

Problem

Confusion matrix is often used to measure the accuracy of a predictive model. For example, for a model that predicts binary outcomes, if we compare the prediction and actual outcome, say 0 or 1, there are four possibilities: true positive (predicted 1 and actual 1), false positive (predicted 1 and actual 0), true negative (predicted 0 and actual 0) and false negative (predicted 0 and actual 1)as shown in the diagram below.

How to calculate the confusion matrix using SQL?

Solution

We can easily calculate the confusion matrix using SQL "group by". For example, the following query first applies model GLM1031E to data set DATASET_TEST using prediction() function. Then "group by" is used to calculate the number of records for all combinations of predicted and actual outcomes.

SQL> with tbl as (select prediction(GLM1031E using * ) predicted, 
account_default as actual from DATASET_TEST) select predicted, actual, 
count(1) 
from tbl group by predicted, actual;

 PREDICTED     ACTUAL   COUNT(1)
---------- ---------- ----------
         1          0          3
         0          0        437
         1          1         41
         0          1          3
I have found that in real world application people pay a great deal of attention to false positive. For example, in credit card fraud detection, there are only a few fraudulent transactions out of ten thousand. If our predictive model can detect correctly one true frauds (true positive) while make three false alarms (false positive), it is considered pretty good.

Start Oracle 12c Multitenant Database

Problem

When Oracle 12c was released a few months ago, I downloaded a copy and installed it on my laptop. I built a number of predictive models within a pluggable database. Today, I wanted to review my models and could not connect to the pluggable database on my laptop. The following is the error message that I got

C:\projects\sql>sqlplus dev/XXXXXX@//localhost:1521/testdb01

SQL*Plus: Release 12.1.0.1.0 Production on Sun Feb 9 06:34:27 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

ERROR:
ORA-01033: ORACLE initialization or shutdown in progress
Process ID: 0
Session ID: 0 Serial number: 0

Solution

I realized that I have rebooted my laptop and need to restart my pluggable database. To do this, I connected to the databbase as sys user.

C:\projects\sql>Sqlplus sys/xxxxx  as sysdba

SQL*Plus: Release 12.1.0.1.0 Production on Sun Feb 9 06:38:07 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL>
I took a look at the pluggable databases that I have created.
SQL> show pdbs

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       READ ONLY  NO
         3 PDBORCL                        MOUNTED
         4 TESTDB01                       MOUNTED
As we can see, TESTDB01 is not open. I open the database.
SQL> alter pluggable database TESTDB01 open;

Pluggable database altered.
SQL> show pdbs

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       READ ONLY  NO
         3 PDBORCL                        MOUNTED
         4 TESTDB01                       READ WRITE NO

I quit my sys session and I was able to connect to the plugable database TESTDB01 and check my predictive models built earlier.
SQL> quit
$sqlplus dev/XXXX@//localhost:1521/testdb01

SQL> select object_name, object_type from user_objects where 
object_type='MINING MODEL' order by object_name;

OBJECT_NAME                              OBJECT_TYPE
---------------------------------------- -----------------------
AI0929                                   MINING MODEL
AR1029                                   MINING MODEL
DT1029                                   MINING MODEL
GLM0115                                  MINING MODEL
GLM1031A                                 MINING MODEL
GLM1031B                                 MINING MODEL
GLM1031C                                 MINING MODEL
GLM1031E                                 MINING MODEL
KM1031C                                  MINING MODEL
KM1211                                   MINING MODEL
KM_MODEL                                 MINING MODEL
KM_MODEL_TRY1                            MINING MODEL
NB1021                                   MINING MODEL
OC_SH_CLUS_SAMPLE                        MINING MODEL
SVD0119                                  MINING MODEL
SVM1029                                  MINING MODEL
TMSVD1                                   MINING MODEL


17 rows selected.
Oracle 12c multitenant database is new and it took me a few hours to figure out how to manage the pluggable databases. I could build predictive models using the same PL/SQL code that worked on Oracle 11g. I have found that it is helpful for a data miner to know some basic database administrator stuff.