Wednesday, October 26, 2016

Ranking High in Kaggle Competition is a Huge Advantage for Job Seekers

For someone who is looking for a job in data analytics field, high rankings in Kaggle Competition will give him tremendous advantage. Employers see the competition winners having strong problem solving skills and hands-on expertise. Indeed, to be able to complete some projects for Kaggle competitions, participants have to put in a lot of effort to deal with data issues even before any predictive models can be built. This is very similar to that in the real world applications where 80% of time is spent on data cleanse and manipulation. It is not a surprise that some employers prefer Kaggle competition winners over PhD graduates whose skills are perceived as more theoretical.

Taking my nephew, Yuyu Zhou, as an example. He got a master degree in data analytics. While he was in school, he spent a few weeks with other classmates to participate in Kaggle competitions. His team has achieved the top 3% and 5% in two Kaggle prediction competitions respectively (see my blog post a Young Data Scientist- Kaggle Competition Top 5% Winner: Yuyu Zhou. Once he graduated, he was quickly hired by a prestigious company and has been earning PhD level salary. Those few weeks he spent on Kaggle Competition was the best time investment of his life.

Thursday, October 20, 2016

Oracle Ora_Hash function- Part 1 Random Sampling

Oracle ora_hash() is a very useful function. I have used it for different purposes such as generating random number. The following query generate 5 buckets from 0 to 4, each of them have the similar number of records.
First, we create a table and populate it with 1,000 numbers.

create table t_n (n number);

begin
for i in 1..1000
 loop 
 insert into t_n values(i);
 end loop;
 commit;
end;
In the query below, the parameter 5 of ora_hash defines the number of buckets is 5. As we see, each bucket has simlilar number of records.
with tbl as
(
select ora_hash(n, 5) bucket, n  from t_n)
select bucket, count(*), min(n), max(n) from tbl
group by bucket order by 1;
BBUCKET COUNT(*) MIN(N) MAX(N)
0 154          2  993
1 164          7  999
2 168          6  991
3 175          4  995
4 173          8 1000

Saturday, July 02, 2016

SQL Listagg Function

Listagg is a useful function to combine column values from different rows. In a sense, it is similar to sum() function that aggregates information across records.
To show how listagg works, first we create a table that has record id, group id and words.

create table t(id number, gp number, word varchar2(128));

insert into t values(1,1,'Hello');
insert into t values(2,1,'World!');
insert into t values(3,2,'Good');
insert into t values(4,2,'Morning!');
select * from t;

ID GP WORD
1 1 Hello
2 1 World!
3 2 Good
4 2 Morning!

We can combine the words by group using listagg function.
select gp, listagg(word,' ') within group(order by id) 
from t group by gp order by gp;

GP LISTAGG(WORD,'')WITHINGROUP(ORDERBYID)
1 Hello World!
2 Good Morning!

Saturday, June 11, 2016

Run Oracle SQL Online

I have found an excellent website to learn Oracle PL/SQL Oracle SQL live. From the website, we can run SQL or PL/SQL commands. There are also many scripts in Code Library that we can learn from. Here are screenshots.

Wednesday, May 11, 2016

Featured Company: ZSAnalytics

ZSAnalytics is a business partner bringing advanced process capabilities to its customers at the best possible cost resulting in the highest savings. ZSA provides customer’s access to Productized Analytics, advanced Analysis and subject matter expertise resulting in increased operational efficiency and lowest total operating cost. Their solutions and services have saved our clients hundreds of millions. They include:

  • ZSales: 70% increase in response rate
  • ZClaim: 40% reduction in insurance claim losses
  • ZCheck: 70% reduction in charge off
  • ZCredit: 50% reduction in delinquency rate 
In the following youtube video clips, ZSAnalytics' CEO Chris Sandusky is explaining how ZSA's solutions lower the overall business cost and the advantages of predictive models over intuitive rules in fraud detection.

Lower the Overall Business Cost

 


Predictive Models vs Intuitive Rules 

    Sunday, March 06, 2016

    the Danger of Predictive Model Overfitting

    In the post a Young Data Scientist- Kaggle Competition Top 5% Winner: Yuyu Zhou, Yuyu talks about the important role of feature engineering, i.e, finding good derived variables, and gradient boosting trees in their success. He also tells me a very interesting observations on Kaggle Competition ranking.

    "After participating teams finished building their predictive models, they apply their models to two data sets to generate predictions: a smaller set containing target variable and a larger data set where the target variable is removed. Each participating team's model is temporarily ranked based on the result on the smaller data set with the target variable. After the competition's deadline is due, Kaggle will calculate the final ranking of each team based on a model's prediction on the larger data set. "
    "It is interesting to see that the rankings of some top 1% models based on the smaller data set drop more than 20% on the larger data set. I figure out what might cause the huge discrepancies in their model performance. Those teams' models fit the smaller data set so well that they lose their capability to generalize. It is a typical overfitting problem."
    It is important to avoid model overfitting. After all, a predictive model is only useful if it can generalize, i.e., able to handle new data reasonably well.

    Saturday, March 05, 2016

    a Young Data Scientist- Kaggle Competition Top 5% Winner: Yuyu Zhou


    Yuyu Zhou is a graduate student in Analytics in University of New Hampshire. His team has achieved the top 3% and 5% in two Kaggle prediction competitions respectively. In an interview, I asked him how their predictive models performed so well. Yuyu said,

    "One of the keys to the success is that we spend tremendous amount of time working on building feature variables. Those variables are usually the results of combining several raw variables. For example, the ratio between the body weight and height is a better variable in predicting a patient's health than using body weight or height alone."
    "My training in computer science is extremely helpful in these projects. I am able to write Java, Python and SQL scripts to perform tasks such as data cleansing, data merge, and data transform, etc. As we know, more than 80% of time in a project is typically spent on those tasks before we start building predictive models."
    "We have tried many type of predictive models and found that gradient boosting trees have consistently perform the best."

    The following is a summary of Yuyu's contribution in those two projects.

    Kaggle Competition: Rossmann Store Sales Prediction (ranked top 5%) Oct 2015 – Dec 2015

    • Built the Predictive Model for daily sales for Rossmann Stores using Python Machine Learning library.
    • Conducted data cleaning and feature engineering for increasing data quality.
    • Designed final prediction model by combining the multiple gradient boosting trees algorithms
    • Prediction accuracy was ranked at 163 out of 3303 teams

    Kaggle Competition: Property Risk Level Prediction (ranked top 3%) July 2015 – Aug 2015

    • Developed Statistics models to predict risk level of properties which Liberty Mutual Inc is going to protect.
    • Led the team and conducted cost and benefit analysis on new ideas.
    • Implemented ideas using statistical packages from Python.
    • Prediction accuracy was ranked at 71 out of 2236 teams.
    Yuyu is currently looking for a full time job in data analytics. Please feel free to contact him if you are hiring. He can be reached by email yuyu.zhou@hotmail.com or phone (508) 933-7311. Here is his LinkedIn Profile.

    Friday, March 04, 2016

    Find Tablespaces for Oracle Tables

    We may run the following query from SYS to find the tablespace names for all tables, including IOTs (index organized table) and partitioned tables (replacing 'DMUSER' with your user name).

    select  u.name owner, o.name table_name,t.name tablspace_name 
    from
     obj$ o,  ts$ t, sys_objects s, user$ u
    where o.obj#=s.object_id and s.ts_number=t.ts#
          and o.owner#=u.user# 
          and o.type#=2
          and u.name='DMUSER'
    order by 1,2;
    OWNER     TABLE_NAME              TABLSPACE_NAME
    -------------------------------- -------------------------------- 
    DMUSER     AI_EXPLAIN_OUTPUT         TBS_1
    DMUSER     AR_SH_SAMPLE_SETTINGS     TBS_1
    DMUSER     DM$P0AR_SH_SAMPLE         TBS_1
    DMUSER     DM$P0AR_SH_SAMPLE_2COL    TBS_1
    DMUSER     DM$P0EM_SH_CLUS_SAMPLE    TBS_1
    DMUSER     DM$P0NB_SH_CLAS_SAMPLE    TBS_1
    DMUSER     DM$P0OC_SH_CLUS_SAMPLE    TBS_1
    DMUSER     DM$P1EM_SH_CLUS_SAMPLE    TBS_1
    DMUSER     DM$P1NB_SH_CLAS_SAMPLE    TBS_1
    DMUSER     DM$P1OC_SH_CLUS_SAMPLE    TBS_1
    

    Wednesday, March 02, 2016

    a Random Number Generator

    Rand Number Generator

    Wednesday, February 17, 2016

    Dr. Jiang Zhou's Presentation at UNH Analytics

    I shared my analytics experience with a group of students and professors from University of New Hampshire Analytics on February 25. I really enjoyed the interaction with professors and students. The following is the news on UNH Graduate Analytics LinkedIn page.

    UNH Graduate Analytics would like to thank Dr. Jiang Zhou for his wonderful presentation this morning, which was part of our Analytics Speaker Series. 

    Dr. Zhou highlighted the advantages and disadvantages of using statistical models vs. intuition to solve business problems throughout his 10+ years of experience. Among his many accomplishments, Dr. Zhou built a fraud detection model for large banks, which detects 70% more fraud than using business intuition rules alone.

    Free Data Sources

    Bernard Marr wrote a post Big Data: 33 Brilliant And Free Data Sources For 2016. He gives a list of free data sources provided by governments, social websites and companies including Google and Facebook. The data types include census, climate, financial, health, news, etc. I have found the list very interesting.

    Springboard "19 Free Public Data Sets For Your First Data Science Project" has good information about free data sets such as United States Census Data, FBI Crime Data and CDC Cause of Death etc.

    Friday, December 25, 2015

    Convert Values Using Decode Function

    We use decode() function to convert a value to other values. We have the following table.

    SQL> select * from tbl_test order by id;
    
         ID MESSAGE
    ------- --------------------------------
          1 A
          2 B
          3 C
          4 D
    
    The following query transforms characters values into numbers or weights.
    SQL> select a.*, decode(message, 'A', 0.3, 'B',0.4, 0.35) weight from tbl_test a order by id;
    
         ID MESSAGE                              WEIGHT
    ------- -------------------------------- ----------
          1 A                                        .3
          2 B                                        .4
          3 C                                       .35
          4 D                                       .35
    
    In the above query, A is mapped to 0.3, B to 0.4. All other values are mapped to the default value of 0.35. If the default value is not defined, other characters where the mapping is not defined will be given a NULL value.
    SQL> select a.*, decode(message, 'A', 0.3, 'B',0.4) weight from tbl_test a order by id;
    
         ID MESSAGE                              WEIGHT
    ------- -------------------------------- ----------
          1 A                                        .3
          2 B                                        .4
          3 C
          4 D
    
    I have used decode function to convert categorical values into weights that are derived from a logistic regression model. For example, I used the following query to convert transaction code into weights. Before decode function is applied, CARD_TXN_CDE is cleaned up using trim, substr and nvl functions. The trim function removes leading and trailing blanks. The substr function extracts the first 18 characters of the string. The nvl function converts NULL values to blank.
    decode(nvl(substr( trim(CARD_TXN_CDE),1,18),' '),
    ' ',-.660841,
    '01',-.518927,
    '1',-.076546,
    '10',-.294631,
    '12',.077699,
    '14',-.709884,
    '25',-.30619,
    '40',.021855,
    '51',-.004593,
    '52',-.069521,
    '53',-.000344,
    '57',.570421,
    '59',.858444,
    '67',1.481654,
    '81',.29988,
    '91',-.004755,
    '96',.02628,
    'AD',-.001036,
    'AK',-.150162,
    'CB',-.001588,
    'D5',.364975,
    'G',-.015795,
    'H',-.274374,
    'I',-1.065177,
    'J',-.027991,
    'N',.157622,
    'NK',.763406,
    'NR',.080558,
    'P',-.961133,
    'PE',-.19558,
    'PH',.479081,
    'PR',.134741,
    'S',-.239287,
    'SV',.475934,
    'T1',.241061,
    'T2',-.277572,
    'T3',.901487,
    'T4',-.0137,
    'TS',.01362,
    'U2',-.186914,
    'X',1.301152,
    'XX',-.11462,
    'Z0',-.7141,
    'Z4',-.004642,
    'Z6',-.014541,
    0)
    

    Thursday, July 30, 2015

    Remove the Rows with Lowest or Highest Values

    In a project, I need to remove rows with lowest or highest values within groups. I used the following table to illustrate the problem. For grp 10, rows with val 1 and 4 need to be removed and for grp 11, rows with val 5.5 and 9.6.

    SQL> select * from tbl_x2 order by 1,2;
    
           GRP        VAL
    ---------- ----------
            10          1
            10          2
            10          3
            10          4
            11        5.5
            11        7.3
            11        9.6
    
    7 rows selected.
    
    I used the following SQL statement to do it.
    delete from tbl_x2 a
    where a.rowid in
    (
    select rid from
    (
    with tbl as (
    select b.rowid rid,
    row_number() over(partition by grp order by val) rnk,
    row_number() over(partition by grp order by val desc) rnkr
    from tbl_x2 b)
    select rid from tbl where rnk=1 or rnkr=1
    )
    );
    
    The following query shows that rows with lowest or highest values within groups are removed.
    SQL> select * from tbl_x2 order by 1,2;
    
           GRP        VAL
    ---------- ----------
            10          2
            10          3
            11        7.3
    
    The trick is to use row_number() function to generate ranks or reverse ranks based on val for each grp. For example, the following query show all rows in the original table and the ranks (rnk) and reversed ranks (rnkr). The lowest val has a rank of 1 and the highest val has a reversed rank of 1.
    select b.*,
    row_number() over(partition by grp order by val) rnk,
    row_number() over(partition by grp order by val desc) rnkr
    from tbl_x3 b
    
           GRP        VAL        RNK       RNKR
    ---------- ---------- ---------- ----------
            10          1          1          4
            10          2          2          3
            10          3          3          2
            10          4          4          1
            11        5.5          1          3
            11        7.3          2          2
            11        9.6          3          1
    

    Friday, April 10, 2015

    Extract SQL Source Code For Objects From Datapump Dumpfile

    When we dump the whole schema as a dumpfile, the SQL statements for creating those database objects (DDL) can be extracted from the dumpfile.
    First, we create the dumpfile using Oracle datapump export utility expdp.

    expdp myuser/myuser dumpfile=myuserdump.dmp directory=myuser_dir
    
    We use Oracle datapump import utility impdp and specify "sqlfile=".
    impdp myuser/myuser dumpfile=myuserdump.dmp directory=myuser_dir 
    sqlfile=myuserdump.sql
    
    When "sqlfile=" is used, data import is not actually done. Instead, it produces a file containing the SQL statements to create database objects. For example, the following "create table" and "create index" are taken from the sqlfile produced by impdp. We can simple run those SQL queries to create objects.
    CREATE TABLE "MYUSER"."DMPZKM" 
       ( "CLUSTER_ID" NUMBER, 
     "RECORD_COUNT" NUMBER, 
     "PARENT" NUMBER, 
     "TREE_LEVEL" NUMBER, 
     "DISPERSION" BINARY_DOUBLE, 
     "NORM" BINARY_DOUBLE, 
      CONSTRAINT "DMPSKM" PRIMARY KEY ("CLUSTER_ID") ENABLE
       ) ORGANIZATION INDEX NOCOMPRESS PCTFREE 10 INITRANS 2 MAXTRANS 255 LOGGING
      STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
      PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
      BUFFER_POOL DEFAULT);
    
    CREATE UNIQUE INDEX "MYUSER"."DMPQKM" ON "MYUSER"."DMPXKM" 
    ("CLUSTER_ID", "ATTRIBUTE_NAME", "ATTRIBUTE_SUBNAME", "BIN_ID") 
      PCTFREE 10 INITRANS 2 MAXTRANS 164 
      STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
      PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
      BUFFER_POOL DEFAULT)
      TABLESPACE "DMUSER" PARALLEL 1 ;
    
    

    Sunday, April 05, 2015

    Table Joining Using Function Based Index

    The post Create Function Based Index on Oracle Table shows that indexes can be created based on functions. Here is an example of the actual use of function based index in table joining. In a project, I try to join the following two tables based on a unique sequence numbers:
    1. Checks deposited, tbl_checks_deposited.
    2. Checks that are returned, tbl_checks_returned.
    Both tables contain column sequence_ids that are unique for each record and are supposed to link these two tables. I used the following query to join them.

    select a.*, b.return_code
    from  tbl_checks_deposited a,  tbl_checks_returned b
    where a.sequence_id=b.sequence_id;
    
    However, I did not get many successful matches. After some investigation, I have found that the sequence_id in the second table tbl_checks_returned has extra leading and trailing blanks. This could happen when the data is loaded from text files into an Oracle database. I used trim() function to remove the leading and trailing blanks and redo the join using the following query. To make the join faster, I first create a function based index on table tbl_checks_returned.
    create index tbl_checks_returned_idx on tbl_checks_returned(trim(sequence_id));
    
    select a.*, b.return_code
    from  tbl_checks_deposited a,  tbl_checks_returned b
    where a.sequence_id=trim(b.sequence_id);
    
    Now the join returns the correct number of records.

    Remove Character at Specific Location in String

    In the post Remove the Trailing Character From a String Using SQL, we use substr SQL function. To remove a character at a specific location in a string, I create a function that takes advantage of substr again as shown below.

    create or replace function remove_at(
       str IN varchar2,
       pos IN number)
    return varchar2
    is
    begin
    return substr(str,1,pos-1)||substr(str, pos+1, length(str));
    end;
    /
    
    Now, I apply this function using different locations and the results look correct.
    SQL> select id, val from tbl_val;
    
         ID VAL
    ------- --------------------------------
          1 abcdef
    SQL> select id, remove_at(val,3) val from tbl_val;
    
         ID VAL
    ------- --------------------------------
          1 abdef
    
    SQL> select id, remove_at(val,5) val from tbl_val;
    
         ID VAL
    ------- --------------------------------
          1 abcdf
    
    SQL> select id, remove_at(val,1) val from tbl_val;
    
         ID VAL
    ------- --------------------------------
          1 bcdef
    
    SQL> select id, remove_at(val,0) val from tbl_val;
    
         ID VAL
    ------- --------------------------------
          1 abcdef
    
    SQL> select id, remove_at(val,10) val from tbl_val;
    
         ID VAL
    ------- --------------------------------
          1 abcdef
    

    Saturday, April 04, 2015

    Remove the Trailing Character From a String Using SQL

    We can using SQL function substr to remove the last character from a string. For example, the following query shows there is a extra comma at the end of the string. Please also see Remove Character at Specific Location in String.

    SQL> select * from tbl_test;
    
         ID VAL
    ------- --------------------------------
          1 Hello,
    
    The following query will remove it. Substr takes three parameters. The first parameter is the string to process, the second is the starting position and the third is the length of the substring to extract. Here, we define the length to be the total length of the original string minus one.
    SQL>  select id, substr(val, 1, length(val)-1) val from tbl_test;
    
         ID VAL
    ------- --------------------------------
          1 Hello
    

    Convert Oracle Number to Hex and Vice Verse

    To convert a hexadecimal number (base 16) to a decimal number, we use to_number function and specify the format as 'XXXXX' as shown below.

    SQL> select to_number('30003','XXXXXX') from dual;
    
    TO_NUMBER('30003','XXXXXX')
    ---------------------------
           196611
    
    We can verify the conversion by using the following query.The fifth position from the right represents 16 to the 4th power.
    SQL> select power(16,4)*3+3 from dual;
    
    POWER(16,4)*3+3
    ---------------
             196611
    
    To convert a decimal number to a hex number, we use to_char() function.
    SQL> select to_char(196611, 'XXXXXX') from dual;
    
    TO_CHAR
    -------
      30003
    

    Friday, April 03, 2015

    Chi-square Test in Oracle Database

    When we want to test the association of two categorical variables, such as sex and certain disease, we may use Chi-square test. Fortunately, with Oracle statistical function stats_crosstab, this can be done using a SQL query. For example, we have a table containing two columns of discrete values, sex and target.

    SQL> select sex, target, count(*) num from tbl_chiqs group by sex, target
     order by sex, target;
    
    S     TARGET        NUM
    - ---------- ----------
    F          0        152
    F          1         13
    M          0        411
    M          1         26
    
    We want to find out if sex and target are associated. Since both are categorical variables, we perform Chi-square test using a SQL query.
    select
    stats_crosstab(sex, target, 'CHISQ_OBS') chi_squared,
    stats_crosstab(sex, target, 'CHISQ_SIG') p_value
    from
    tbl;
    
    CHI_SQUARED    P_VALUE
    ----------- ----------
     .735719671 .391035481   
    
    Based on the p value of the Chi-square statistics, we can not reject the NULL hypothesis that sex and target are independent given the significant level of 0.05. As we see, it is very convenient to perform statistical tests such as Chi-square, within a database using SQL query.It is not necessary to move the data to another statistical software to perform those tests.

    Thursday, April 02, 2015

    Oracle Impdp Remap Tablespace

    When I import a Oracle dump file produced by datapump export utility expdp from another database instance, I get the following error. This is caused the the default tablespace for the user is different.

    
    shell> impdp myuser/mypassword directory=DM_DUMP dumpfile=tmpm_r_pmml.dmp
    
    ORA-39083: Object type TABLE:"MYUSER"."DATA_TRAIN" failed to create with error:
    ORA-00959: tablespace 'MYUSER' does not exist
    
    To find our the default tablepsace name for the current user, I log onto the user.
    Shell> sqlplus myuser/mypassword
    
    SQL> select default_tablespace from user_users;
    
    DEFAULT_TABLESPACE
    ------------------------------
    TBS
    
    
    I add "remap_tablespace=DMUSER:TBS" to impdp command and the dump file is imported correctly.
    impdp dmuser/dmuser directory=DM_DUMP dumpfile=tmpm_r_pmml.dmp 
       remap_tablespace=MYUSER:TBS
    

    Thursday, March 26, 2015

    Best Practice In Loading Text Files Into Oracle Database

    The post Five Ways of Loading Text Files Into Oracle Database shows common ways of loading test files. There are many issues when loading text files into an Oracle database. These issues include:

    • Data format conversions such as strings to numbers and strings to dates
    • NULl value representation in text files. NUll are ., n/a, null, blanks, etc.?
    • Other data representation issues such as $ in front of amount.
    Thus it is important to find out effective ways to handle those data issues and make sure data are indeed loaded correctly. For many projects, I have used the following process of importing text files into an Oracle database.
    • Step 1. Define external tables pointing to the text files. In the external tables, we simply define all columns as string (varchar2). Creating an external table does not physically load the data. It just contains the "pointer" to the files. We can run SQL queries against an external table just like a regular table.
    • Step 2. Perform data cleansing and conversion using SQL functions and store the data into tables or materialized views.
    For example, the following a few lines of text files to be loaded.
    10,"ABASTECEDORA NAVAL Y INDUSTRIAL, S.A.",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
    15,"ABDELNUR, Nury de Jesus","individual","CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
    25,"ACEFROSTY SHIPPING CO., LTD.",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
    36,"AEROCARIBBEAN AIRLINES",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
    39,"AEROTAXI EJECUTIVO, S.A.",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
    41,"AGENCIA DE VIAJES GUAMA",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
    
    First, we create an external table as shown below. All columns are simply defined as varchar2 as we will do the data conversion later.
    create directory dir_files as '/home/data';
    create table tbl_sdn_ext(
      v1  varchar2(512),
      v2  varchar2(512),
      v3  varchar2(512),
      v4  varchar2(512),
      v5  varchar2(512),
      v6  varchar2(512),
      v7  varchar2(512),
      v8  varchar2(512),
      v9  varchar2(512),
      v10  varchar2(512)
    )
    organization external
    ( type oracle_loader
    default directory DIR_ORA_EXT
    access parameters
    ( records delimited by newline
    skip 1
    fields terminated by ','
    optionally enclosed by '"'
    missing field values are null
    )
    location('sdn.csv')
    );
    
    Once we have the external table, we can apply SQL functions to do various data conversions and store the results into tables physically as shown below.
    create table tbl_sdn2
    as
    with tbl as
    (
    select
    case when trim(V1)='-0-' then null else upper(trim(V1)) end V1,
    case when trim(V2)='-0-' then null else upper(trim(V2)) end V2,
    case when trim(V3)='-0-' then null else upper(trim(V3)) end V3,
    case when trim(V4)='-0-' then null else upper(trim(V4)) end V4,
    case when trim(V5)='-0-' then null else upper(trim(V5)) end V5,
    case when trim(V6)='-0-' then null else upper(trim(V6)) end V6,
    case when trim(V7)='-0-' then null else upper(trim(V7)) end V7,
    case when trim(V8)='-0-' then null else upper(trim(V8)) end V8,
    case when trim(V9)='-0-' then null else upper(trim(V9)) end V9,
    case when trim(V10)='-0-' then null else upper(trim(V10)) end V10
    from tbl_sdn_ext
    )
    select
    v1 sdn_id,
    v2 full_name,
    trim(substr(v2, 1, regexp_instr(v2,',')-1)) lnm,
    trim(substr(v2, regexp_instr(v2,',')+1, length(v2))) fnm,
    v3 dtype,
    v4 country
    from
    tbl;
    

    Tuesday, March 24, 2015

    Most Frequently Used SQL Functions by Data Scientist

    As a data scientist, I have done many projects for banks, insurance companies, marketing firms, etc. Most of the projects involve building predictive models. I have won three head to head competitions for building best predictive models.

    In a typical project, I receive the data, usually in the format for test files, from a client. Then I load them into Oracle databases. I do most of the data manipulation in the database using SQL. As a result of years of work, I have created close to 800 SQL script files under my work directories on my computers. I figure out it would be interesting to see what are the most frequently used SQL functions from a data scientist's perspective. So I perform a simple analysis on my SQL scripts. I have found that the following 45 functions accounts for about 65% of Oracle SQL functions that I have used for various projects. The following table shows ranks of frequencies of those functions as they present in my SQL scripts. As we may see, a typical data analytics project only involves a relatively small number of frequently used SQL functions.

    We may have the following observations: 1. Many of the functions are used for data cleanse and data conversion including trim, nvl, decode,to_char, to_number, trunc, substr,etc. This precisely reflects the fact that at least 80% of the work is about data manipulation. 2. Some simple statistics functions include sum, count, min, max, stddev, corr, median. It is convenient to calculate useful statistics using SQL. We may use those SQL functions to produce reports similar to that by SAS UNIVARIATE.

    Rank SQL Function Frequency
    1  trim 10.03%
    2  sum 8.22%
    3  nvl 5.93%
    4  count 5.67%
    5  decode 5.38%
    6  to_char 4.63%
    7  min 3.53%
    8  max 3.33%
    9  to_number 2.25%
    10  trunc 2.22%
    11  row_number 1.97%
    12  avg 1.88%
    13  round 1.33%
    14  to_date 1.20%
    15  substr 1.07%
    16  sign 0.93%
    17  upper 0.77%
    19  stddev 0.45%
    20  mod 0.35%
    21  lower 0.30%
    22  exp 0.28%
    23  ratio_to_report 0.27%
    24  median 0.25%
    25  abs 0.23%
    26  lag 0.23%
    27  dense_rank 0.22%
    28  length 0.22%
    29  greatest 0.18%
    30  sys_guid 0.15%
    31  ceil 0.15%
    32  chr 0.13%
    33  regexp_instr 0.10%
    35  percent_rank 0.10%
    36  replace 0.10%
    37  regexp_replace 0.10%
    38  cume_dist 0.08%
    39  instr 0.07%
    40  rtrim 0.07%
    41  width_bucket 0.07%
    42  prediction 0.07%
    43  corr 0.05%
    44  lead 0.03%
    45  lpad 0.03%

    Sunday, March 22, 2015

    The 2015 Rexer Analytics Data Miner Survey has launched!

    The 2015 Rexer Analytics Data Miner Survey has launched. The founder of Rexer Analytics, Karl Rexer, is regarded by Deep-data-mining.com as one of 10 Most Influential People in Data Analytics.

    Analytic Professionals — Share your views: Participate in the 2015 Data Miner Survey

    Data Analysts, Predictive Modelers, Data Scientists, Data Miners, and all other types of analytic professionals, students, and academics: Please participate in the 2015 Rexer Analytics Data Miner Survey.


    Survey Link: www.rexeranalytics.com/Data-Miner-Survey-2015-Intro.html
    Access Code: CL72L4It is OK to share this Access Code with others: It can be used by multiple people.
    Survey results will be unveiled at the Fall-2015 Boston Predictive Analytics World event.

    Rexer Analytics has been conducting the Data Miner Survey since 2007. Each survey explores the analytic behaviors, views and preferences of data miners and analytic professionals. Over 1200 people from around the globe participated in the 2013 survey. Summary reports (40 page PDFs) from previous surveys are available FREE to everyone who requests them by emailing DataMinerSurvey@RexerAnalytics.com. Also, highlights of earlier Data Miner Surveys are available at www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html, including best practices shared by respondents on analytic success measurement, overcoming data mining challenges, and other topics. The FREE Summary Report for this 2015 Data Miner Survey will be available to everyone Fall-2015.

    Please tell other data analysis professionals about the survey.

    Rexer Analytics is a consulting firm focused on providing data mining and analytic CRM solutions. Recent solutions include customer loyalty analyses, customer segmentation, predictive modeling to predict customer attrition and to target direct marketing, fraud detection, sales forecasting, market basket analyses, and complex survey research. More information is available at www.RexerAnalytics.com or by calling +1 617-233-8185.

    Saturday, March 21, 2015

    Find Out Database Object Dependencies

    Some database objects are depending other objects. For example, a view may query other tables or views. We can query user_dependencies to find out these dependencies.

    SQL> create view v_x_y as select a.*, b.val val2 from t_x a, t_y b 
    where a.id=b.id;
    
    View created.
    
    The following query shows that view v_x_y depends on two tables, t_x and t_y.
    SQL>select * from user_dependencies where name='V_X_Y';
    
    NAME, TYPE, REFERENCED_OWNER, REFERENCED_NAME, REFERENCED_TYPE, 
    REFERENCED_LINK_NAME, SCHEMAID, DEPENDENCY_TYPE
    V_X_Y VIEW DMUSER T_X TABLE  116 HARD
    V_X_Y VIEW DMUSER T_Y TABLE  116 HARD
    
    We create another view v_x_y2 based on view v_x_y.
    SQL> create view v_x_y2 as select * from v_x_y;
    
    The following query shows that V_X_Y2 depends on V_X_Y.
    SQL> select * from user_dependencies where name='V_X_Y2';
    
    NAME, TYPE, REFERENCED_OWNER, REFERENCED_NAME, REFERENCED_TYPE, 
    REFERENCED_LINK_NAME, SCHEMAID, DEPENDENCY_TYPE
    V_X_Y2 VIEW DMUSER V_X_Y VIEW  116 HARD
    
    We create a materialized view based on two tables.
    SQL> create materialized view mv_x_y as select a.*, b.val val2 
    from t_x a, t_y b where a.id=b.id;
    SQL> select * from user_dependencies where name='MV_X_Y';
    Materialized view created.
    
    The materialized view MV_X_Y depends on tables T_X ,T_Y and MV_X_Y.
    NAME, TYPE, REFERENCED_OWNER, REFERENCED_NAME, REFERENCED_TYPE, 
    REFERENCED_LINK_NAME, SCHEMAID, DEPENDENCY_TYPE
    MV_X_Y MATERIALIZED VIEW DMUSER T_X TABLE  116 REF
    MV_X_Y MATERIALIZED VIEW DMUSER T_Y TABLE  116 REF
    MV_X_Y MATERIALIZED VIEW DMUSER MV_X_Y TABLE  116 REF
    
    

    Monday, March 16, 2015

    Watch Out Materialized View Refresh Group

    In the post Refresh an Oracle Materialized View, we add a materialized view to a refresh group to be refreshed. When we drop the materialized view, it will be removed from the refresh group. Thus if we drop and then recreate the same materialized view, it will not be refreshed according to the schedule defined in the refresh group. We have to manually add the materialized view back to the refresh group. So I normally do not use refresh group. Instead, I create a procedure to refresh the materialized view using dbms_mview.refresh.

    create or replace procedure proc_all(md varchar2 default NULL)
    is
    begin
     dbms_mview.refresh('MV_A');
     dbms_mview.refresh('MV_B');
     dbms_mview.refresh('MV_C');
     dbms_mview.refresh('MV_D');
    end;
    
    Then I use DBMS_SCHEDULER to schedule the job.
    BEGIN
      DBMS_SCHEDULER.create_job (
        job_name        => 'my_proc',
        job_type        => 'PLSQL_BLOCK',
        job_action      => 'BEGIN proc_all; END;',
        start_date      => SYSTIMESTAMP,
        repeat_interval => 'freq=daily; byhour=19;',
        end_date        => NULL,
        enabled         => TRUE,
        comments        => 'daily job.');
    END;
    /
    

    Display Date Type in Oracle SQLPLUS

    To display the date type in the desired format, we can manually format the date using to_char() function. Or, we use "alter session set NLS_DATE_FORMAT" to set the environment variable and the format will be applied to all date columns in any queries.

    SQL> select sysdate from dual;
    
    SYSDATE
    ---------
    16-MAR-15
    
    SQL> select to_char(sysdate,'YYYYMMDD:HH24:MI:SS') from dual;
    
    TO_CHAR(SYSDATE,'
    -----------------
    20150316:07:16:02
    
    SQL> alter session set NLS_DATE_FORMAT = 'YYYYMMDD:HH24:MI:SS';
    
    Session altered.
    
    SQL> select sysdate from dual;
    
    SYSDATE
    -----------------
    20150316:07:16:31
    

    Sunday, March 15, 2015

    Replace Consecutive Blanks With a Single Blank (Continued)

    In an earlier post Replace Consecutive Blanks With a Single Blank, we use SQL regexp_replace() to replace consecutive blanks with a single blank. We may also use Linux command tr to do the same before we load a text file into a database.

    $ cat x.txt
    hello    world!
    This   is a  test .
    
    Using "tr -s ' '" command to squeeze consecutive blanks into one.
    $ cat x.txt | tr -s ' '
    hello world!
    This is a test .
    
    Many Linux/Unix commands are very useful to clean up text files. I am using Windows but I installed open source Cygwin that provides a Linux environment. In another example Calculate Histogram for a Text File, we calculate the histogram using Linux commands.

    Saturday, March 14, 2015

    Import XML File Into Oracle

    We can import XML files into an Oracle database schema. First, we use BFILE data type to store the XML file name. Then we use xmltype() function to convert the file content into oracle XMLTYPE.

    create table test_docs (id number, text_doc bfile);
    truncate table test_docs;
    insert into test_docs values (1, BFILENAME('DM_DUMP', 'test.xml'));
    
    SQL> select xmltype(text_doc, nls_charset_id('AL32UTF8')) xml_doc from test_docs where id=1;
    We may also create a new table containing the xmltype data type column using CTAS (Create Table As Select) query as shown below.
    create table test_xml 
    as select id, xmltype(text_doc, nls_charset_id('AL32UTF8')) xml_doc 
    from test_docs;
    

    Thursday, March 12, 2015

    Fix Views That Stop Working

    A view is based on query against other database objects that may involve tables, views, database links, etc.. Sometimes when the underline objects changes, the view may become invalid. For example, a view is defined based on a remote table. When I dropped and recreate the database link pointing to the remote table, the querying against the view returns error.

    
    SQL*Plus: Release 11.2.0.1.0 Production on Sun Nov 30 10:09:48 2014
    
    Copyright (c) 1982, 2010, Oracle.  All rights reserved.
    
    
    Connected to:
    Oracle Database 10g Express Edition Release 10.2.0.1.0 - Production
    
    SQL>  select count(*) from V_CLAIM
    old   1: select count(*) from &1
    new   1: select count(*) from V_CLAIM
    select count(*) from V_CLAIM
                         *
    ERROR at line 1:
    ORA-04063: view "PROD.V_CLAIM" has errors
    
    I use dbms_utility.invalidate procedure to "fix" the view as shown below. First, we need to find the object_id for the view.
    SQL> select object_name, object_id from user_objects where object_name='V_CLAIM';
    OBJECT_NAME                               OBJECT_ID
    ---------------------------------------- ----------
    V_CLAIM                                   16995
    
    SQL> exec dbms_utility.invalidate(16995);
    
    PL/SQL procedure successfully completed.
    
    SQL> @ct V_SH_FH_CLAIM
    old   1: select count(*) from &1
    new   1: select count(*) from V_CLAIM
    
      COUNT(*)
    ----------
         24782
    

    Move Large Amount of Data Using Insert Into

    When we insert many records into a table, it is faster to use hint /*+ append */. This way, Oracle will use direct path load to insert large amount of data that is much faster than conventional insert. As we see from the example below, when /*+ apeend */ hint is used, insert only takes 0.18 second vs 0.49 with conventional insert.

    SQL> insert /*+ append */ into tbl_y select * from TBL_CLAIM_ALL@dl_apex2;
    
    1031 rows created.
    
    Elapsed: 00:00:00.18
    SQL> commit;
    
    Commit complete.
    
    Elapsed: 00:00:00.07
    SQL> insert into tbl_y select * from TBL_CLAIM_ALL@dl_apex2;
    
    1031 rows created.
    
    Elapsed: 00:00:00.49
    SQL> commit;
    

    Wednesday, March 11, 2015

    More on Get Source Code for Oracle Database Objects

    The following are the three ways to get the source code for Oracle database objects such as views, materialized views, procedures, functions, synonyms, database links etc.
    1. For views, we query user_views. This is described in Get the source code for Oracle database views and materialized view
    2. For materialized view, we query user_mviews. This is also described in Get the source code for Oracle database views and materialized view.
    3. For objects, including views and materialized views, we can use dbms_meta.get_ddl(). See Generate SQL Create Table/View Queries for Existing Tables/Views, SQL Scripts for Oracle Database Link and Hide PL/SQL Scripts For Function or Procedure For example, the following query returns the "create table" statement.

    SQL> select dbms_metadata.get_ddl('TABLE', 'TBL_A') from dual;
    
    DBMS_METADATA.GET_DDL('TABLE','TBL_A')
    --------------------------------------------------------------------------------
    
      CREATE TABLE "PROD"."TBL_A"
       (    "ID" NUMBER,
            "VALUE" VARCHAR2(64),
            "VALUE2" VARCHAR2(32)
       ) SEGMENT CREATION IMMEDIATE
      PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
      STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
      PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE DE
    FAULT CELL_FLASH_CACHE DEFAULT)
      TABLESPACE "PROD"
    

    Monday, March 09, 2015

    Terminate Oracle Session on Amazon RDS

    To kill an Oracle session on Amazon RDS instance, in stead of using "alter system kill session", we use the procedure provided by Amazon rdsadmin.rdsadmin_util.kill(sid, serial#).
    First, we find out the sid and serial# for the session that we want to terminated. We log in as the master user of an Amazon RDS Oracle instance.
    SQL> select username, sid, serial#, command, status from v$session;
    
    USERNAME                              SID    SERIAL#    COMMAND STATUS
    ------------------------------ ---------- ---------- ---------- --------
    TESTPROD723                            12      56369          3 ACTIVE
                                           13         17          0 ACTIVE
    RDSADMIN                               15         27          0 INACTIVE
    TEST1001                               21      12787          0 INACTIVE
                                           28      50603          0 ACTIVE
                                          609          1          0 ACTIVE
                                          610          1          0 ACTIVE
                                          611          1          0 ACTIVE
                                          612          1          0 ACTIVE
                                          613          1          0 ACTIVE
                                          614          1          0 ACTIVE
                                          615          1          0 ACTIVE
                                          616        103          0 ACTIVE
                                          618          3          0 ACTIVE
                                          619          7          0 ACTIVE
                                          622          1          0 ACTIVE
                                          623          3          0 ACTIVE
                                          625        131          0 ACTIVE
    TEST3                                 627      35083         74 ACTIVE
    
    29 rows selected.
    
    We are not allowed to run "alter system kill session".
    SQL> alter system kill session '627,35083';
    alter system kill session '627,35083'
    *
    ERROR at line 1:
    ORA-01031: insufficient privileges
    
    We can terminate a session using rdsadmin.rdsadmin_util.kill
    SQL> exec rdsadmin.rdsadmin_util.kill(627,35083);
    
    PL/SQL procedure successfully completed.
    
    

    Debug Oracle PL/SQL scripts

    To debug PL/SQL scripts, I have found two commands are extremely helpful, "set echo on" and "show errors". "Set echo on" prints the scripts in a file along with the line number. "show errors" displays the compilation errors. By combining the outputs from the above two commands, we can easily identify where the problems are and fix them. When we are done debugging, we can turn off the display using "set echo off". In the example below, I put my "create or replace procedure" in a script file called debug_proc.sql.

    SQL> set echo on
    SQL> @debug_proc.sql
    SQL> create or replace procedure
      2  proc_test_ins(
      3  p1 number,
      4  p2 date)
      5
      6  is
      7
      8  begin
      9
     10  insert into tbl_test_x(
     11  id, dt)
     12
     13  select
     14  p1, p2 from dua;
     15  commit;
     16
     17
     18  end;
     19
     20  /
    
    Warning: Procedure created with compilation errors.
    
    The "set echo on" show the line number for each line of scripts. To see the detailed compilation errors, we run "show errors" command.
    SQL> show errors
    Errors for PROCEDURE PROC_TEST_INS:
    
    LINE/COL
    --------
    ERROR
    --------------------------------------------------------------
    --------------------------------------------------------------
    ------------------------------------------------------------
    10/1
    PL/SQL: SQL Statement ignored
    
    14/13
    PL/SQL: ORA-00942: table or view does not exist
    
    As we see, there is a typo on line 14, "from dua" should be "from dual".
    SQL> create or replace procedure
      2  proc_test_ins(
      3  p1 number,
      4  p2 date)
      5
      6  is
      7
      8  begin
      9
     10  insert into tbl_test_x(
     11  id, dt)
     12
     13  select
     14  p1, p2 from dual;
     15
     16  commit;
     17
     18
     19  end;
     20
     21  /
    
    Procedure created.
    
    To turn off the display of the scripts and the line numbers, I run "set echo off".
    SQL> set echo off
    SQL> @debug_proc.sql
    
    Procedure created.
    
    Now we can run the procedure.
    SQL> exec proc_test_ins(1, sysdate);
    SQL> select * from  tbl_test_x;
    
         ID DT
    ------- ---------
          1 09-MAR-15
    

    Monday, February 09, 2015

    Refresh Materialized View Daily at Specific Time

    We can define a specific time of the day to refresh a materialized view. For example, the following query makes the existing materialized view to be refreshed immediately and then every day at 7pm.

    SQL> alter materialized view MV_NAME refresh start with sysdate next trunc(sysdate)+19/24;
    
    In the above SQL command, trunc(sysdate) makes the date start from the beginning of the day (midnight). The time is then added by 19/24 of the day. For Oracle date type, 1 unit is a day.

    Monday, January 19, 2015

    List Materialized Views In a Refresh Group

    To show the members in a materialized view refresh group, we can use the following query that takes advantage of DBA_RGROUP and DBA_RCHILD. DBA_RGROUP includes all refresh groups. DBA_RCHILD contains all the children in any refresh group.

    SQL>  select r.owner, r.name gp_name, c.name mv_name 
    from DBA_RCHILD c, DBA_RGROUP r 
    where c.owner=r.owner and c.REFGROUP=r.REFGROUP 
    order by r.owner, r.name, c.name;
    
    OWNER                          GP_NAME                        MV_NAME
    ------------------------------ ------------------------------ ------------------------------
    TEST_COMPS                     MV_GRP                         MV_SH_ABC
    TEST_COMPS                     MV_GRP                         MV_SH_DEF
    TEST_COMPS                     MV_GRP                         MV_SH_TEST
    

    Hide PL/SQL Scripts For Function or Procedure

    To protect the intellectual properties or for security reasons, we may want to hide the source PL/SQL code. To do this, we use "wrap" command to translate the PL/SQL script files into a format that can not be understood by human beings but can be processed by the database.
    For example we have the following script file add_func.sql.

    $ cat add_func.sql
    create or replace function add2 (
      a number,
      b number
    )
    return number
    is
    begin
    return a+b;
    end;
    /
    
    We run wrap to convert it into a plb file that is not understandable by human beings.
    oradba@bdm64-PC ~/projects_c/sql
    $ wrap iname=add_func.sql
    
    Above command generates the following file, add_func.plb. The following are its content.
    create or replace function add2 wrapped
    a000000
    b2
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    8
    52 7d
    bWldIYVqPP9njmOrnggm/xpHL1wwg8eZgcfLCNL+XlpZoX8JUI8JabjDpZmBMsCyJfvCkA5+
    0S5E4sqxyFDKJHWpF3yctwqYsCxZtMwF+uzZPXKV7Pumreu29A==
    
    We connect to the database and run the file add_func.plb to create the function just like running a regular PL/SQL script file.
    SQL> @add_func.plb
    
    The following queries show that the source code for this function is not readable. This is exactly what we want.
    SQL> select text from user_source where name='ADD2';
    
    TEXT
    ----------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------
    ------------------------------------------------------------
    function add2 wrapped
    a000000
    b2
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    8
    52 7d
    bWldIYVqPP9njmOrnggm/xpHL1wwg8eZgcfLCNL+XlpZoX8JUI8JabjDpZmBMsCyJfvCkA5+
    0S5E4sqxyFDKJHWpF3yctwqYsCxZtMwF+uzZPXKV7Pumreu29A==
    
    
    SQL> select dbms_metadata.get_ddl('FUNCTION', 'ADD2') from dual;
    
    DBMS_METADATA.GET_DDL('FUNCTION','ADD2')
    --------------------------------------------------------------------------------
    
      CREATE OR REPLACE FUNCTION "PROD"."ADD2" wrapped
    a000000
    b2
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    abcd
    8
    52 7d
    bWldIYVqPP9njmOrnggm/xpHL1wwg8eZgcfLCNL+XlpZoX8JUI8JabjDpZmBMsCyJfvCkA5+
    0S5E4sqxyFDKJHWpF3yctwqYsCxZtMwF+uzZPXKV7Pumreu29A==
    
    

    Monday, December 15, 2014

    Replace Consecutive Blanks With a Single Blank

    We can use regexp_replace() function to replace multiple consecutive blanks with a single blank. For example, we have the following table.

    SQL> select * from tbl_text;
    
            ID VAL
    ---------- --------------------------------
             1 hello     world !
             1 how   are     you doing
    
    
    In the following query, regexp_replace() finds patterns that have 2 or more spaces and replace them with single blank.
    SQL> select val, regexp_replace(val, '[[:space:]]{2,}',' ' ') val2 from tbl_text;
    
    VAL                              VAL2
    -------------------------------- --------------------------------
    hello     world !                hello world !
    how   are     you doing          how are you doing
    

    Friday, December 12, 2014

    Insert a Record into a Table Every Second

    We can using Oracle function dbms_lock.sleep(n_seconds) to run queries at certain interval, such as inserting data into a table every second. Function dbms_lock.sleep(n_seconds) will suspend the session for n_seconds. To illustrate this function, we insert the system time into a table every second 20 times.

    SQL> create table tbl_x (id number, dt date);
    
    SQL> begin
    for i in 1..20 loop
    insert into tbl_x values(i, sysdate);
    dbms_lock.sleep(1);
    end loop;
    commit;
    end;
    SQL> /
    
    SQL> select * from tbl_x;
    
     ID DT
    ---------- -------------------
      1 12-12-2014 12:44:41
      2 12-12-2014 12:44:42
      3 12-12-2014 12:44:43
      4 12-12-2014 12:44:44
      5 12-12-2014 12:44:45
      6 12-12-2014 12:44:46
      7 12-12-2014 12:44:47
      8 12-12-2014 12:44:48
      9 12-12-2014 12:44:49
     10 12-12-2014 12:44:50
     11 12-12-2014 12:44:51
     12 12-12-2014 12:44:52
     13 12-12-2014 12:44:53
     14 12-12-2014 12:44:54
     15 12-12-2014 12:44:55
     16 12-12-2014 12:44:56
     17 12-12-2014 12:44:57
     18 12-12-2014 12:44:58
     19 12-12-2014 12:44:59
     20 12-12-2014 12:45:00
    
    20 rows selected.
    
    
    As it is shown above, records are inserted every 1 second.

    Wednesday, December 10, 2014

    Oracle User Privileges for Data Analysts

    As data analysts, we often perform various data work within databases such as Oracle. I have found the following user privileges are very helpful. If possible, database administrators should grant us these privileges.

    
    PRIVILEGE
    ----------------------------------------
    CREATE TABLE
    CREATE SYNONYM
    CREATE VIEW
    CREATE SEQUENCE
    CREATE DATABASE LINK
    CREATE PROCEDURE
    CREATE MATERIALIZED VIEW
    CREATE ANY DIRECTORY
    CREATE JOB
    CREATE MINING MODEL
    

    Monday, December 08, 2014

    Create Views Based on Nonexistent Tables

    In Oracle, it is possible to create views first based on nonexistent tables. After views are created, we can create the underlying tables and the view will work.
    To create a view based on tables that do not exist, we using "create force view" as shown below.

    SQL> create force view v_tbla as select * from tbla;
    Warning: View created with compilation errors.
    
    It is OK that we got the compilation errors. The key word "force" is necessary. Without it, the view will not be created.
    SQL> create view v_tbla as select * from tbla;
    create view v_tbla as select * from tbla
                                        *
    ERROR at line 1:
    ORA-00942: table or view does not exist
    
    If we query this view, we will get an error message. This is fine. The error message will disappear after we create the underlying table tbla.
    SQL> select * from v_tbla;
    select * from v_tbla
                  *
    ERROR at line 1:
    ORA-04063: view "DMUSER.V_TBLA" has errors
    
    Now we create table that the view is based on and populate it with data. As we see, we can query the view!
    SQL> create table tbla (id number, val varchar2(32));
    Table created.
    
    SQL> insert into tbla values (1,'hello');
    1 row created.
    
    SQL> select * from tbla;
    
     ID VAL
    ---------- --------------------------------
      1 hello
    
    SQL> select * from v_tbla;
    
     ID VAL
    ---------- --------------------------------
      1 hello
    
    

    Display Long CLOB Text in Sqlplus

    In the following example, the display of CLOB text is truncated.

    SQL> create table tbl_test (id number, val clob);
    
    Table created.
    
    SQL> insert into tbl_test values(1,'this is a test.ddlkjdfklj dflkjdlkj 
     jdfhi  dfdlkjflkdlkoieuiooiop jdofoidu  baskjhfp  dfdk dkfiegps 
      dfdalkfaidel hdlfjdlflkd dafjdflkdi igod kdigl dfodfud');
    
    1 row created.
    
    SQL> select * from tbl_test;
    
            ID
    ----------
    VAL
    --------------------------------------------------------------------------------
             1
    this is a test.ddlkjdfklj  dflkjdlkj  jdfhi  dfdlkjflkdlkoieuiooiop 
    jdofoidu  ba
    
    
    We can use "set long" to specify how many bytes of CLOB to display. In the following example, "set long 2000" tells sqlplus to display up to 2000 bytes of CLOB text.
    SQL> set long 2000
    SQL> select * from tbl_test;
    
            ID
    ----------
    VAL
    --------------------------------------------------------------------------------
             1
    this is a test.ddlkjdfklj  dflkjdlkj  jdfhi  dfdlkjflkdlkoieuiooiop 
    jdofoidu  baskjhfp  dfdk  dkfiegps   dfdalkfaidel hdlfjdlflkd dafjdflkdi
     igod kdigl dfodfud 
    

    Sunday, December 07, 2014

    Dump the Oracle Schema

    We can use Oracle data pump utility to dump the whole schema or selected database objects. The following command dump the schema.

    $ expdp prod/XXXX schemas=prod directory=ORA_DATA_EBS_BK dumpfile=prod_bk.dmp logfile=prod_bk.log
    
    The above command creates two files under directory ORA_DATA_EBS_BK.
    -rw-r--r-- 1 oracle dba       5129 Dec  7 08:07 prod_bk.log
    -rw-r----- 1 oracle dba    2650112 Dec  7 08:07 prod_bk.dmp
    
    We can import them back into a database using impdp command.
    $ impdp prod/XXXX schemas=prod directory=ORA_DATA_EBS_BK dumpfile=prod_bk.dmp
    

    Thursday, November 20, 2014

    SQL Scripts for Oracle Database Link

    In posts Get the source code for Oracle database views and materialized view and Get source code for Oracle database objects: user_source view, we show how to get the SQL scripts that create some database objects.

    To get the SQL scripts for existing database links, we can take advantage of dbms_metadata.get_ddl().The following query returns the database links created.

    SQL> select  owner, db_link from all_db_links order by owner, db_link;
    
    OWNER                          DB_LINK
    ------------------------------ ----------------
    PROD                           DL_DATA
    PROD                           DL_EC1
    
    To get the original SQL scripts that create these database links, we use dbms_metadata.get_ddl() as shown below.
    SQL> select dbms_metadata.get_ddl('DB_LINK', 'DL_EC1') from dual;
    
    DBMS_METADATA.GET_DDL('DB_LINK','DL_EC1')
    --------------------------------------------------------------------------------
    
      CREATE DATABASE LINK "DL_EC1"    USING '(description=(address=(protocol=TCP)
    (host=ec2-xx-xx-xxx-178.compute-1.amazonaws.com)(port=1521))
    (connect_data=(sid=XE)))'
    

    Tuesday, November 18, 2014

    Drop User in Oracle

    Post Drop All Objects In an Oracle Schema uses PL/SQL to iteratively drop database objects. We can simply drop a user and its database objects in Oracle using "drop user.. cascade". This provides an easy way to clean up the database.

    We have to drop a user by connecting to another user. If we try to drop a user that we are currently connecting to, we will get error ORA-01940. The Sqlplus session is connected to user1.

    SQL> drop user user1 cascade;
    drop user user1 cascade
    *
    ERROR at line 1:
    ORA-01940: cannot drop a user that is currently connected
    
    The user that we are connecting to needs to have "drop user" privilege. In the following example, the Sqlplus session is connected to user2.
    SQL> drop user user1 cascade;
    rop user user1 cascade
    
    RROR at line 1:
    RA-01031: insufficient privileges
    
    We connected to sys and grant "drop user" privilege to user2.
    SQL> grant drop user to user2;
    
    Grant succeeded.
    
    Now we connect to user2 and are able to drop user1
    SQL> drop user user1 cascade;
    
    User dropped.
    

    Monday, October 06, 2014

    Five Trends in Big Data Analytics

    Big data are described as having big volume, complex structures and being updated frequently. Analytics, the technology that extract meaningful information from the "raw" data to support decision, is the ultimate driver of the value of the big data. After extensive research, deep-data-mining.com has identified the following five trends in big data analytics.

    1. SQL Based In-Database Analytics

    Data analytics functions are built within relational database engine. Users take advantage of SQL to perform data mining/predictive analytics task. All processes including data extraction, data preparation, predictive model building and validation, and model deployment are done within the database. SQL based in-database analytics will become a trend, particularly in the enterprise environment. This observation is based on the following facts. Most of the core enterprise data are stored in relationship databases. A lot of business logic that support the daily operation of an enterprises are written in SQL. There are huge number of SQL developers around the world and many of who will be able to perform data analytics task without learning other scripting languages.

    2. Apache Spark

    According to The Apache Software Foundation, Apache is a fast and general engine for large-scale data processing. Apache Spark runs programs much faster. Its machine learning library MLlib include SVM, logistic regression, decision tree and k-means clustering, etc.

    3. the Proliferation R Language

    R is a statistical analysis language that is much more "natural" to data scientists than other programming languages such as Java, C or SQL. For example, the R scripts to manipulate vector/matrix or build predictive models are very similar to the mathematics equations found in textbooks. For example, the following R script builds a logistic regression model that predicts y based a, b and c.
      glm(y ~ a+ b + c, data = trainset, family = binomial(link = "logit"))
    
    There is a big community of R users who develop R algorithms and share them in the format of R packages. Thus, we can find almost all data analytics algorithms in R packages.

    4. Real Time, In Memory Data Analytics

    Traditionally, raw data are that are collected in real time in computer memory but are transformed and loaded into disc-based data warehouse periodically. The data are analyzed offline in a delayed fashion. For example, it is not unusual for a large enterprise to take weeks, if not months, to build a predictive model, test and eventually deploy it. Due to the slowness in identifying new useful patterns in the data, opportunities are lost in the case of new sales or risks are realized in the case of fraud prevention. Thus, there will be a trend to shift traditional offline, disk-based data analytics to online, in-mermory, real time environment.

    5. Innovative Data Analytics Applications

    As we know, the ultimate purpose of big data is to provide data-driven decision support to solve problems. There will be more and more innovative applications of big data analytics. For example, polices will use model to predict potential repeated criminals. Colleges predict in advance if a student will choose to drop from the school based his background and current situation. Human resource department in a large company can design the best career paths for its employees using models. Applications of data analytics is unlimited.

    Sunday, September 28, 2014

    Update Multiple Columns In a Table

    In the post Update a Table Based on Another Table, we showed how to update a column in a table based on the result from a select query. To update multiple columns at the same time, we simply specify multiple columns after "set" and "select".

    SQL> select * from tbl_a order by id;
    
            ID VALUE            VALUE2
    ---------- ---------------- ----------------
             1 A
             2 B
             3 C
             4 D
    
    SQL> update tbl_a a set (value, value2) 
                            = ( select a.value||'_1', a.value||'_2' 
                                from tbl_a b where a.id=b.id);
    
    4 rows updated.
    
    SQL> select * from tbl_a order by id;
    
            ID VALUE            VALUE2
    ---------- ---------------- ----------------
             1 A_1              A_2
             2 B_1              B_2
             3 C_1              C_2
             4 D_1              D_2
    

    Update a Table Based on Another Table

    It is a common task to update records in one table based on those in another table. We can use "update" query to do this. Suppose we have two tables as shown below.

    SQL> select * from tbl_1 order by id;
    
            ID VALUE
    ---------- --------------------------------
             1 A
             2 X
             3 Y
             4 D
    
    SQL> select * from tbl_2 order by id;
    
            ID VALUE
    ---------- --------------------------------
             2 B
             3 C
    
    We want to update the val in tbl_1 with the value in tbl_2 based on id. The desirable result for tbl_1 after the update will be:
           ID VALUE
    ---------- --------------------------------
             1 A
             2 B
             3 C
             4 D
    
    One way to do this is the following query.
    
    SQL> update tbl_1 a 
    set value=(select value from tbl_2 b where a.id=b.id) 
    where exists ( select 1 from tbl_2 b where a.id=b.id);
    
    2 rows updated.
    
    SQL> select * from tbl_1 order by id;
    
            ID VAL
    ---------- --------------------------------
             1 A
             2 B
             3 C
             4 D
    
    Please notice in the update query, only 2 rows are updated due to the "where exists" clause. If we do not include the "where exists" clause, records whose id are not in tbl_2 will have null values as shown below.
    SQL> update tbl_1 a set val=(select val from tbl_2 b where a.id=b.id) ;
    
    4 rows updated.
    
    SQL> select * from tbl_1 order by id;
    
            ID VALUE
    ---------- --------------------------------
             1
             2 B
             3 C
             4
    
    In the above update query, 4, instead of 2, rows are updates due to the removal of the "where exists" clause. We can change the above update query without using "where exists" to make it work as the following.
    SQL> update tbl_1 a set val=nvl((select value from tbl_2 b where a.id=b.id),a.value) ;
    
    4 rows updated.
    
    SQL> select * from tbl_1 order by id;
    
            ID VALUE
    ---------- --------------------------------
             1 A
             2 B
             3 C
             4 D
    
    In the above update query, we added function nvl((select ...), a.value) which means if the select query returns no result (NULL), values will be set as the original values. All four rows are updated. The first update query with "where exists" clause is the best because it only update rows whose id in tbl_2.

    Saturday, September 13, 2014

    Insert Data into an Oracle View

    We are able to insert records into a simple view. By doing that, the data are actually inserted into the physical table that the view is based on. The following is an example.

    
    SQL> create table tbl_test (id number, value varchar2(64));
    
    Table created.
    
    SQL> create view v_tbl_test as select * from tbl_test;
    
    View created.
    
    SQL> insert into tbl_test values(1,'Hello');
    
    1 row created.
    
    SQL> insert into v_tbl_test values(2,'World');
    
    1 row created.
    
    SQL> commit;
    
    Commit complete.
    
    SQL> select * from tbl_test order by id;
    
            ID VALUE
    ---------- ----------------------------------------------------------------
             1 Hello
             2 World
    
    SQL> select * from v_tbl_test order by id;
    
            ID VALUE
    ---------- ----------------------------------------------------------------
             1 Hello
             2 World
    
    

    Wednesday, August 27, 2014

    Oracle Multiple Table Insert

    We can insert the selected results into multiple tables using one query "insert all".

    
    SQL> create table t2 (val number);
    
    Table created.
    
    SQL> create table t2 (val number);
    
    Table created.
    
    SQL> insert all into t1 into t2 select 1 from dual;
    
    2 rows created.
    
    SQL> select * from t1;
    
      VAL
    -----
        1
    
    SQL> select * from t2;
    
      VAL
    -----
        1
    
    However, multiple table insert does not work for remote tables. We can only insert data into one remote table at a time.
    SQL> insert all into t1 into t3@DL_ANOTHER_DB select 1 from dual;
    insert all into t3 into t2@DL_APEX select 1 from dual
                               *
    ERROR at line 1:
    ORA-02021: DDL operations are not allowed on a remote database
    
    SQL> insert  into t2@DL_APEX select 1 from dual;
    
    1 row created.
    

    Sunday, August 24, 2014

    Oracle Character Return in a String

    If a string is too long to display, we can put character returns using "chr(10)" as shown below. As you can see, we can format the long string better. When we generate a long message such as an email from PL/SQL, chr(0) is very helpful.

    SQL> select 'hello world!' as message from dual;
    
    MESSAGE
    ------------
    hello world!
    
    SQL> select 'hello'||chr(10)||'world!' as message from dual;
    
    MESSAGE
    ------------
    hello
    world!
    

    Schedule Oracle Procedure

    We may want to run our Oracle procedures on a regular basis. For example, every day at 10PM, I want to get new transactions from a database schema, calculate the risk score and insert the result into a table. I first created a procedure, proc_all, and then I use DBMS_SCHEDULER.create_job() to schedule the job. This procedure will run automatically every day at 10pm.

    BEGIN
      DBMS_SCHEDULER.create_job (
        job_name        => 'my_proc',
        job_type        => 'PLSQL_BLOCK',
        job_action      => 'BEGIN proc_all; END;',
        start_date      => SYSTIMESTAMP,
        repeat_interval => 'freq=daily; byhour=22;',
        end_date        => NULL,
        enabled         => TRUE,
        comments        => 'Job defined entirely by the CREATE JOB procedure.');
    END;
    /
    
    If we want to stop the job, we can drop it using DBMS_SCHEDULER.drop_job()
    begin DBMS_SCHEDULER.drop_job(job_name => 'my_proc'); end;
    
    We can query views DBA_SCHEDULER_JOBS or USER_SCHEDULER_JOBS to get the status of scheduled jobs.
    SQL> select JOB_NAME,JOB_ACTION from USER_SCHEDULER_JOBS;
    
    JOB_NAME                         JOB_ACTION
    -------------------------- ------------------------------------------------------------------
    MY_PROC                   BEGIN proc_all; END;
    PROC_CLEAR_TBL            BEGIN execute immediate 'truncate table tbl_tobe_alerted'; END;
    

    Friday, August 22, 2014

    Refresh an Oracle Materialized View

    An Oracle materialized view is a very useful feature in situations including data refresh on regular basis. When we create a materialized view, it takes a snapshot of query result and store it physically. This is different from a view where only query logic is stored with the view. Once the materialized view is created, its content does not change until it is refreshed. When we need to refresh the materialized view, one way to do it is to use dbms_mview.refresh().

    SQL> create materialized view mv_test_refresh as select sysdate dt 
         from dual;
    
    Materialized view created.
    SQL> select to_char(dt, 'YYYYMMDD:HH24:Mi:SS') from mv_test_refresh;
    
    TO_CHAR(DT,'YYYYM
    -----------------
    20140801:10:42:47
    
    SQL> exec dbms_mview.refresh('mv_test_refresh');
    
    PL/SQL procedure successfully completed.
    
    SQL> select to_char(dt, 'YYYYMMDD:HH24:Mi:SS') from mv_test_refresh;
    
    TO_CHAR(DT,'YYYYM
    -----------------
    20140801:10:50:31
    
    Alternatively, we can create a refresh group to include materialized views that we want to refresh. Then we can run dbms_refresh.refresh() to refresh all materialized views included in the group.
    SQL> exec dbms_refresh.make(name=>'test_grp', list=>'mv_test_refresh', 
    next_date=>sysdate, interval=>'null');
    
    SQL> exec dbms_refresh.refresh('test_grp');
    
    PL/SQL procedure successfully completed.
    
    SQL> select to_char(dt, 'YYYYMMDD:HH24:Mi:SS') from mv_test_refresh;
    
    TO_CHAR(DT,'YYYYM
    -----------------
    20140801:10:54:35
    
    We can also schedule when a materialized view will be refreshed automatically. That will be the topic of another post.

    Tuesday, July 22, 2014

    Get Current Date In Oracle and Server SQL

    To get the current system date, we use sysdate in Oracle and getdate() in SQL Server, respectively.

    Using Oracle sqlplus.

    SQL> select sysdate from dual;
    
    SYSDATE
    -------------------
    07-22-2014 07:21:21
    
    Using SQL Sever Management Studio.
    select getdate()
    2014-07-22 11:23:23.097
    

    Wednesday, July 16, 2014

    More on Business Rules vs Statistical Models

    In a post Predictive Modeling vs Intuitive Business Rules, we described advantages of data-driven models over intuitive rules. The limitations of our thinking are nicely described in a free book written by a former CIA veteran, Richards J. Heuer Psychology of Intelligence Analysis. It is one of my favorite books and I highly recommend it.I have found often we are too confident about our perception and intuition and not aware that we are wrong. (For example, some fraud detection analysts use her intuitions/experience to reject statistically derived optimal fraud scores). Data-driven statistical models are powerful tools to resolve the limitations in our mental processes.