Deep Data Mining Blog: March 2015

Thursday, March 26, 2015

Best Practice In Loading Text Files Into Oracle Database

The post Five Ways of Loading Text Files Into Oracle Database shows common ways of loading test files. There are many issues when loading text files into an Oracle database. These issues include:

Data format conversions such as strings to numbers and strings to dates
NULl value representation in text files. NUll are ., n/a, null, blanks, etc.?
Other data representation issues such as $ in front of amount.

Thus it is important to find out effective ways to handle those data issues and make sure data are indeed loaded correctly. For many projects, I have used the following process of importing text files into an Oracle database.

Step 1. Define external tables pointing to the text files. In the external tables, we simply define all columns as string (varchar2). Creating an external table does not physically load the data. It just contains the "pointer" to the files. We can run SQL queries against an external table just like a regular table.
Step 2. Perform data cleansing and conversion using SQL functions and store the data into tables or materialized views.

For example, the following a few lines of text files to be loaded.

10,"ABASTECEDORA NAVAL Y INDUSTRIAL, S.A.",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
15,"ABDELNUR, Nury de Jesus","individual","CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
25,"ACEFROSTY SHIPPING CO., LTD.",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
36,"AEROCARIBBEAN AIRLINES",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
39,"AEROTAXI EJECUTIVO, S.A.",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-
41,"AGENCIA DE VIAJES GUAMA",-0- ,"CUBA",-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0- ,-0-

First, we create an external table as shown below. All columns are simply defined as varchar2 as we will do the data conversion later.

create directory dir_files as '/home/data';
create table tbl_sdn_ext(
  v1  varchar2(512),
  v2  varchar2(512),
  v3  varchar2(512),
  v4  varchar2(512),
  v5  varchar2(512),
  v6  varchar2(512),
  v7  varchar2(512),
  v8  varchar2(512),
  v9  varchar2(512),
  v10  varchar2(512)
)
organization external
( type oracle_loader
default directory DIR_ORA_EXT
access parameters
( records delimited by newline
skip 1
fields terminated by ','
optionally enclosed by '"'
missing field values are null
)
location('sdn.csv')
);

Once we have the external table, we can apply SQL functions to do various data conversions and store the results into tables physically as shown below.

create table tbl_sdn2
as
with tbl as
(
select
case when trim(V1)='-0-' then null else upper(trim(V1)) end V1,
case when trim(V2)='-0-' then null else upper(trim(V2)) end V2,
case when trim(V3)='-0-' then null else upper(trim(V3)) end V3,
case when trim(V4)='-0-' then null else upper(trim(V4)) end V4,
case when trim(V5)='-0-' then null else upper(trim(V5)) end V5,
case when trim(V6)='-0-' then null else upper(trim(V6)) end V6,
case when trim(V7)='-0-' then null else upper(trim(V7)) end V7,
case when trim(V8)='-0-' then null else upper(trim(V8)) end V8,
case when trim(V9)='-0-' then null else upper(trim(V9)) end V9,
case when trim(V10)='-0-' then null else upper(trim(V10)) end V10
from tbl_sdn_ext
)
select
v1 sdn_id,
v2 full_name,
trim(substr(v2, 1, regexp_instr(v2,',')-1)) lnm,
trim(substr(v2, regexp_instr(v2,',')+1, length(v2))) fnm,
v3 dtype,
v4 country
from
tbl;

Tuesday, March 24, 2015

Most Frequently Used SQL Functions by Data Scientist

As a data scientist, I have done many projects for banks, insurance companies, marketing firms, etc. Most of the projects involve building predictive models. I have won three head to head competitions for building best predictive models.

In a typical project, I receive the data, usually in the format for test files, from a client. Then I load them into Oracle databases. I do most of the data manipulation in the database using SQL. As a result of years of work, I have created close to 800 SQL script files under my work directories on my computers. I figure out it would be interesting to see what are the most frequently used SQL functions from a data scientist's perspective. So I perform a simple analysis on my SQL scripts. I have found that the following 45 functions accounts for about 65% of Oracle SQL functions that I have used for various projects. The following table shows ranks of frequencies of those functions as they present in my SQL scripts. As we may see, a typical data analytics project only involves a relatively small number of frequently used SQL functions.

We may have the following observations: 1. Many of the functions are used for data cleanse and data conversion including trim, nvl, decode,to_char, to_number, trunc, substr,etc. This precisely reflects the fact that at least 80% of the work is about data manipulation. 2. Some simple statistics functions include sum, count, min, max, stddev, corr, median. It is convenient to calculate useful statistics using SQL. We may use those SQL functions to produce reports similar to that by SAS UNIVARIATE.

Rank	SQL Function	Frequency
1	trim	10.03%
2	sum	8.22%
3	nvl	5.93%
4	count	5.67%
5	decode	5.38%
6	to_char	4.63%
7	min	3.53%
8	max	3.33%
9	to_number	2.25%
10	trunc	2.22%
11	row_number	1.97%
12	avg	1.88%
13	round	1.33%
14	to_date	1.20%
15	substr	1.07%
16	sign	0.93%
17	upper	0.77%
19	stddev	0.45%
20	mod	0.35%
21	lower	0.30%
22	exp	0.28%
23	ratio_to_report	0.27%
24	median	0.25%
25	abs	0.23%
26	lag	0.23%
27	dense_rank	0.22%
28	length	0.22%
29	greatest	0.18%
30	sys_guid	0.15%
31	ceil	0.15%
32	chr	0.13%
33	regexp_instr	0.10%
35	percent_rank	0.10%
36	replace	0.10%
37	regexp_replace	0.10%
38	cume_dist	0.08%
39	instr	0.07%
40	rtrim	0.07%
41	width_bucket	0.07%
42	prediction	0.07%
43	corr	0.05%
44	lead	0.03%
45	lpad	0.03%

Sunday, March 22, 2015

The 2015 Rexer Analytics Data Miner Survey has launched!

The 2015 Rexer Analytics Data Miner Survey has launched. The founder of Rexer Analytics, Karl Rexer, is regarded by Deep-data-mining.com as one of 10 Most Influential People in Data Analytics.

Analytic Professionals — Share your views: Participate in the 2015 Data Miner Survey

Data Analysts, Predictive Modelers, Data Scientists, Data Miners, and all other types of analytic professionals, students, and academics: Please participate in the 2015 Rexer Analytics Data Miner Survey.

Survey Link: www.rexeranalytics.com/Data-Miner-Survey-2015-Intro.html
Access Code: CL72L4 — It is OK to share this Access Code with others: It can be used by multiple people.
Survey results will be unveiled at the Fall-2015 Boston Predictive Analytics World event.

Rexer Analytics has been conducting the Data Miner Survey since 2007. Each survey explores the analytic behaviors, views and preferences of data miners and analytic professionals. Over 1200 people from around the globe participated in the 2013 survey. Summary reports (40 page PDFs) from previous surveys are available FREE to everyone who requests them by emailing DataMinerSurvey@RexerAnalytics.com. Also, highlights of earlier Data Miner Surveys are available at www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html, including best practices shared by respondents on analytic success measurement, overcoming data mining challenges, and other topics. The FREE Summary Report for this 2015 Data Miner Survey will be available to everyone Fall-2015.

Please tell other data analysis professionals about the survey.

Rexer Analytics is a consulting firm focused on providing data mining and analytic CRM solutions. Recent solutions include customer loyalty analyses, customer segmentation, predictive modeling to predict customer attrition and to target direct marketing, fraud detection, sales forecasting, market basket analyses, and complex survey research. More information is available at www.RexerAnalytics.com or by calling +1 617-233-8185.

Saturday, March 21, 2015

Find Out Database Object Dependencies

Some database objects are depending other objects. For example, a view may query other tables or views. We can query user_dependencies to find out these dependencies.

SQL> create view v_x_y as select a.*, b.val val2 from t_x a, t_y b 
where a.id=b.id;

View created.

The following query shows that view v_x_y depends on two tables, t_x and t_y.

SQL>select * from user_dependencies where name='V_X_Y';

NAME, TYPE, REFERENCED_OWNER, REFERENCED_NAME, REFERENCED_TYPE, 
REFERENCED_LINK_NAME, SCHEMAID, DEPENDENCY_TYPE
V_X_Y VIEW DMUSER T_X TABLE  116 HARD
V_X_Y VIEW DMUSER T_Y TABLE  116 HARD

We create another view v_x_y2 based on view v_x_y.

SQL> create view v_x_y2 as select * from v_x_y;

The following query shows that V_X_Y2 depends on V_X_Y.

SQL> select * from user_dependencies where name='V_X_Y2';

NAME, TYPE, REFERENCED_OWNER, REFERENCED_NAME, REFERENCED_TYPE, 
REFERENCED_LINK_NAME, SCHEMAID, DEPENDENCY_TYPE
V_X_Y2 VIEW DMUSER V_X_Y VIEW  116 HARD

We create a materialized view based on two tables.

SQL> create materialized view mv_x_y as select a.*, b.val val2 
from t_x a, t_y b where a.id=b.id;
SQL> select * from user_dependencies where name='MV_X_Y';
Materialized view created.

The materialized view MV_X_Y depends on tables T_X ,T_Y and MV_X_Y.

NAME, TYPE, REFERENCED_OWNER, REFERENCED_NAME, REFERENCED_TYPE, 
REFERENCED_LINK_NAME, SCHEMAID, DEPENDENCY_TYPE
MV_X_Y MATERIALIZED VIEW DMUSER T_X TABLE  116 REF
MV_X_Y MATERIALIZED VIEW DMUSER T_Y TABLE  116 REF
MV_X_Y MATERIALIZED VIEW DMUSER MV_X_Y TABLE  116 REF

Monday, March 16, 2015

Watch Out Materialized View Refresh Group

In the post Refresh an Oracle Materialized View, we add a materialized view to a refresh group to be refreshed. When we drop the materialized view, it will be removed from the refresh group. Thus if we drop and then recreate the same materialized view, it will not be refreshed according to the schedule defined in the refresh group. We have to manually add the materialized view back to the refresh group. So I normally do not use refresh group. Instead, I create a procedure to refresh the materialized view using dbms_mview.refresh.

create or replace procedure proc_all(md varchar2 default NULL)
is
begin
 dbms_mview.refresh('MV_A');
 dbms_mview.refresh('MV_B');
 dbms_mview.refresh('MV_C');
 dbms_mview.refresh('MV_D');
end;

Then I use DBMS_SCHEDULER to schedule the job.

BEGIN
  DBMS_SCHEDULER.create_job (
    job_name        => 'my_proc',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN proc_all; END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'freq=daily; byhour=19;',
    end_date        => NULL,
    enabled         => TRUE,
    comments        => 'daily job.');
END;
/

Display Date Type in Oracle SQLPLUS

To display the date type in the desired format, we can manually format the date using to_char() function. Or, we use "alter session set NLS_DATE_FORMAT" to set the environment variable and the format will be applied to all date columns in any queries.

SQL> select sysdate from dual;

SYSDATE
---------
16-MAR-15

SQL> select to_char(sysdate,'YYYYMMDD:HH24:MI:SS') from dual;

TO_CHAR(SYSDATE,'
-----------------
20150316:07:16:02

SQL> alter session set NLS_DATE_FORMAT = 'YYYYMMDD:HH24:MI:SS';

Session altered.

SQL> select sysdate from dual;

SYSDATE
-----------------
20150316:07:16:31

Sunday, March 15, 2015

Replace Consecutive Blanks With a Single Blank (Continued)

In an earlier post Replace Consecutive Blanks With a Single Blank, we use SQL regexp_replace() to replace consecutive blanks with a single blank. We may also use Linux command tr to do the same before we load a text file into a database.

$ cat x.txt
hello    world!
This   is a  test .

Using "tr -s ' '" command to squeeze consecutive blanks into one.

$ cat x.txt | tr -s ' '
hello world!
This is a test .

Many Linux/Unix commands are very useful to clean up text files. I am using Windows but I installed open source Cygwin that provides a Linux environment. In another example Calculate Histogram for a Text File, we calculate the histogram using Linux commands.

Saturday, March 14, 2015

Import XML File Into Oracle

We can import XML files into an Oracle database schema. First, we use BFILE data type to store the XML file name. Then we use xmltype() function to convert the file content into oracle XMLTYPE.

create table test_docs (id number, text_doc bfile);
truncate table test_docs;
insert into test_docs values (1, BFILENAME('DM_DUMP', 'test.xml'));

SQL> select xmltype(text_doc, nls_charset_id('AL32UTF8')) xml_doc from test_docs where id=1;

We may also create a new table containing the xmltype data type column using CTAS (Create Table As Select) query as shown below.

create table test_xml 
as select id, xmltype(text_doc, nls_charset_id('AL32UTF8')) xml_doc 
from test_docs;

Thursday, March 12, 2015

Fix Views That Stop Working

A view is based on query against other database objects that may involve tables, views, database links, etc.. Sometimes when the underline objects changes, the view may become invalid. For example, a view is defined based on a remote table. When I dropped and recreate the database link pointing to the remote table, the querying against the view returns error.


SQL*Plus: Release 11.2.0.1.0 Production on Sun Nov 30 10:09:48 2014

Copyright (c) 1982, 2010, Oracle.  All rights reserved.


Connected to:
Oracle Database 10g Express Edition Release 10.2.0.1.0 - Production

SQL>  select count(*) from V_CLAIM
old   1: select count(*) from &1
new   1: select count(*) from V_CLAIM
select count(*) from V_CLAIM
                     *
ERROR at line 1:
ORA-04063: view "PROD.V_CLAIM" has errors

I use dbms_utility.invalidate procedure to "fix" the view as shown below. First, we need to find the object_id for the view.

SQL> select object_name, object_id from user_objects where object_name='V_CLAIM';
OBJECT_NAME                               OBJECT_ID
---------------------------------------- ----------
V_CLAIM                                   16995

SQL> exec dbms_utility.invalidate(16995);

PL/SQL procedure successfully completed.

SQL> @ct V_SH_FH_CLAIM
old   1: select count(*) from &1
new   1: select count(*) from V_CLAIM

  COUNT(*)
----------
     24782

Move Large Amount of Data Using Insert Into

When we insert many records into a table, it is faster to use hint /*+ append */. This way, Oracle will use direct path load to insert large amount of data that is much faster than conventional insert. As we see from the example below, when /*+ apeend */ hint is used, insert only takes 0.18 second vs 0.49 with conventional insert.

SQL> insert /*+ append */ into tbl_y select * from TBL_CLAIM_ALL@dl_apex2;

1031 rows created.

Elapsed: 00:00:00.18
SQL> commit;

Commit complete.

Elapsed: 00:00:00.07
SQL> insert into tbl_y select * from TBL_CLAIM_ALL@dl_apex2;

1031 rows created.

Elapsed: 00:00:00.49
SQL> commit;

Wednesday, March 11, 2015

More on Get Source Code for Oracle Database Objects

The following are the three ways to get the source code for Oracle database objects such as views, materialized views, procedures, functions, synonyms, database links etc.
1. For views, we query user_views. This is described in Get the source code for Oracle database views and materialized view
2. For materialized view, we query user_mviews. This is also described in Get the source code for Oracle database views and materialized view.
3. For objects, including views and materialized views, we can use dbms_meta.get_ddl(). See Generate SQL Create Table/View Queries for Existing Tables/Views, SQL Scripts for Oracle Database Link and Hide PL/SQL Scripts For Function or Procedure For example, the following query returns the "create table" statement.

SQL> select dbms_metadata.get_ddl('TABLE', 'TBL_A') from dual;

DBMS_METADATA.GET_DDL('TABLE','TBL_A')
--------------------------------------------------------------------------------

  CREATE TABLE "PROD"."TBL_A"
   (    "ID" NUMBER,
        "VALUE" VARCHAR2(64),
        "VALUE2" VARCHAR2(32)
   ) SEGMENT CREATION IMMEDIATE
  PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
  STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE DE
FAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "PROD"

Monday, March 09, 2015

Terminate Oracle Session on Amazon RDS

To kill an Oracle session on Amazon RDS instance, in stead of using "alter system kill session", we use the procedure provided by Amazon rdsadmin.rdsadmin_util.kill(sid, serial#).
First, we find out the sid and serial# for the session that we want to terminated. We log in as the master user of an Amazon RDS Oracle instance.

SQL> select username, sid, serial#, command, status from v$session;

USERNAME                              SID    SERIAL#    COMMAND STATUS
------------------------------ ---------- ---------- ---------- --------
TESTPROD723                            12      56369          3 ACTIVE
                                       13         17          0 ACTIVE
RDSADMIN                               15         27          0 INACTIVE
TEST1001                               21      12787          0 INACTIVE
                                       28      50603          0 ACTIVE
                                      609          1          0 ACTIVE
                                      610          1          0 ACTIVE
                                      611          1          0 ACTIVE
                                      612          1          0 ACTIVE
                                      613          1          0 ACTIVE
                                      614          1          0 ACTIVE
                                      615          1          0 ACTIVE
                                      616        103          0 ACTIVE
                                      618          3          0 ACTIVE
                                      619          7          0 ACTIVE
                                      622          1          0 ACTIVE
                                      623          3          0 ACTIVE
                                      625        131          0 ACTIVE
TEST3                                 627      35083         74 ACTIVE

29 rows selected.

We are not allowed to run "alter system kill session".

SQL> alter system kill session '627,35083';
alter system kill session '627,35083'
*
ERROR at line 1:
ORA-01031: insufficient privileges

We can terminate a session using rdsadmin.rdsadmin_util.kill

SQL> exec rdsadmin.rdsadmin_util.kill(627,35083);

PL/SQL procedure successfully completed.

Debug Oracle PL/SQL scripts

To debug PL/SQL scripts, I have found two commands are extremely helpful, "set echo on" and "show errors". "Set echo on" prints the scripts in a file along with the line number. "show errors" displays the compilation errors. By combining the outputs from the above two commands, we can easily identify where the problems are and fix them. When we are done debugging, we can turn off the display using "set echo off". In the example below, I put my "create or replace procedure" in a script file called debug_proc.sql.

SQL> set echo on
SQL> @debug_proc.sql
SQL> create or replace procedure
  2  proc_test_ins(
  3  p1 number,
  4  p2 date)
  5
  6  is
  7
  8  begin
  9
 10  insert into tbl_test_x(
 11  id, dt)
 12
 13  select
 14  p1, p2 from dua;
 15  commit;
 16
 17
 18  end;
 19
 20  /

Warning: Procedure created with compilation errors.

The "set echo on" show the line number for each line of scripts. To see the detailed compilation errors, we run "show errors" command.

SQL> show errors
Errors for PROCEDURE PROC_TEST_INS:

LINE/COL
--------
ERROR
--------------------------------------------------------------
--------------------------------------------------------------
------------------------------------------------------------
10/1
PL/SQL: SQL Statement ignored

14/13
PL/SQL: ORA-00942: table or view does not exist

As we see, there is a typo on line 14, "from dua" should be "from dual".

SQL> create or replace procedure
  2  proc_test_ins(
  3  p1 number,
  4  p2 date)
  5
  6  is
  7
  8  begin
  9
 10  insert into tbl_test_x(
 11  id, dt)
 12
 13  select
 14  p1, p2 from dual;
 15
 16  commit;
 17
 18
 19  end;
 20
 21  /

Procedure created.

To turn off the display of the scripts and the line numbers, I run "set echo off".

SQL> set echo off
SQL> @debug_proc.sql

Procedure created.

Now we can run the procedure.

SQL> exec proc_test_ins(1, sysdate);
SQL> select * from  tbl_test_x;

     ID DT
------- ---------
      1 09-MAR-15

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics