Sunday, October 27, 2013

Categorical Variables in Logistic Regression

In the earlier post Build Predictive Models Using PL/SQL, we showed how to call the DBMS_DATA_MINING.CREATE_MODEL() procedure to build a logistic regression model.

The input data set should contain only the following columns:
1. a unique case id;
2. the target variable;
3. the independent variables used in the model. All variables other than the case id and the target will be used as input variables to the model.

We can easily construct the input data set as a view over a data table. In the view, we specify the case id, the target variable, and the desired independent variables in the SELECT list.

Independent variables are either numeric or character types. Character columns, such as state name or male/female, are categorical variables. Oracle models automatically treat the most frequent category as the reference class and assign it a weight of zero, which is very convenient. For example, we built a model that uses transaction code as one of its input variables, as mentioned in the post Logistic Regression Model Implemented in SQL.

Take a look at the piece of SQL code below. It converts txn_code into a weight derived from a logistic regression model. substr(TXN_CODE,1,18) takes only the first 18 characters of txn_code (in case the code is too long), and nvl() replaces a missing value with a blank. Txn_code 'XX' receives a weight of -.070935; a NULL or blank value receives -.330585. If a new txn_code shows up in production that was unseen in the training data set, say 'ZZZ', it receives the default weight of 0, the weight of the most frequent txn_code. This is reasonable: we assume an unseen code behaves like the historically most common one, so the model can produce a sensible score under any circumstance. This example also shows how important it is to design a model that can handle unseen values after it is deployed.

decode(nvl(substr((TXN_CODE1),1,18),' '),
'XX',-.070935,
'57',-.192319,
'1',-.053794,
'81',-.010813,
'NR',-.079628,
'PD',-.102987,
'P',-1.388433,
'Z6',-.106081,
'01',-1.1528,
'Z4',-.004237,
'T1',.697737,
'AK',-.490381,
'U2',.063712,
'NK',.054354,
'PR',.205336,
'51',-.286213,
'N',.075582,
' ',-.330585,
0)
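The decode() above amounts to a lookup table with a default. A minimal Python sketch of the same logic (the weights are copied from the SQL statement; the dictionary and function names are ours, for illustration only):

```python
# Weights copied from the decode() statement above.
TXN_CODE_WEIGHTS = {
    'XX': -0.070935, '57': -0.192319, '1': -0.053794, '81': -0.010813,
    'NR': -0.079628, 'PD': -0.102987, 'P': -1.388433, 'Z6': -0.106081,
    '01': -1.1528, 'Z4': -0.004237, 'T1': 0.697737, 'AK': -0.490381,
    'U2': 0.063712, 'NK': 0.054354, 'PR': 0.205336, '51': -0.286213,
    'N': 0.075582, ' ': -0.330585,
}

def txn_code_weight(txn_code):
    """Mirror nvl(substr(code,1,18),' ') plus the decode() lookup."""
    code = (txn_code or ' ')[:18]          # nvl() -> blank; substr() -> first 18 chars
    return TXN_CODE_WEIGHTS.get(code, 0)   # unseen codes default to 0

print(txn_code_weight('XX'))   # -0.070935
print(txn_code_weight(None))   # -0.330585 (missing value treated as blank)
print(txn_code_weight('ZZZ'))  # 0 (code unseen in training)
```

The key design choice is the default of 0 for unseen codes, which keeps the model scoring gracefully in production.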

Normally, the above SQL code that converts values to weights is not necessary; instead, we would use the model mining object and the prediction_probability function. I took this approach simply because the database administrators of the production databases were unaware of Oracle mining objects and did not feel comfortable using them. Thus, to get predictive models deployed into production systems, data miners need to be flexible. I have seen too many good models built in labs that never got deployed.

Tuesday, October 01, 2013

A Quick Way to Import Spreadsheet Into a Relational Database

I just made a YouTube video to show readers how to do this. We often need to import data from an Excel spreadsheet, such as the one shown below, into a database.

A quick way that works for any database is to generate a SQL insert statement for each row, using a formula similar to the one shown in column C of the picture below.

Formula: =CONCATENATE("insert into tbl_dataset values(",A2,",","'",B2,"');")

We create a destination table. Then we copy those "insert into" statements from the Excel spreadsheet and paste them into a SQL client tool such as SQL Developer or SQL*Plus to run them. Do not forget to commit the inserts. This approach has two advantages:
1. It works for any relational database, as the insert statements are standard SQL (and if not, the spreadsheet formula can be adjusted slightly).
2. It does not require any data import tools. All we need are the Excel spreadsheet and a SQL client to run the create table and insert statements.
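The same statements the spreadsheet formula produces can also be generated with a few lines of script when the data are already in rows. A small sketch (the table name tbl_dataset matches the example below; the rows are hard-coded for illustration, and single quotes are doubled to keep the generated SQL valid):

```python
# Generate insert statements like the CONCATENATE formula does.
rows = [(1, 'A'), (2, 'B'), (3, 'C')]

def to_insert(col1, col2):
    safe = col2.replace("'", "''")  # escape single quotes by doubling them
    return f"insert into tbl_dataset values({col1},'{safe}');"

for r in rows:
    print(to_insert(*r))
# insert into tbl_dataset values(1,'A');
# insert into tbl_dataset values(2,'B');
# insert into tbl_dataset values(3,'C');
```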

SQL> create table tbl_dataset (col1 number, col2 varchar2(8));
Table created.

Run the following insert statements. If there are many lines, we can put them in a script file and run the script file.

insert into tbl_dataset values(1,'A');
insert into tbl_dataset values(2,'B');
insert into tbl_dataset values(3,'C');
insert into tbl_dataset values(4,'D');
insert into tbl_dataset values(5,'E');
insert into tbl_dataset values(6,'F');
insert into tbl_dataset values(7,'G');
insert into tbl_dataset values(8,'H');
insert into tbl_dataset values(9,'I');

SQL> insert into tbl_dataset values(1,'A');
1 row created.
SQL> insert into tbl_dataset values(2,'B');
1 row created.
SQL> insert into tbl_dataset values(3,'C');
1 row created.
SQL> insert into tbl_dataset values(4,'D');
1 row created.
SQL> insert into tbl_dataset values(5,'E');
1 row created.
SQL> insert into tbl_dataset values(6,'F');
1 row created.
SQL> insert into tbl_dataset values(7,'G');
1 row created.
SQL> insert into tbl_dataset values(8,'H');
1 row created.
SQL> insert into tbl_dataset values(9,'I');
1 row created.

Do not forget to commit the changes.

SQL> commit;
Commit complete.

The data are now imported into the database.

SQL> select * from tbl_dataset;

COL1 COL2
---- ----
   1 A
   2 B
   3 C
   4 D
   5 E
   6 F
   7 G
   8 H
   9 I

9 rows selected.