Monday, August 14, 2017

Create Predicitve Models in Oracle by Bootstrap Aggregating (Bagging)

I have a training data set of 136 records, 25 of them are positive examples and the remaining negative. The goal of the project is to build a predictive model that gives the probabilities of data points being positive. A logistic regression model is selected for its structural and implementation simplicity. To make the model more robust and able to perform reasonably well on new data set, I decide to build 20 logistic models, each based on a randomly sampled set of the original 136 records with replacement. The prediction probabilities produced by these 20 models are averaged to arrive at the final score. My first step is to generate 20 random sampling sets of 136 records from the original training set. The sampled set will have the same size but some records will be picked more zero, one or more than one times. I write the following PL/SQL to do the 20 rounds of random sampling with replacement.


create table t_bagging (iter number, id number);

declare
i number;
begin
 for i in 1..20
 loop
 insert into t_bagging select i, 1+mod(abs(dbms_random.random),136) from t_train;
 dbms_output.put_line(i);
 end loop;
 commit;
end;
/

In the above script, t_trian is the original training set having 136 records with unique identifier starting from 1 to 136. The function dbms_random.random generates a uniformly distributed random integer from from -2^^31 to 2^^31. I make the random number positive by taking the absolute value using abs() function. Mod() function forces the random number to be within the range of 0 and 135. I also add 1 after applying mod function so that its range becomes from 1 to 136. Next, I write the following script to create 20 views which will be used as the new training sets for building 20 models.

declare
sqlstr varchar2(512);
begin
 for i in (select distinct iter from T_BAGGING order by iter)
 loop
 sqlstr:='create or replace view v_tr_bag'||i.iter||' as select '||
         '* from t_training a, T_BAGGING c'||
         ' where a.id=c.id and c.iter='||i.iter;
--dbms_output.put_line(sqlstr);
execute immediate sqlstr;
 end loop;
end;
/

Wednesday, August 09, 2017

Load a Text File to Oracle Database on Amazon

I have a text file on my laptop that I want to load into an Oracle database on Amazon. The only tool available is SQLPLUS client on my laptop from which I can connect to the database to run query. The text file has 3,735 records and 17 columns as shown below.

1.0,65.79,329.0,...,4.0
2.0,224.9,146.0,...,10.0
3.0,113.0,113.0,...,9.0
4.0,175.4,28.0,...,7.0
I decide to create a SQL script file that contains 3,735 "insert" SQL statements, connect from SQLPLUS client to the database and run the SQL script file. As shown in the following scripts, I use Linux awk command to make insert statement for each records.
$ cat x.txt | awk '{print "insert into t_data values(",$0,");"}' > insert_data2.sql
$ cat insert_data2.sql
insert into t_data values( 1.0,65.79,329.0,...,4.0 );
insert into t_data values( 2.0,224.9,146.0,...,10.0 );
insert into t_data values( 3.0,113.0,113.0,...,9.0 );
insert into t_data values( 4.0,175.4,28.0,...,7.0 );
.................
I open notepad and add "begin " at the beginning of the insert_data2.sql and "end; /" at the end of insert_data2.sql to make these insert statements within a PL/SQL block. By doing so, all these insert statments will be treated as one transaction, i.e., all 3,735 insert statements have to be executed fully or not at all. I don't want data partially inserted.
begin
insert into t_data values( 1.0,65.79,329.0,...,4.0 );
insert into t_data values( 2.0,224.9,146.0,...,10.0 );
insert into t_data values( 3.0,113.0,113.0,...,9.0 );
insert into t_data values( 4.0,175.4,28.0,...,7.0 );
.................
end; /
I connect to the Oracle database on Amazon and run the insert_data2.sql. It takes less than 16 seconds to insert 3725 records. Not bad.
SQL> @insert_data2

PL/SQL procedure successfully completed.

Elapsed: 00:00:15.93
SQL> select count(1) from t_data;

  COUNT(1)
----------
      3735

Elapsed: 00:00:00.39

Thursday, August 03, 2017

Connect to Oracle Instance on Amazon RDS

When I try to connect to the instance that I just started on Amazon, I get "ORA-12170: TNS:Connect timeout occurred".

$ sqlplus adminxxx/abcxxxx@xxxxx.ctbxjhstgq8o.us-east-1.rds.amazonaws.com:1521/orcl

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 3 12:18:04 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

ERROR:
ORA-12170: TNS:Connect timeout occurred


Enter user-name:
I suspect it has to to with the security group. I click the security group
Then I add port 1521 to the inbound rules.
I save the setting and am able to connect to the instance from my laptop.
$ sqlplus adminxxx/abcxxxx@xxxxx.ctbxjhstgq8o.us-east-1.rds.amazonaws.com:1521/orcl

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 3 15:20:17 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Last Successful login time: Thu Aug 03 2017 15:18:16 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select count(1) from user_tables;

  COUNT(1)
----------
         0

Starting Oracle Instance On Amazon

I am working on a project to analyze the tourism data. The raw data are delivered to me in the format of Microsoft Excel spreadsheet files. I open the files and find the data are messy. After I manage to combine 6 spreadsheet files into a single sheet using Microsoft Access, I decide to load the single sheet into an Oracle database as a table. All columns will be defined as varchar2 type initially. My plan is to clean up those varchar2 data type columns, such as average annual temperature, and convert them to numeric using rich set of Oracle string functions. I don't have the access to any Oracle instance and I decide to start one on Amazon's platform using RDS. It takes me about 2 minutes to fill in those necessary parameters and the database is starting.


After about 5 minutes, the instance is ready to be connected. Cool!