Tuesday, October 10, 2017

Leading Data Providers

I am compiling a list of leading data providers. This list is continuously expanding. If your favorite data providers are not included here and you feel they should be included in the list, please contact me at jay.zhou@deep-data-mining.com.

Thomson Reuters

Thomson Reuters provides financial data, financial data,including company, market, valuation and price. In addition, Open Calais from Thomson Reuters is a tool that extracts structured data from text.

Three Credit Bureaus

Three credit bureaus, Experian, Equifax and TransUnion offer consumer credit reports.


Acxiom's Personicx products divide consumers into clusters or segments according to different criteria to help companies can better market their products to targeted audience. Personnicx products include Lifestage, Digital, Financial and Hispanic.

U.S. Commodity Futures Trading Commission

Commitments of Traders reports are released weekly. They show the future positions held by large and small traders in various markets, including financial, currencies and commodities. These reports are considered useful for traders in predicting market movement directions as it is commonly believed that smaller traders are always wrong.

United States Department of Transportation

Aviation Data & Statistics

National Oceanic and Atmospheric Administration

Climate Data Online: Dataset Discovery

United States Department of Labor- Bureau of Labor Statistics

Wage Statistics by Occupation by State. For example, the average wage for Computer and Information Research Scientists (code15-1111) in Massachusetts is $125,870.

United States Department of Agriculture-Nutrient Data Laboratory

USDA Nutrient Database

Monday, October 02, 2017

A Great Book on SQL for Data Miners/Scientists

I have found an excellent book on SQL tailored for Data Miners/Scientists, Data Analysis Using SQL and Excel by Gordon S. Linoff. Unlike other technical SQL books, this one is highly practical with many examples that can be applied directly to business problems. I highly recommend it.

Thursday, September 07, 2017

About My Oracle SQL for Data Science Course

On January 31, 2017, I was invited by Prof. Wei Ding at Department of Computer Science, University of Massachusetts Boston, and gave 3 talks about my data science projects across different industries. These talks are extremely well received. The following is what Prof. Ding says about my talks.

"It was a fortune to have Jay come to our computer science department to share his experience in solving business problems with predictive analytics on February 28, 2017. What Jay had presented in his 3 talks, each lasting for 1 hour in different topics of data mining, was totally impressive and beyond our wildest expectation. Having built competition-winning predictive models for some of the biggest companies and produced hundreds of millions of dollars’ savings, Jay shared the secret of his success with students and faculty without reservation. His strong presentations were such an inspiration for our computer science students and faculty and his methodology was innovative and powerful , even for very seasoned data scientists among the audience. Jay, thank you so much for your hard work preparing and delivering these presentations!" -Prof. Ding Wei, Department of Computer Science, University of Massachusetts Boston

The audience are particularly amazed by how I come up with solutions using Oracle SQL environment. To share my expertise, I create the online course Oracle SQL for Data Science to show how to perform common data science tasks using Oracle SQL and the benefits for doing that.

I let Charlie Berger,Senior Director of Product Management, Machine Learning, AI and Cognitive Analytics at Oracle know about my course and he told me "Your course is amazing."

Tuesday, August 29, 2017

Stop Amazon RDS Oracle Instance

I have gone through many steps to build predictive models on the Oracle instance on Amazon AWS and presented the results to the management. We decide to stop the Oracle instance so that we will not pay instance hours (but we are still charged for provisioned storage). The stopped instance can be restarted.

The instance is being stopped.

Monday, August 14, 2017

Create Predicitve Models in Oracle by Bootstrap Aggregating (Bagging)

I have a training data set of 136 records, 25 of them are positive examples and the remaining negative. The goal of the project is to build a predictive model that gives the probabilities of data points being positive. A logistic regression model is selected for its structural and implementation simplicity. To make the model more robust and able to perform reasonably well on new data set, I decide to build 20 logistic models, each based on a randomly sampled set of the original 136 records with replacement. The prediction probabilities produced by these 20 models are averaged to arrive at the final score. My first step is to generate 20 random sampling sets of 136 records from the original training set. The sampled set will have the same size but some records will be picked more zero, one or more than one times. I write the following PL/SQL to do the 20 rounds of random sampling with replacement.

create table t_bagging (iter number, id number);

i number;
 for i in 1..20
 insert into t_bagging select i, 1+mod(abs(dbms_random.random),136) from t_train;
 end loop;

In the above script, t_trian is the original training set having 136 records with unique identifier starting from 1 to 136. The function dbms_random.random generates a uniformly distributed random integer from from -2^^31 to 2^^31. I make the random number positive by taking the absolute value using abs() function. Mod() function forces the random number to be within the range of 0 and 135. I also add 1 after applying mod function so that its range becomes from 1 to 136. Next, I write the following script to create 20 views which will be used as the new training sets for building 20 models.

sqlstr varchar2(512);
 for i in (select distinct iter from T_BAGGING order by iter)
 sqlstr:='create or replace view v_tr_bag'||i.iter||' as select '||
         '* from t_training a, T_BAGGING c'||
         ' where a.id=c.id and c.iter='||i.iter;
execute immediate sqlstr;
 end loop;

Wednesday, August 09, 2017

Load a Text File to Oracle Database on Amazon

I have a text file on my laptop that I want to load into an Oracle database on Amazon. The only tool available is SQLPLUS client on my laptop from which I can connect to the database to run query. The text file has 3,735 records and 17 columns as shown below.

I decide to create a SQL script file that contains 3,735 "insert" SQL statements, connect from SQLPLUS client to the database and run the SQL script file. As shown in the following scripts, I use Linux awk command to make insert statement for each records.
$ cat x.txt | awk '{print "insert into t_data values(",$0,");"}' > insert_data2.sql
$ cat insert_data2.sql
insert into t_data values( 1.0,65.79,329.0,...,4.0 );
insert into t_data values( 2.0,224.9,146.0,...,10.0 );
insert into t_data values( 3.0,113.0,113.0,...,9.0 );
insert into t_data values( 4.0,175.4,28.0,...,7.0 );
I open notepad and add "begin " at the beginning of the insert_data2.sql and "end; /" at the end of insert_data2.sql to make these insert statements within a PL/SQL block. By doing so, all these insert statments will be treated as one transaction, i.e., all 3,735 insert statements have to be executed fully or not at all. I don't want data partially inserted.
insert into t_data values( 1.0,65.79,329.0,...,4.0 );
insert into t_data values( 2.0,224.9,146.0,...,10.0 );
insert into t_data values( 3.0,113.0,113.0,...,9.0 );
insert into t_data values( 4.0,175.4,28.0,...,7.0 );
end; /
I connect to the Oracle database on Amazon and run the insert_data2.sql. It takes less than 16 seconds to insert 3725 records. Not bad.
SQL> @insert_data2

PL/SQL procedure successfully completed.

Elapsed: 00:00:15.93
SQL> select count(1) from t_data;


Elapsed: 00:00:00.39

Thursday, August 03, 2017

Connect to Oracle Instance on Amazon RDS

When I try to connect to the instance that I just started on Amazon, I get "ORA-12170: TNS:Connect timeout occurred".

$ sqlplus adminxxx/abcxxxx@xxxxx.ctbxjhstgq8o.us-east-1.rds.amazonaws.com:1521/orcl

SQL*Plus: Release Production on Thu Aug 3 12:18:04 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

ORA-12170: TNS:Connect timeout occurred

Enter user-name:
I suspect it has to to with the security group. I click the security group
Then I add port 1521 to the inbound rules.
I save the setting and am able to connect to the instance from my laptop.
$ sqlplus adminxxx/abcxxxx@xxxxx.ctbxjhstgq8o.us-east-1.rds.amazonaws.com:1521/orcl

SQL*Plus: Release Production on Thu Aug 3 15:20:17 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Last Successful login time: Thu Aug 03 2017 15:18:16 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select count(1) from user_tables;


Starting Oracle Instance On Amazon

I am working on a project to analyze the tourism data. The raw data are delivered to me in the format of Microsoft Excel spreadsheet files. I open the files and find the data are messy. After I manage to combine 6 spreadsheet files into a single sheet using Microsoft Access, I decide to load the single sheet into an Oracle database as a table. All columns will be defined as varchar2 type initially. My plan is to clean up those varchar2 data type columns, such as average annual temperature, and convert them to numeric using rich set of Oracle string functions. I don't have the access to any Oracle instance and I decide to start one on Amazon's platform using RDS. It takes me about 2 minutes to fill in those necessary parameters and the database is starting.

After about 5 minutes, the instance is ready to be connected. Cool!

Wednesday, July 19, 2017

NOAA Comprehensice Large Array-Data Stewdship System (CLASS)

I am working on a project that needs climate data about many towns China. I looked around and a friend who works for National Oceanic and Atmospheric Administration (NOAA) mentioned that NOAA Comprehensice Large Array-Data Stewdship System (CLASS) The site contains many data sources derived from earth observation satellites.

Saturday, July 15, 2017

19 Free Public Data Sets From Springboard

Springboard "19 Free Public Data Sets For Your First Data Science Project" has good information about free data sets such as United States Census Data, FBI Crime Data and CDC Cause of Death etc. I also added the link to my post Free Data Sources

Wednesday, June 28, 2017

Count Unique Values in Oracle and Microsoft Access

I have found that it is convenient to use Microsoft Access to create an external link to a Spreadsheet and run SQL queries against it.

In a recent project, I want to count the name of unique name in table t_test. If it is an Oracle table, it can be done using "count(distinct name)" as the following:

select count(distinct name) from t_test;
However, t_test is an linked external spreadsheet within an Access database. Access does not support "count(distinct "). So I first find the unique name and then count the number of records using the following query:
select count(*) from
SELECT  distinct name
FROM t_test

Friday, May 05, 2017

Current Schema Name

To find out the current schema that we are connect to, we may use function sys_context(). I connected to Oracle SQL Live at https://livesql.oracle.com and ran the following query:

select sys_context('USERENV', 'SESSION_USER') from dual;

The returned result showed that the schema name was APEX_PUBLIC_USER.

Sunday, February 26, 2017

Online Training Course- Oracle SQL for Data Science

Upon the requests of many people who listened to my presentations, I have built the online course Oracle SQL for Data Science. This course requires that students already have basic knowledge about SQL. The following is the description of the course.

Most of the core business data in an enterprise are stored in relational databases that support Structure Query Language (SQL). When data scientists perform common tasks such as data cleanse, validation, manipulation and feature variable calculation using SQL within database environment, they can achieve important advantages such as more compact code, easy deployment and security in comparison to moving the data outside of the database to a separate analytics environment. The instructor Dr. Jiang Zhou is one of the pioneers in developing SQL based in-database analytics solutions that are used by banks and insurance companies. In addition to his technical skills, he is, as one of his clients put it, "a great trainer, and a good presenter of theoretical data mining concepts so that they can be understood by most".
In this course, students will learn practical Oracle SQL skills to solve problems such as:

  • Data Validation
  • Data Summary
  • Detect and Remove Duplicates
  • Binning Variable Based on Equal Frequency
  • Build Good Variables for Predictive Models or Business Rules, e.g., RFM Analysis, Time Elapse Since Last Purchase, Number of Transactions in Last 3 Days, Moving Average Purchase Amount in Last 7 Days
  • Random Sampling
  • Gain Chart
  • Using View to Organize Process Flows
  • Histogram
There are totally about 4 hours and 30 minutes video presentations. SQL scripts that create data sets and perform the Data Science tasks are provided. Slides are included so that students can easily find the topics that they interested in.

Monday, February 13, 2017

Display Date in SQLPLUS Client

When I select column of date type within SQLPLUS, I got the following result.

SQL> select txn_date from card_txn where rownum <10;


9 rows selected.
Since I want to see the hour, minute and second of a transaction, I use to_char() to format the date.
SQL> select to_char(txn_date,'YYYYMMDD:hh24:mi:ss') from card_txn 
where rownum <10;


9 rows selected.
If we want to display dates in desired format without using to_char() conversion, we can use the following command. After running this command, all dates will be automatically formatted.
SQL> alter session set nls_date_format='YYYYMMDD:hh24:mi:ss';

Session altered.

SQL> select txn_date from card_txn where rownum <10;


9 rows selected.

Friday, February 10, 2017

Troubleshooting: Could not Connect to Amazon RDS Oracle Instance

I could not connect to the Oracle Instance on Amazon RDS platform when I worked in the town library. I followed the following steps and resolved the issue:

1. Find my IP address using Google Search "my ip address". Google will display my ip address 64.64.117.xx.

2. Go to Amazon AWS console/Services/RDS/Dashboard/Instance. Right click the instance and select "See details".

3. From the detail view, click the link to next to "Security Groups".
4. At the bottom section, click Inbound and then Edit

5. Click Add Rule and add the IP address 64.64.117.xx/32.

Wednesday, February 08, 2017

Oracle SQLPLUS Client Installation on Windows 64

I want to run SQLPLUS interactive command environment to query Oracle databases. When I type my SQL commands and hit return key, I get the response from the database right away. It is like I am having a conversation with the data. To me, this is a more preferable way than using a GUI-based tool. I wrote a post 3 years ago, Oracle SQLPLUS Client Installation on Windows Troubleshooting. But that one was about installing 32 bit version installation. Since I am installing 64 bit version on my Windows Machine, I write down the steps.

Step 1. Go to the Oracle site:
Step 2. Accept OTN License Agreement and download the following two files:

   Instant Client Package - Basic: All files required to run OCI, OCCI, 
and JDBC-OCI applications
Download instantclient-basic-windows.x64- (72,416,242 bytes) 
(cksum - 1751087003)
  Instant Client Package - SQL*Plus: Additional libraries and executable for 
running SQL*Plus with Instant Client Download 
instantclient-sqlplus-windows.x64- (876,515 bytes) 
(cksum - 1826332995) 
Step 3. I unzip both zip files and put files under the same directory. I run the following command and am able to connect the my Oracle DB on Amazon cloud.
sqlplus myusername/mypassword@mysid.ctbxjhstgq8o.us-east-1.rds.amazonaws.com:1521/mysid

Thursday, January 26, 2017

Dr. Jiang Zhou will Give a Talk at The University of Massachusetts Boston

I am going to give a talk at Department of Computer Science, the University of Massachusetts Boston, on January 31. The following is the abstract:

Abstract: Will a new mobile phone customer pay her monthly bill? Is a check deposited into a bank customer’s account fraudulent? Will an insurance claim become expensive? For nearly two decades, Dr. Jiang (Jay) Zhou has been helping mobile phone service providers, banks, insurance companies and other businesses solve problems like these. The predictive models that he built have resulted in over $200 million savings for clients. In the seminar, Dr. Zhou will discuss topics including challenges and best practices when building these predictive models in real world environments.