Wednesday, July 19, 2017

NOAA Comprehensice Large Array-Data Stewdship System (CLASS)

I am working on a project that needs climate data about many towns China. I looked around and a friend who works for National Oceanic and Atmospheric Administration (NOAA) mentioned that NOAA Comprehensice Large Array-Data Stewdship System (CLASS) The site contains many data sources derived from earth observation satellites.

Saturday, July 15, 2017

19 Free Public Data Sets From Springboard

Springboard "19 Free Public Data Sets For Your First Data Science Project" has good information about free data sets such as United States Census Data, FBI Crime Data and CDC Cause of Death etc. I also added the link to my post Free Data Sources

Wednesday, June 28, 2017

Count Unique Values in Oracle and Microsoft Access

I have found that it is convenient to use Microsoft Access to create an external link to a Spreadsheet and run SQL queries against it.

In a recent project, I want to count the name of unique name in table t_test. If it is an Oracle table, it can be done using "count(distinct name)" as the following:

select count(distinct name) from t_test;
However, t_test is an linked external spreadsheet within an Access database. Access does not support "count(distinct "). So I first find the unique name and then count the number of records using the following query:
select count(*) from
SELECT  distinct name
FROM t_test

Friday, May 05, 2017

Current Schema Name

To find out the current schema that we are connect to, we may use function sys_context(). I connected to Oracle SQL Live at and ran the following query:

select sys_context('USERENV', 'SESSION_USER') from dual;

The returned result showed that the schema name was APEX_PUBLIC_USER.

Sunday, February 26, 2017

Online Training Course- Oracle SQL for Data Science

Upon the requests of many people who listened to my presentations, I have built the online course Oracle SQL for Data Science. This course requires that students already have basic knowledge about SQL. The following is the description of the course.

Most of the core business data in an enterprise are stored in relational databases that support Structure Query Language (SQL). When data scientists perform common tasks such as data cleanse, validation, manipulation and feature variable calculation using SQL within database environment, they can achieve important advantages such as more compact code, easy deployment and security in comparison to moving the data outside of the database to a separate analytics environment. The instructor Dr. Jiang Zhou is one of the pioneers in developing SQL based in-database analytics solutions that are used by banks and insurance companies. In addition to his technical skills, he is, as one of his clients put it, "a great trainer, and a good presenter of theoretical data mining concepts so that they can be understood by most".
In this course, students will learn practical Oracle SQL skills to solve problems such as:

  • Data Validation
  • Data Summary
  • Detect and Remove Duplicates
  • Binning Variable Based on Equal Frequency
  • Build Good Variables for Predictive Models or Business Rules, e.g., RFM Analysis, Time Elapse Since Last Purchase, Number of Transactions in Last 3 Days, Moving Average Purchase Amount in Last 7 Days
  • Random Sampling
  • Gain Chart
  • Using View to Organize Process Flows
  • Histogram
There are totally about 4 hours and 30 minutes video presentations. SQL scripts that create data sets and perform the Data Science tasks are provided. Slides are included so that students can easily find the topics that they interested in.