Wednesday, January 24, 2018

Milestone Project Accomplished- Using Artificial Intelligence to Calculate Characteristic Town Index

West Lake in Hangzhou, China

I published an article today on my LinkedIn profile, Using Artificial Intelligence to Calculate Characteristic Town Index. In this article, I described a milestone project that we recently accomplished and how it made me instantly famous in the field. The following is the content of the LinkedIn article.

Characteristic town development is a hot topic in China. I am extremely excited to announce that, using artificial intelligence, we have successfully developed a Characteristic Town Index for 3,700 towns in China. Based on the index, we are able to identify the most promising towns for development objectively and efficiently. This is the first time that artificial intelligence has been used in characteristic town evaluation. It is a cross-team effort under my leadership as the CTO of Hangzhou Jingli Company.

I published an article, Use Big Data and Artificial Intelligence to Rationally Evaluate the Characteristic Town, in the influential "Economic Information Daily" (in Chinese) on January 10, 2018 (please use Google Translate if you are interested). It quickly became one of the most cited articles on the Internet in China in this field. On one website alone, the article has been read nearly 52 thousand times. Websites hosting the article consistently rank number one on Google and Baidu for the keywords "characteristic town assessment artificial intelligence" (in Chinese). After reading the article, a government official of a major city in China said, "Dr. Zhou's article should be read carefully and thoroughly. The points raised are thought-provoking." The article's main point is that data-driven artificial intelligence models have advantages over experience-based expert systems. The following are some highlights:
  • Multiple teams spent months collecting 69 relevant variables for the 3,700 towns. These variables cover climate, geography, economy, ecology, transportation, Internet infrastructure and so on.
  • Expert systems are subjective, rigid and static. It is hard to adjust an expert system based on the discrepancies between its outputs and reality; the feedback mechanism is lacking.
  • Data, machine learning models and applications should form closed loops. The data are used to train the models, the models' outputs provide decision support for applications, and the data are updated and enhanced as the models are applied in the real world, e.g., by finding better target variables. These closed loops allow the models to be improved iteratively.

This article has made me instantly famous in this field in China. However, the success did not come instantly at all. The core ideas are from my award-winning academic paper published two decades ago, "Using genetic learning neural networks for spatial decision making in GIS". I feel fulfilled that I have finally implemented the ideas in practical applications in the field of GIS.

Tuesday, October 10, 2017

Leading Data Providers

I am compiling a list of leading data providers. The list is continuously expanding. If your favorite data providers are not included and you feel they should be, please contact me at

Thomson Reuters

Thomson Reuters provides financial data, including company, market, valuation and price data. In addition, Open Calais from Thomson Reuters is a tool that extracts structured data from text.

Three Credit Bureaus

Three credit bureaus, Experian, Equifax and TransUnion offer consumer credit reports.


Acxiom

Acxiom's Personicx products divide consumers into clusters or segments according to different criteria so that companies can better market their products to targeted audiences. Personicx products include Lifestage, Digital, Financial and Hispanic.

U.S. Commodity Futures Trading Commission

Commitments of Traders reports are released weekly. They show the futures positions held by large and small traders in various markets, including financials, currencies and commodities. These reports are considered useful for predicting market direction, as it is commonly believed that smaller traders are always wrong.

United States Department of Transportation

Aviation Data & Statistics

National Oceanic and Atmospheric Administration

Climate Data Online: Dataset Discovery

United States Department of Labor- Bureau of Labor Statistics

Wage Statistics by Occupation by State. For example, the average wage for Computer and Information Research Scientists (code 15-1111) in Massachusetts is $125,870.

United States Department of Agriculture-Nutrient Data Laboratory

USDA Nutrient Database

Monday, October 02, 2017

A Great Book on SQL for Data Miners/Scientists

I have found an excellent book on SQL tailored for Data Miners/Scientists, Data Analysis Using SQL and Excel by Gordon S. Linoff. Unlike other technical SQL books, this one is highly practical with many examples that can be applied directly to business problems. I highly recommend it.

Thursday, September 07, 2017

About My Oracle SQL for Data Science Course

On January 31, 2017, I was invited by Prof. Wei Ding of the Department of Computer Science, University of Massachusetts Boston, to give 3 talks about my data science projects across different industries. The talks were extremely well received. The following is what Prof. Ding said about them.

"It was a fortune to have Jay come to our computer science department to share his experience in solving business problems with predictive analytics on February 28, 2017. What Jay had presented in his 3 talks, each lasting for 1 hour in different topics of data mining, was totally impressive and beyond our wildest expectation. Having built competition-winning predictive models for some of the biggest companies and produced hundreds of millions of dollars’ savings, Jay shared the secret of his success with students and faculty without reservation. His strong presentations were such an inspiration for our computer science students and faculty and his methodology was innovative and powerful , even for very seasoned data scientists among the audience. Jay, thank you so much for your hard work preparing and delivering these presentations!" -Prof. Ding Wei, Department of Computer Science, University of Massachusetts Boston

The audience was particularly amazed by how I came up with solutions in the Oracle SQL environment. To share my expertise, I created the online course Oracle SQL for Data Science to show how to perform common data science tasks using Oracle SQL and the benefits of doing so.

I let Charlie Berger, Senior Director of Product Management, Machine Learning, AI and Cognitive Analytics at Oracle, know about my course, and he told me, "Your course is amazing."

Tuesday, August 29, 2017

Stop Amazon RDS Oracle Instance

I have gone through many steps to build predictive models on the Oracle instance on Amazon AWS and have presented the results to the management. We decide to stop the Oracle instance so that we will not pay for instance hours (but we are still charged for provisioned storage). The stopped instance can be restarted later.

The instance is being stopped.

Monday, August 14, 2017

Create Predictive Models in Oracle by Bootstrap Aggregating (Bagging)

I have a training data set of 136 records; 25 of them are positive examples and the rest are negative. The goal of the project is to build a predictive model that gives the probability of a data point being positive. A logistic regression model is selected for its structural and implementation simplicity. To make the model more robust and able to perform reasonably well on new data, I decide to build 20 logistic regression models, each based on a set of 136 records randomly sampled with replacement from the original training set. The prediction probabilities produced by these 20 models are averaged to arrive at the final score. My first step is to generate the 20 random samples from the original training set. Each sampled set has the same size as the original, but any given record may be picked zero, one or more times. I write the following PL/SQL to do the 20 rounds of random sampling with replacement.

create table t_bagging (iter number, id number);

declare
  i number;
begin
  for i in 1..20 loop
    -- each round inserts 136 random identifiers (1..136), i.e. one sample with replacement
    insert into t_bagging select i, 1+mod(abs(dbms_random.random),136) from t_train;
  end loop;
end;
/
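
As a quick sanity check (my own addition, not part of the original post), the following query confirms that each of the 20 iterations contains 136 sampled identifiers between 1 and 136, with fewer distinct identifiers than rows, as expected when sampling with replacement.

select iter, count(*) as n_rows, count(distinct id) as n_distinct,
       min(id) as min_id, max(id) as max_id
from t_bagging
group by iter
order by iter;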

In the sampling script above, t_train is the original training set of 136 records, each with a unique identifier from 1 to 136. The function dbms_random.random generates a uniformly distributed random integer between -2^31 and 2^31. I make the random number non-negative by taking its absolute value with the abs() function. The mod() function then forces the number into the range 0 to 135, and adding 1 shifts the range to 1 to 136. Next, I write the following script to create 20 views which will be used as the new training sets for building the 20 models.

declare
  sqlstr varchar2(512);
begin
  for i in (select distinct iter from t_bagging order by iter) loop
    -- each view joins one iteration's sampled ids back to the original training
    -- records; the join key is assumed to be the record identifier (id)
    sqlstr := 'create or replace view v_tr_bag'||i.iter||' as select a.*'||
              ' from t_train a, t_bagging c'||
              ' where a.id = c.id and c.iter='||i.iter;
    execute immediate sqlstr;
  end loop;
end;
/
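
The post stops at the 20 bagged training views and does not show the final scoring step. Below is a minimal sketch of how the 20 predictions could be averaged, assuming the 20 logistic regression models are built in-database (for example with Oracle's DBMS_DATA_MINING package), that they are named glm_bag1 through glm_bag20 (hypothetical names), that the positive class is coded as 1, and that the new records sit in a hypothetical table t_score.

select id,
       ( prediction_probability(glm_bag1, 1 using *)
       + prediction_probability(glm_bag2, 1 using *)
       -- ... the corresponding terms for glm_bag3 through glm_bag19 ...
       + prediction_probability(glm_bag20, 1 using *) ) / 20 as bagged_probability
from t_score;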

Wednesday, August 09, 2017

Load a Text File to Oracle Database on Amazon

I have a text file on my laptop that I want to load into an Oracle database on Amazon. The only tool available is the SQL*Plus client on my laptop, from which I can connect to the database and run queries. The text file has 3,735 records and 17 columns, as shown below.

I decide to create a SQL script file that contains 3,735 "insert" SQL statements, connect from the SQL*Plus client to the database, and run the script. As shown below, I use the Linux awk command to make an insert statement for each record.
$ cat x.txt | awk '{print "insert into t_data values(",$0,");"}' > insert_data2.sql
$ cat insert_data2.sql
insert into t_data values( 1.0,65.79,329.0,...,4.0 );
insert into t_data values( 2.0,224.9,146.0,...,10.0 );
insert into t_data values( 3.0,113.0,113.0,...,9.0 );
insert into t_data values( 4.0,175.4,28.0,...,7.0 );
I open Notepad and add "begin" at the beginning of insert_data2.sql and "end;" followed by "/" on its own line at the end, so that the insert statements are wrapped in a PL/SQL block. By doing so, all the insert statements are treated as one transaction, i.e., either all 3,735 insert statements take effect or none do. I don't want data partially inserted.
begin
insert into t_data values( 1.0,65.79,329.0,...,4.0 );
insert into t_data values( 2.0,224.9,146.0,...,10.0 );
insert into t_data values( 3.0,113.0,113.0,...,9.0 );
insert into t_data values( 4.0,175.4,28.0,...,7.0 );
...
end;
/
I connect to the Oracle database on Amazon and run insert_data2.sql. It takes less than 16 seconds to insert the 3,735 records. Not bad.
SQL> @insert_data2

PL/SQL procedure successfully completed.

Elapsed: 00:00:15.93
SQL> select count(1) from t_data;


Elapsed: 00:00:00.39
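
One follow-up worth noting (my addition, not shown in the original post): the rows inserted by the anonymous block are still part of an open transaction at this point, so a commit is needed to make them permanent and visible to other sessions (by default, SQL*Plus also issues a commit on a normal exit).

-- make the inserted rows permanent and visible to other sessions
commit;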