Tuesday, January 15, 2019

About Dr. Zhou's Oracle SQL for Data Science Course


On January 31, 2017, I was invited by Prof. Wei Ding of the Department of Computer Science, University of Massachusetts Boston, to give three talks about my data science projects across different industries. The talks were extremely well received. The following is what Prof. Ding said about them.

"It was a fortune to have Jay come to our computer science department to share his experience in solving business problems with predictive analytics on February 28, 2017. What Jay had presented in his 3 talks, each lasting for 1 hour in different topics of data mining, was totally impressive and beyond our wildest expectation. Having built competition-winning predictive models for some of the biggest companies and produced hundreds of millions of dollars’ savings, Jay shared the secret of his success with students and faculty without reservation. His strong presentations were such an inspiration for our computer science students and faculty and his methodology was innovative and powerful, even for very seasoned data scientists among the audience. Jay, thank you so much for your hard work preparing and delivering these presentations!" -Prof. Ding Wei, Department of Computer Science, University of Massachusetts Boston

The audience was particularly amazed by how I came up with solutions in the Oracle SQL environment. To share my expertise, I created the online course Oracle SQL for Data Science, which shows how to perform common data science tasks using Oracle SQL and the benefits of doing so.

I let Charlie Berger, Senior Director of Product Management, Machine Learning, AI and Cognitive Analytics at Oracle, know about my course, and he told me, "Your course is amazing."

Deep Learning World


The premier conference covering the commercial deployment of deep learning

Deep Learning World is the premier conference covering the commercial deployment of deep learning. The event’s mission is to foster breakthroughs in the value-driven operationalization of established deep learning methods. DLW runs in parallel with the established Predictive Analytics World for Industry 4.0 at the same venue. Combo passes are available.



How to turn Deep Tech into Broad Application

The hype is over: deep learning has entered the “trough of disillusionment”. Companies are realizing that not every business problem requires the deep learning hammer. Of course, there are use cases that are best solved with artificial neural networks: image, speech and text recognition; anomaly detection and predictive maintenance on sensor data; complex data synthesis and sampling; reinforcement and sparse learning; and many more applications show the potential of artificial intelligence for real-world business scenarios. At the Deep Learning World conference, data science experts present projects that went beyond experimentation and prototyping and showcase solutions that created economic value for their companies. The case study sessions will focus on what worked and what didn’t, while the deep dive sessions will explain topics such as RNNs, CNNs, LSTMs, transfer learning and more in analytical and technical detail. Meet the European deep learning community in May in Munich and learn from well-known industry leaders!


Deep-data-mining.com blog readers receive 15% discount with code: DDMPAWDLW

Predictive Analytics World for Industry 4.0


6-7 May, 2019 – Munich
Predictive Analytics World is the leading vendor-independent conference for applied machine learning for Industry 4.0.
Business users, decision makers and experts in predictive analytics will meet on 6-7 May 2019 in Munich to discover and discuss the latest trends and technologies in machine and deep learning for the era of the Internet of Things and artificial intelligence.

Putting Machine Intelligence into Production

Smart Factory, Smart Supply Chain, Smart Grid, Smart Transport: artificial intelligence promises an intelligent and fully automated future, but the reality is that most machines, vehicles and grids lack sensors, and even where sensors do exist they might not be connected to the Internet of Things. Many companies have invested in their infrastructure and are experimenting with prototypes, e.g. for predictive maintenance, dynamic replenishment, route optimization and more, but even if they have succeeded in delivering a proof of concept, they face the challenge of deploying their predictive models into production and scaling their analytics solutions to company-wide adoption. The issues are not merely analytical but a combination of technical, organisational, judicial and economic details. At Predictive Analytics World for Industry 4.0, experienced data scientists and business decision makers from a wide variety of industries will meet for two days to demonstrate and discuss dozens of real-world case studies from well-known industry leaders. In addition, predictive analytics experts will explore new methods and tools in detail in special deep dive sessions. Finally, Predictive Analytics World is accompanied by the Deep Learning World conference, which focuses on the industry and business application of neural networks. Take the chance, learn from the experts and meet your industry peers in Munich in May!
 Hot topics on the 2019 Agenda:
  • Predictive Maintenance & Logistics
  • Anomaly Detection & Root Cause Analysis
  • Fault Prediction & Failure Detection
  • Risk Management & Prevention
  • Route & Stock Optimization
  • Industry & Supply Chain Analytics
  • Image & Video Recognition
  • Internet of Things & Smart Devices
  • Stream Mining & Edge Analytics
  • Machine Learning, Ensemble Learning & Deep Learning
  • Process Mining & Network Analyses
  • Mining Open & Earth Observation Data
  • Edge Analytics & Federated Learning
… and many more related topics

Predictive Analytics World for Industry 4.0 will be co-located with Deep Learning World, the premier conference covering the commercial deployment of deep learning in 2019. Deep-data-mining.com blog readers receive 15% discount with code: DDMPAWDLW


Tuesday, January 08, 2019

Analytics & AI in Travel North America


Analytics & AI in Travel North America, launched by EyeForTravel, will take place on March 14-15 at the Hilton Parc 55 Hotel, San Francisco, USA. With over 350 senior data, analytics, pricing, product development and digital marketing experts from the world’s leading travel companies attending, the event will explore the strategies brands can use to address the biggest opportunity right now – how to conquer hyper-personalization.
Confirmed speakers include Hilton’s SVP of Analytics, Google’s head of AI global product partnerships, Expedia’s Head of Platform – Loyalty, Wyndham Hotel Group’s Vice President of Global Revenue Management Operations and Sales, Carlson Wagonlit’s Principal Data Scientist, and many more.

Attendees can expect to explore insights into the following:

• Harnessing AI and Data to Transform your Loyalty Strategy: Discover how weaving AI into your business, capturing preference data and delivering a truly personalized service will give you the edge in winning loyal customers from your competition

• Overcoming Pricing Peril with Personalized & Real-Time Revenue Generation Tactics: Make the shift to real-time pricing on an individual level, nail down the use cases of how to overcome this, forecast like a pro and optimize direct revenue

• Getting Up Close and Personal with the Customer and Capitalize on Every Channel: Use AI to fuel CRM and CS to bring customer data to life at every touchpoint, use the rich and famous on social to avoid brand erosion and secure market share.

• Immersing Yourself in an AI-driven Predictive Future to Seize New Profits: It all comes down to being predictive if you want to turn new profits. Deliver AI-led futures in your company for more efficient internal mechanics and travel customer-centricity

• Driving Real-Time, Hyper-Personalization to Move Your Profit Needle: Delve into new levels of granularity, become the Amazon of travel and deliver the perfect travel itinerary every time for unstoppable loyalty

• Seizing Voice, AR and VR Makes You Grab that Conversion: Be part of the lucky few that benefits from voice enabled search, drive direct bookings and use AR and VR to give your customer the confidence to convert

• Dominating Direct Bookings Through A Mastery of Mobile: Create an AI-enabled mobile product that drives direct bookings, focus on UI and UX that screams out loyalty and bolster your bottom line

• Outclassing your Competition with Total RM and Surge Ancillary Sales: Build state-of-the-art infrastructure that supports ancillary revenue and squeeze every ounce of profit from all revenue streams


Thursday, December 20, 2018

Could Not Connect to Amazon RDS Oracle Database From Car Dealer WiFi

I was trying to connect to my Oracle database on Amazon RDS from a car dealership while my car was in service. My laptop was connected to the dealer's public WiFi. When I tried to connect to the Oracle server, I got "Error Message = IO Error: The Network Adapter could not establish the connection".

I realized the issue was caused by the new IP address not being included in the Amazon security group's inbound rules. I found my IP address, logged onto the Amazon AWS console and found the security group associated with the DB instance. After I added an inbound rule for 64.188.5.xxx/32, I was able to connect to the database immediately.

Wednesday, December 19, 2018

Statistically Manufactured Personal Data

To avoid the trouble of dealing with personal data when we test our analytics processes, I have created mock personal data that closely reflects the American population from a statistical point of view. The largest data set has 1 million records with variables including first name, last name, sex, date of birth, social security number, address, phone number and email. The values of these variables are produced to be as realistic as possible relative to the real American population. The records represent about 0.33% of the population of the United States.

The following observations show that the 1 million mock personal data records are very close to the real statistics of the US population.

1. The top 4 states with the most people are: California (138,223 persons, 13.82%), Texas (99,217 persons, 9.92%), Florida (69,640 persons, 6.96%) and New York (49,979 persons, 5%). These are close to the real distribution of the population in the USA.
2. Females are 51% and males are 49%.
3. The top 3 last names are Smith (10,800 persons, 1.08%), Williams (8,000 persons, 0.8%) and Jones (6,900 persons, 0.69%).
4. The top 3 female first names are Ava (4,707 persons, 0.93%), Olivia (4,508 persons, 0.89%) and Isabella (4,311 persons, 0.85%), and the top 3 male first names are Noah (5,075 persons, 1.03%), Elijah (4,736 persons, 0.96%) and Liam (4,434 persons, 0.9%).
5. The following table shows the distribution of persons by age group for both sexes. Women live longer than men.
Age Group            Female #   Female %   Male #   Male %
Under 5 years          34603      6.81%    35656     7.25%
5 to 9 years           34707      6.83%    34010     6.92%
10 to 14 years         30192      5.94%    33013     6.72%
15 to 19 years         34361      6.76%    32689     6.65%
20 to 24 years         32512      6.39%    36647     7.45%
25 to 29 years         35626      7.01%    37278     7.58%
30 to 34 years         34344      6.76%    31977     6.50%
35 to 39 years         33325      6.55%    31927     6.49%
40 to 44 years         33332      6.56%    34456     7.01%
45 to 49 years         35070      6.90%    35443     7.21%
50 to 54 years         37321      7.34%    34876     7.09%
55 to 59 years         31623      6.22%    31315     6.37%
60 to 64 years         28801      5.67%    24218     4.93%
65 to 69 years         20999      4.13%    19881     4.04%
70 to 74 years         16617      3.27%    14065     2.86%
75 to 79 years         13520      2.66%    10272     2.09%
80 to 84 years         10693      2.10%     7983     1.62%
85 years and over      10754      2.12%     5894     1.20%
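These distribution checks are easy to reproduce once a file is loaded into Oracle. The following is a minimal sketch, assuming the CSV has been loaded into a hypothetical table dm_mock_person with columns state, sex and last_name (the table and column names are my assumption; FETCH FIRST requires Oracle 12c or later).
-- share of records by state (top 4)
select state,
       count(*) as persons,
       round(100 * ratio_to_report(count(*)) over (), 2) as pct
from   dm_mock_person
group  by state
order  by persons desc
fetch first 4 rows only;

-- most frequent last name
select stats_mode(last_name) as top_last_name from dm_mock_person;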
You may download a small file with 100 records for free here. Files with 5K, 50K, 250K and 1 million records are available for purchase at https://www.datamanufacturer.com.
File Name                  Description                                                       Price
dm_mock_person_100.csv     100 mock personal data records. CSV format.                       Free
dm_mock_person_5k.csv      5K mock personal data records. About 0.7 MB. CSV format.          $2.95
dm_mock_person_50k.csv     50K mock personal data records. About 7 MB. CSV format.           $7.95
dm_mock_person_250k.csv    250K mock personal data records. About 35 MB. CSV format.         $9.95
dm_mock_person_1m.csv      1 million mock personal data records. About 140 MB. CSV format.   $39.95

Tuesday, December 18, 2018

Generate Random String in Oracle

The following query generates a random email address.
SQL> select dbms_random.string('l', 8)||'@'||dbms_random.string('l', 7)||'.com' 
        email from dual;

EMAIL
------------------------------------------------------------------------------------
irslsxrf@wltikyv.com
The first parameter 'l' means the string will be created with lowercase letters.
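For completeness (this is my addition, not part of the original post), the first parameter also accepts 'u' for uppercase letters, 'a' for mixed-case letters, 'x' for uppercase alphanumeric characters and 'p' for any printable characters. For example:
select dbms_random.string('u', 8) as upper_str,
       dbms_random.string('x', 8) as alnum_str,
       dbms_random.string('a', 8) as mixed_str
from dual;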

Monday, December 17, 2018

Find the Most Frequent Values

To find the most frequent values, we can use the STATS_MODE function. The following query shows the areacode values in the state of Missouri.
SQL> select areacode from T_PHONE_AREA where state='Missouri' order by 1;
  AREACODE
----------
       314
       314
       314
       314
       314
       314
       314
       314
       314
       314
       314
       314
       417
       417
       573
       573
       573
       636
       636
       636
       636
       636
       636
       660
       816
       816
       816
       816
       816
       816
       816
       816
       816
       816

34 rows selected.
In the following query, stats_mode(areacode) returns 314, which is the most frequent value.
SQL> select stats_mode(areacode) from T_PHONE_AREA where state='Missouri';

STATS_MODE(AREACODE)
--------------------
                 314
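One caveat worth noting (my addition): when several values tie for the highest frequency, STATS_MODE returns only one of them. A sketch that lists every tied mode against the same table:
select areacode, count(*) as cnt
from   T_PHONE_AREA
where  state = 'Missouri'
group  by areacode
having count(*) = (select max(count(*))
                   from   T_PHONE_AREA
                   where  state = 'Missouri'
                   group  by areacode);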

Remove the Last Word From a String Using Oracle SQL

I use the following query to remove the last word of a sentence.
with tbl as (select 'Boston city' as name  from dual)
select  name, substr(name, 1, instr(name,' ',-1)-1 ) simple_name  from tbl;

NAME        SIMPLE_NAME
----------- -----------
Boston city Boston     
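An alternative I find handy (my own variant, not part of the original post) is a regular expression that strips the final space-delimited word. Unlike the substr/instr version, which returns NULL when the string contains no space, regexp_replace simply returns the string unchanged in that case.
with tbl as (select 'Boston city' as name from dual)
select name, regexp_replace(name, '\s\S+$') simple_name from tbl;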

Find Out Table Columns That Are Indexed

I use the following query to find out which columns of a table are indexed.
select index_name, column_name from USER_IND_COLUMNS where table_name='MYTABLE'
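A slightly extended sketch (my addition) also reports whether each index is unique and keeps composite-index columns in their defined order:
select c.index_name, i.uniqueness, c.column_name, c.column_position
from   user_ind_columns c
       join user_indexes i on i.index_name = c.index_name
where  c.table_name = 'MYTABLE'
order  by c.index_name, c.column_position;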

Oracle Function Based Index is Handy

In table t_my_table, the column name is stored in lowercase. However, I want to join this table with another table where names are in uppercase. I create a function-based index.
create index t_my_tablei on t_my_table(upper(name));
That way, I don't need to create another column or table containing upper(name) and index it. When I join the two tables on upper(a.name) = b.name, the function-based index on upper(a.name) is used and the join is fast.
select a.*, b.* from t_my_table a, my_another_table b where upper(a.name) = b.name;

Sunday, December 16, 2018

Amazon RDS Oracle Instance Running Out of Disk Space

My Oracle database instance on Amazon RDS ran out of disk space. I added more by modifying the instance and allocating extra storage. This is the link to the instructions.
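Before resizing, it can help to see where the space went. A minimal sketch (my addition; on RDS the master user normally has access to these dictionary views):
select tablespace_name,
       round(sum(bytes) / 1024 / 1024) as used_mb
from   dba_segments
group  by tablespace_name
order  by used_mb desc;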

Saturday, December 15, 2018

Roulette Wheel Selection Using SQL

Roulette wheel selection is a very useful algorithm found in many applications such as the genetic algorithm (GA). In a GA, solutions with higher fitness values are given larger probabilities of being selected to produce children, just as in natural evolution. I implemented an Oracle SQL version of the roulette wheel selection algorithm.
The first step is to calculate, for each record, the cumulative value of the variable that the selection will be based on, such as a fitness function or probability. I used the sum() over(order by) analytic function. Make sure the "order by" uses a unique key so that the cumulative value is also unique.
 create table tbl as select id, num, sum(num) over(order by id) as cum_count
  from t_mydata;

The following is the roulette wheel selection script.
create table t_rw(sel number);

declare
  mx number;
  rnd  number;
  x number;
begin
   select max(cum_count) into mx from tbl;
  for i in 1..10000 loop
     execute immediate 'select ora_hash(:1,:2) from dual '
          into rnd using i, mx;
     select min(cum_count) into x from tbl where cum_count >= rnd;
     insert into t_rw(sel) values(x);
     end loop;
end;
/

create view v_selected as select a.* from tbl a, t_rw b where a.cum_count=b.sel;

In the above script, ora_hash(i, mx) generates an approximately uniformly distributed value between 0 and the maximum cum_count. The selected cum_count is inserted into t_rw. The final result is the view v_selected, which is based on the inner join of tables tbl and t_rw.
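As a side note, the same selection can be written without PL/SQL. The following is a set-based sketch (my addition) that draws 10,000 uniform random numbers with dbms_random.value and maps each one to the first record whose cumulative count covers it:
create table t_rw2 as
select (select min(cum_count) from tbl where cum_count >= r.rnd) as sel
from  (select dbms_random.value(0, (select max(cum_count) from tbl)) as rnd
       from   dual
       connect by level <= 10000) r;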

Tuesday, December 04, 2018

Generate Serial Number for Existing Table

Table T_REVIEW has column id as the unique key and dt as the timestamp. I want to add a serial number to the table based on the ordering of dt and id. In the following script, I use the window function row_number() to generate the serial number and update the table.
alter table t_review add(seq number);

update t_review a set seq= (
with tbl as (select id, row_number() over(order by dt, id) rnk from t_review)
select rnk from tbl b 
where b.id=a.id
);
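The same update can also be written as a single MERGE statement, which avoids running the correlated subquery once per row; a sketch (my addition):
merge into t_review a
using (select id, row_number() over (order by dt, id) rnk from t_review) b
on (a.id = b.id)
when matched then update set a.seq = b.rnk;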

Incrementally Add New Records to Table

I have a table t_review that stores historical records, including the key sid and the timestamp dt. Every day, more records come into table t_new. I use the following script to add to t_review those records in t_new, identified by sid, that are not already in t_review.
begin
insert into t_review(sid, dt) select sid, sysdate from 
 (select sid from t_new minus select sid from t_review);
commit;
end;
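An equivalent formulation (my addition) uses NOT EXISTS, which makes the intent explicit; the DISTINCT mirrors the de-duplication that MINUS performs.
insert into t_review(sid, dt)
select distinct n.sid, sysdate
from   t_new n
where  not exists (select 1 from t_review r where r.sid = n.sid);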

Friday, November 30, 2018

Load Text Into Oracle Table CLOB Column Using Python Part 2

The texts in some records are too long, e.g., more than 22K characters. This is inconvenient when we try to edit them online. I use chunkstring(line2, 900) to divide the text into multiple strings of a fixed length of 900 characters and store them in multiple records. uuid.uuid1().hex is used to generate a universally unique ID.
import cx_Oracle
import uuid
def chunkstring(string, length):
    return (string[0+i:length+i] for i in range(0, len(string), length))
connection = cx_Oracle.connect("xxxxx","xxxx",\
                               "golden.maxapex.net:1521/xxxx",\
                               encoding = "UTF-8", nencoding = "UTF-8")
cur = connection.cursor()
cur.setinputsizes(T1 = cx_Oracle.CLOB)  # bind the text chunk column as a CLOB
f = open('id_baike2.txt',  \
        encoding="utf8")
line1 = f.readline()
line2 = f.readline()
while line2:
    vuid = uuid.uuid1().hex
    data_list = list(chunkstring(line2,900))
    for idx, v in enumerate(data_list):
        cur.execute("INSERT INTO t_text_clob_m (ID, SID, SEQ, T1)  \
               VALUES (:ID, :SID, :SEQ, :T1)",\
                    ID = int(line1), SID = vuid, SEQ = idx+1, T1 = v)
    # create table t_text_clob_m (id number, sid raw(32), seq number, t1 clob);
    line1 = f.readline()
    line2 = f.readline()
f.close()
connection.commit()
cur.close()
connection.close()
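After the load, a quick sanity check in SQL (my addition) shows how many 900-character pieces each document was split into and the total length stored:
select id,
       count(*)        as chunk_count,
       sum(length(t1)) as total_chars
from   t_text_clob_m
group  by id
order  by total_chars desc;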

Thursday, November 29, 2018

Load Text Into Oracle Table CLOB Column Using Python

A text file contains Chinese texts that describe close to 4 thousand small towns. The ID of a town is always on the first line, followed by the text. The text can be as long as 22K characters. Thus, we have to use a CLOB column to store these texts.
1
宁明县爱店镇成立于1992年3月,位于中越陆地边境公母山下,与越南谅山省禄平县接壤,
边境线长达25.5公里,...
2
林逢镇位居田东县城东部,地处右江河谷中心地带、大西南出海通道咽喉要地,东经南宁可达北海、
防城及广东沿海各地,...
From SQL Developer client, I connect to the Oracle database and create a table with a CLOB column to store the text.
-- create a table to hold texts
create table t_text (id number, val clob);
I load the file into the table on the Amazon RDS Oracle database using the following script.
import cx_Oracle
connection = cx_Oracle.connect("xxxxx","xxx",\
                               "orcl.xxxxxx.us-east-1.rds.amazonaws.com:1521/orcl",\
                               encoding = "UTF-8", nencoding = "UTF-8")
cur = connection.cursor()
cur.setinputsizes(VAL = cx_Oracle.CLOB)
f = open('id_baike2.txt',  encoding="utf8")
line1 = f.readline()  # town ID
line2 = f.readline()  # town description text
while line2:
    cur.execute("INSERT INTO t_text (ID, VAL) VALUES (:ID, :VAL)", \
                ID=int(line1),VAL=line2)
    line1 = f.readline()
    line2 = f.readline()
f.close()
connection.commit()
cur.close()
connection.close()
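A simple verification query (my addition) run from SQL Developer confirms the row count and the range of text lengths after the load:
select count(*)                     as towns,
       min(dbms_lob.getlength(val)) as min_len,
       max(dbms_lob.getlength(val)) as max_len
from   t_text;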

Connect to Amazon RDS Oracle Database Using Python

The following script connects to an Oracle database and executes queries.
import cx_Oracle
connection = cx_Oracle.connect("aisxxx","xxxxxxx",\
              "orcl.ctbxxxxxxx.us-east-1.rds.amazonaws.com:1521/orcl")
# username, password, host:port number/service name
cur = connection.cursor()
# query table
cur.execute("select table_name, tablespace_name from user_tables order by 1")
for each in cur.description:
    print(each[0:2])
for col_1, col_2 in cur.fetchall():
    print(col_1 + " " + col_2)
cur.close()
connection.close()
The output is the following.
('TABLE_NAME', )
('TABLESPACE_NAME', )
T_TEST USERS

Wednesday, November 28, 2018

Find All Unique Characters In a Text File

There is a file containing 7,288 lines of Chinese text (see below). I want to find out all the characters that appear in the text.
$ head -2 id_baike2.txt
1
宁明县爱店镇成立于1992年3月,位于中越陆地边境公母山下,与越南谅山省禄平县接壤,
边境线长达25.5公里,东兴至那坡沿边公路贯穿其间。
爱店为国家一类口岸,是我国西南经济板块通往东南亚的陆路效能要道。
全镇下辖3个村委会19个自然屯,总面积65.79平方公里,耕地面积7225亩,
粮食播种面积2700亩,总人8055口人,其中流动人口2000余人。
爱店镇1995年被国家建设部定为全国500家小城镇建设试点之一,
2002年被定为全区小城镇建设重点镇。城镇各种基础设施日臻完善。
镇区主要街道全部实现硬化和街砖铺设。
2006年以来,爱店镇结合崇左市开展市容环境综合整治竞赛活动和城乡清洁工程,
与有关部门共同筹措资金共561.7万元投入小城镇建设,
不断扩大城镇规模和完善城镇功能,搞好环境卫生,提高口岸服务水平,
树立良好国门形象。著名的“金牛潭”风景区,
常年流水潺潺,怪石嶙峋,古木参天,景色迷人。
历代的文人名士曾在潭边题字留墨,常年吸引大批游客前来观光旅游。
镇内公母山海拔1358米,山奇水秀,景色宜人。主要有金牛潭生态旅游度假村、
公母山庄、爱店起义纪念碑、

I use the following Linux command to accomplish this.
cat  id_baike2.txt | awk -F "" '{for (i=1;i <= NF;i++) print $i;}' | sort | 
uniq -c > char_list.txt

The following shows a partial result. The number on the left is the frequency with which the character appears in the file.
      3  
  50033  
     94 !
   3773 "
     47 #
    124 $
  23974 %
     56 &
    422 '
   7333 (
   7245 )
    101 *
    672 +
   7558 ,
   9437 -
  64329 .
   1700 /
 267266 0
 239610 1
 201818 2
 108849 3
  91833 4
 111514 5
  86085 6
  69366 7
  83533 8
 100306 9
   1309 :
    758 ;
     25 <
I also use the following Python script to do the same calculation.
import collections
f = open('id_baike2.txt',  encoding="utf8")
lines = f.readlines()
lines_joined =  ' '.join(lines)
frq = collections.Counter(lines_joined)
frq.most_common(100)
f.close()

Thursday, June 28, 2018

Introducing New Book: Principles of Database Management – The Practical Guide to Storing, Managing and Analyzing Big and Small Data


Deep-Data-Mining is pleased to introduce the new book, Principles of Database Management – The Practical Guide to Storing, Managing and Analyzing Big and Small Data, by Lemahieu W., vanden Broucke S. and Baesens B. (ISBN: 9781107186125). The following is an interview with the authors about the book.




Why did you decide to write your book Principles of Database Management?

This book is the result of having taught an undergraduate database management class and a postgraduate advanced database management class for more than ten years at KU Leuven (Belgium).  Throughout these years, we have found no good textbook which covers the material in a comprehensive way without getting flooded by theoretical detail and losing focus.  Hence, after having teamed up together, we decided to start writing a book ourselves.  This work aims to offer a complete and practical guide covering all the governing principles of database management, including:
  • End-to-end coverage, from legacy technologies to emerging trends such as Big Data, NoSQL databases, analytics, data governance, etc.
  • A unique perspective on how lessons learnt from past data management could be relevant in today’s technology setting (e.g., navigational access and its perils in Codasyl and XML/OO databases)
  • A critical reflection and accompanying risk management considerations when implementing the technologies considered, based on our own experiences from participating in data and analytics related projects with industry partners in a variety of sectors, from banking to retail and from government to the cultural sector
  • A solid balance between theory and practice, including various exercises, industry examples and case studies originating from a diversified and complementary business practice, scientific research and academic teaching experience


We hear a lot of companies complaining about bad data quality these days.  How can database management contribute to this?
First of all, data quality (DQ) is often defined as ‘fitness for use,’ which implies the relative nature of the concept.  Data of acceptable quality in one decision context may be perceived to be of poor quality in another decision context, even by the same business user.  For instance, the extent to which data is required to be complete for accounting tasks may not be required for analytical sales prediction tasks.  Database management can contribute in various ways to improving data quality. 
A good conceptual data model capturing the business requirements as accurately as possible is the start of everything.  As discussed in the book, both EER and UML can be used for this purpose.  It is important to also list the semantic shortcomings of the conceptual models developed such that they can be followed up during application development. 
A next important activity concerns metadata management.  Just as raw data, also metadata is data that needs to be properly modeled, stored and managed.  Hence, the concepts of data modeling should also be applied to metadata in a transparent way.  In a DBMS approach, metadata is stored in a catalog, sometimes also called data dictionary or data repository, which constitutes the heart of the database system. 
Finally, to manage and safeguard data quality, a data governance culture should be put in place assigning clear roles and responsibilities. The ultimate aim of data governance is to set up a company-wide controlled and supported approach towards data quality, accompanied by data quality management processes. The core idea is to manage data as an asset rather than a liability, and adopt a proactive attitude towards data quality problems. To succeed, it should be a key element of a company’s corporate governance and supported by senior management.

Should all companies start to invest in NoSQL databases?

The explosion of popularity of NoSQL databases should be put in perspective considering their limitations. Most NoSQL implementations have yet to prove their true worth in the field (most are very young and in development). Most implementations sacrifice ACID (atomicity, consistency, isolation and durability) concerns in favor of being eventually consistent, and the lack of relational support makes expressing some queries or aggregations particularly difficult, with map-reduce interfaces being offered as a possible, but harder to learn and use, alternative. Combined with the fact that RDBMSs do provide strong support for transactionality, durability and manageability, quite a few early adopters of NoSQL were confronted with some sour lessons. It would be an over-simplification to reduce the choice between RDBMSs and NoSQL databases to a choice between consistency and integrity on the one hand, and scalability and flexibility on the other. The market of NoSQL systems is far too diverse for that. Still, this tradeoff will often come into play when deciding on taking the NoSQL route. We see many NoSQL vendors focusing again on robustness and durability. We also observe traditional RDBMS vendors implementing features that let you build schema-free, scalable data stores inside a traditional RDBMS, capable of storing nested, semi-structured documents, as this seems to remain the true selling point of most NoSQL databases, especially those in the document store category. Expect the future trend to continue towards adoption of such “blended systems”, except for use cases that require specialized, niche database management systems. In these settings, the NoSQL movement has rightly taught users that the one-size-fits-all mentality of relational systems is no longer applicable and should be replaced by finding the right tool for the job. For instance, graph databases arise as being “hyper-relational” databases, which make relations first-class citizens next to records themselves rather than doing away with them altogether.

Many companies are investing in data lakes these days.  What is the difference with a data warehouse?

Much more recent than data warehouses, the data lake concept became known as part of the big data and analytics trend.  Although both data warehouses and data lakes are essentially data repositories, there are some clear differences as listed in the table below.


                  Data Warehouse            Data Lake
Data              Structured                Often unstructured
Processing        Schema-on-write           Schema-on-read
Storage           Expensive                 Low cost
Transformation    Before entering the DW    Before analysis
Agility           Low                       High
Security          Mature                    Maturing
Users             Decision makers           Data scientists

A key distinguishing property of a data lake is that it stores raw data in its native format, which could be structured, unstructured or semi-structured.  This makes data lakes fit for more exotic and ‘bulk’ data types that we generally do not find in data warehouses, such as social media feeds, clickstreams, server logs, sensor data, etc.  A data lake collects data emanating from operational sources ‘as is’, often without knowing upfront which analyses will be performed on it, or even whether the data will ever be involved in analysis at all.  For this reason, either no or only very limited transformations (formatting, cleansing, …) are performed on the data before it enters the data lake.  Consequently, when the data is tapped from the data lake to be analyzed, quite a bit of processing will typically be required before it is fit for analysis.  The data schema definitions are only determined when the data is read (schema-on-read) instead of when the data is loaded (schema-on-write) as is the case for a data warehouse.  Storage costs for data lakes are also relatively low because most of the implementations are open-source solutions that can be easily installed on low-cost commodity hardware.  Since a data warehouse assumes a predefined structure, it is less agile compared to a data lake which has no structure.  Also, data warehouses have been around for quite some time already, which automatically implies that their security facilities are more mature.  Finally, in terms of users, a data warehouse is targeted towards decision makers at middle and top management level, whereas a data lake requires a data scientist, which is a more specialized profile in terms of data handling and analysis.

How do Object Oriented DBMSs (OODBMSs) compare against Relational Databases (RDBMSs)? Why haven’t they been that successful?

OODBMSs offer several advantages.  First, they store complex objects and relationships in a transparent way.  The identity based approach allows for improved performance when performing complex queries involving multiple interrelated objects, avoiding expensive joins.  By using the same data model as the programming language to develop database applications, the impedance mismatch problem is no longer an issue.  In addition, developers can be more productive, as they are confronted with only a single language and data model.
Still, the success of OODBMSs has been limited to niche applications characterized by complex, nested data structures where an identity-based, instead of a value-based, method of working pays off.  An example is the processing of scientific data sets by CERN in Switzerland, where data access follows predictable patterns.  The widespread use and performance of RDBMSs, however, proved hard to displace: the (ad-hoc) query formulation and optimization procedures of OODBMSs are often inferior to relational databases, which all adopt SQL as their primary database language combined with a powerful query optimizer.  When compared to RDBMSs, OODBMSs are not that developed in terms of robustness, security, scalability and fault-tolerance.  They also don’t provide a transparent implementation of the three-layer database architecture.  More specifically, most OODBMSs provide no support for defining external database models, such as views in the relational model. 

Why are so many companies still struggling with database legacy instead of investing in newer technologies?

Many firms still struggle with legacy databases due to historical implementations and limited IT budgets.  Hence, knowing the basic characteristics thereof is essential to the maintenance of the corresponding database applications and the potential migration to modern DBMSs.  Second, understanding the basics of these legacy models will contribute to better understanding the semantical richness of newer database technologies.  Finally, the concept of procedural DML and navigational access originally introduced by these legacy models has also been adopted by more recent databases such as OODBMSs. 

This book covers a few decades of evolution in the database world. Which developments or patterns struck you as the most remarkable in the long run?
The perseverance of SQL is really striking.  It survived object-oriented databases and XML databases, and after a while, many of its constructs were even incorporated in these paradigms.  More recently, there is considerable irony (and beauty) in the fact that one of the main tendencies of NoSQL development efforts is to mimic the convenient data retrieval and query facilities of, well, SQL-based databases.  Solutions for ‘SQL on top of NoSQL’ are all over the place.  And let’s not forget: SQL still remains a beautiful language.

Is there any additional material provided (e.g., for instructors or students)?
Yes, the book comes with the following additional material:
  • A website with additional information: www.pdbmbook.com
  • Free YouTube lectures for each of the 20 chapters, see https://www.youtube.com/watch?v=o36Z_OqC2ac&list=PLdQddgMBv5zHcEN9RrhADq3CBColhY2hl
  • PowerPoint slides for each of the 20 chapters, see http://www.pdbmbook.com/lecturers
  • A solutions manual with the solutions to all multiple choice and open questions.
  • An online playground with diverse environments, including MySQL for querying; MongoDB; Neo4j Cypher; and a tree structure visualization environment
  • Full-color illustrations throughout the text
  • Extensive coverage of important trending topics, including data warehousing, business intelligence, data integration, data quality, data governance, Big Data, and analytics
  • Hundreds of examples to illustrate and clarify the concepts discussed that can be reproduced on the book’s companion online playground
  • Case studies, review questions, problems, and exercises in every chapter