Deep Data Mining Blog: December 2018

Thursday, December 20, 2018

Could Not Connect to Amazon RDS Oracle Database From Car Dealer WiFi

I am trying to connect to my Oracle database on Amazon RDS at a car dealer while my car is in service. My laptop is connecting to the public WiFi. When I try to connect to the Oracle server, I got "Error Message = IO Error: The Network Adapter could not establish the connection".

I realized the issue is caused by the new ip address not included Amazon Security Group inbound rules. I find my ip address. Then I log onto Amazon AWS console and find the security group associated with the DB instance. After I add a inbound rule 64.188.5.xxx/32, I am able to connect to the DB immediately.

Wednesday, December 19, 2018

Statistically Manufactured Personal Data

To avoid the trouble of dealing with personal data when we test our analytics processes, I have created mock personal data that closely reflect American population from statistical point of view. The largest data set has 1 million records with variables including first name, last name, sex, date of birth, social security number, address, phone number and email. The values of these variables are produced to be as realistic as possible to real American population. They represents about 0.33% of population in the United States.

These observations about the data 1 million mock personal data records are very close to the real statistics of the population in USA.

1.The top 4 states that have the most people are: California(138223 persons, %13.82), Texas(99217 persons, %9.92), Florida(69640 persons, %6.96) and New York(49979 persons, %5). These are close to the real distribution of the population in USA.
2. The female are 51% and the male are 49%.
3. Top 3 last names are Smith(10800 persons, %1.08), Williams(8000 persons, %.8) and Jones(6900 persons, %.69).
4. Top 3 female first names are Ava(4707 persons, %.93), Olivia(4508 persons, %.89) and Isabella(4311 persons, %.85) and top 3 male first names are Noah(5075 persons, %1.03), Elijah(4736 persons, %.96) and Liam(4434 persons, %.9).
5. The following table shows distributions of persons by age for both sexes. Women live longer than men.

                        Female           Male
Age Group        #        %       #  % 
   .Under 5 years 34603 6.81% 35656 7.25%
   .5 to 9 years 34707 6.83% 34010 6.92%
   .10 to 14 years 30192 5.94% 33013 6.72%
   .15 to 19 years 34361 6.76% 32689 6.65%
   .20 to 24 years 32512 6.39% 36647 7.45%
   .25 to 29 years 35626 7.01% 37278 7.58%
   .30 to 34 years 34344 6.76% 31977 6.50%
   .35 to 39 years 33325 6.55% 31927 6.49%
   .40 to 44 years 33332 6.56% 34456 7.01%
   .45 to 49 years 35070 6.90% 35443 7.21%
   .50 to 54 years 37321 7.34% 34876 7.09%
   .55 to 59 years 31623 6.22% 31315 6.37%
   .60 to 64 years 28801 5.67% 24218 4.93%
   .65 to 69 years 20999 4.13% 19881 4.04%
   .70 to 74 years 16617 3.27% 14065 2.86%
   .75 to 79 years 13520 2.66% 10272 2.09%
   .80 to 84 years 10693 2.10% 7983 1.62%
   .85 years and over 10754 2.12% 5894 1.20%

You may download a small file with 100 records free here. Free Download. Files with 5k, 50K, 250K and 1 million records are available for purchase at https://www.datamanufacturer.com.

File Name	Description	Price	Buy
dm_mock_person_100.csv	100 mock personal data records. CSV format.	free	Free Download
dm_mock_person_5k.csv	5K mock personal data records. About 0.7M bytes. CSV format.	$2.95
dm_mock_person_50k.csv	50K mock personal data records. About 7M bytes. CSV format.	$7.95
dm_mock_person_250k.csv	250K mock personal data records. About 35M bytes. CSV format.	$9.95
dm_mock_person_1m.csv	1 million mock personal data records. About 140M bytes. CSV format.	$39.95

Tuesday, December 18, 2018

Generate Random String in Oracle

The following query generate random email address.

SQL> select dbms_random.string('l', 8)||'@'||dbms_random.string('l', 7)||'.com' 
        email from dual;

EMAIL
------------------------------------------------------------------------------------
irslsxrf@wltikyv.com

The first parameter 'l' means string will be created in lower cases.

Monday, December 17, 2018

Find the Most Frequent Values

To find the most frequent values, we can use STATS_MODE function. The following query shows areacode in state Missouri.

SQL> select areacode from T_PHONE_AREA where state='Missouri' order by 1;
  AREACODE
----------
       314
       314
       314
       314
       314
       314
       314
       314
       314
       314
       314
       314
       417
       417
       573
       573
       573
       636
       636
       636
       636
       636
       636
       660
       816
       816
       816
       816
       816
       816
       816
       816
       816
       816

34 rows selected.

In the following query, stats_mode(areacode) returns the areacode 314 that is the most frequent value.

SQL> select stats_mode(areacode) from T_PHONE_AREA where state='Missouri';

STATS_MODE(AREACODE)
--------------------
                 314

Remove the Last Word From a String Using Oracle SQL

I use the following query to remove the last word of a sentence.

with tbl as (select 'Boston city' as name  from dual)
select  name, substr(name, 1, instr(name,' ',-1)-1 ) simple_name  from tbl;

NAME        SIMPLE_NAME
----------- -----------
Boston city Boston

Find Out Table Columns That Are Indexed

I use the following query to find out columns that are indexed for a table.

select index_name, column_name from USER_IND_COLUMNS where table_name='MYTABLE'

Oracle Function Based Index is Handy

In table t_my_table, column name is in lower case. However, I want to join this table with another table where names are in upper cases. I create a function based index.

create index t_my_tablei on t_my_table(upper(name));

That way, I don't to create another column or table that contains upper(name) and create index on it. When I join the two tables based on upper(a.name) = b.name, function based index upper(a.name) is used and it is fast.

select a.*, b.* from t_my_table a, my_another_table b where upper(a.name) = b.name;

Sunday, December 16, 2018

Amazon RDS Oracle Instance Running Out of Disc Space

My Oracle database instance on Amazon RDS runs out of disc space. I add more of them by modifying the instance and add extra disc space. This is the link to instructions.

Saturday, December 15, 2018

Roulette Wheel Selection Using SQL

Roulette wheel selection is a very useful algorithm found in many applications such as Genetic Algorithm(GA). In GA solutions with higher fitness values are given larger probabilities of being selected to produce children, just like natural evolution. I implemented an Oracle SQL version of the Roulette wheel selection algorithm.
The first step is to calculate for each record the cumulative value for the variable that the selection will be based on, such as fitness function, probability or other. I used sum() over(order by) analytics function. Make sure the "order by" is using a unique key so that the cumulative value is also unique.

 create table tbl as select id, num, sum(num) over(order by id) as cum_count
  from t_mydata;

The following is the roulette wheel selection scripts.

create table t_rw(sel number);

declare
  mx number;
  rnd  number;
  x number;
begin
   select max(cum_count) into mx from tbl;
  for i in 1..10000 loop
     execute immediate 'select ora_hash(:1,:2) from dual '
          into rnd using i, mx;
     select min(cum_count) into x from tbl where cum_count >= rnd;
     insert into t_rw(sel) values(x);
     end loop;
end;

create view v_selected as select a.* from tbl a, t_rw b where a.cum_count=b.sel;

In the above scripts, ora_hash() generates a uniformly distributed random number between 0 and maximum cum_count. The selected cum_count is inserted into t_rw. The final result is the view v_selected which is based on the inner join of table tbl and t_rw.

Tuesday, December 04, 2018

Generate Serial Number for Existing Table

Table T_REVIEW has column id as the unique key and dt as the timestamp. I want to add a serial number to the table based on the order by dt and id. In the following scripts, I use window function row_number to generate the serial number and update the table.

alter table t_review add(seq number);

update t_review a set seq= (
with tbl as (select id, row_number() over(order by dt, id) rnk from t_review)
select rnk from tbl b 
where b.id=a.id
);

Incrementally Add New Records to Table

I have a table t_review that stores historical records including key sid and timestamp dt. Every day, more records come into table t_new. I use the following scripts to add those new records identified by sid in t_new to t_review.

begin
insert into t_review(sid, dt) select sid, sysdate from 
 (select sid from t_new minus select sid from t_review);
commit;
end;

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics