Saturday, November 09, 2013

Analyze Text Files in Real Time Using SQL Without Loading Them into Database

Problems

A log file that records the number of impressions and number of clicks for advertisements are constantly updated (growing). The file looks like the following:

ADS_ID|DEVICE_OS|NUM_IMPRESSION|NUM_CONVERSION
32| Android| 1| 0
32| Android| 1| 0
32| Android| 1| 0
32| Android| 2| 0
32| Android| 1| 0
32| Android| 2| 0
32| Android| 1| 0
32| Android| 3| 0
32| Android| 2| 0

We want to generate real time reports about the summarized performance of advertisements, such as the click through rate by ads_id, click through rate by operation system and ads_id, etc. The reports should reflect the real time changes in the log file.

Solutions

One of the best solutions is to use an Oracle external table and view. There are two tasks to be performed.

Task 1. Define an external table on the log file. An external table is just a pointer to the location of file and definition of its format. The file itself is not loaded into the database as a permanent table. Once the external table is defined, we can query it using SQL just like regular table. The data is read by Oracle on the fly. Thus any changes in the file will be reflected on the query result.

Task 2. Define views to summarize the external table and produce reports. Since views just store the process logic and only produce the output when we query it, the content of views always reflects the latest information in the external table which in turn captures the changes in the log file.

Task 1.Define an external table.

Step 1. If not yet, we need to assign "create any directory" privilege to current user.

Log in as the system user and run the following command under sqlplus.

SQL> grant create any directory to current_user;

Step 2. We create directory. It is assumed that the Oracle database server has the access to the directory where the log file is located.

Log in as the current user and create an directory.

SQL> create directory dir_files as '/home/log/data';

Step 3. We define the external table that points to the log file.

create table tbl_data1_ext
(
ads_id number,
operation_sys varchar2(32),
num_impression number,
num_click number
)
organization external
( type oracle_loader
default directory dir_files
access parameters
( records delimited by newline
skip 1
fields terminated by '|'
missing field values are null
)
location('ads_log.txt')
);
Once it is done, we can verify if the file is define correctly.
SQL> select count(1) from TBL_DATA1_EXT;

COUNT(1)
165201

We use Linux command wc to count the number of lines in the file. The text file has one more line which is the header. The header was skipped when we define the external table.

$ wc -l ads_log.txt
165202 ads_log.txt

Task 2. Create views. Once we have the external table, we can create views to summarized it.

SQL> create view v_ctr_for_ads as select ads_id, sum(NUM_IMPRESSION) num_impression, sum(NUM_CLICK) NUM_CLICK, sum(NUM_CLICK)/sum(NUM_IMPRESSION) clr from TBL_DATA1_EXT group by ads_id;
View created.

SQL> create view v_ctr_for_ads_os as select ads_id, OPERATION_SYS, sum(NUM_IMPRESSION) num_impression, sum(NUM_CLICK) NUM_CLICK, sum(NUM_CLICK)/su m(NUM_IMPRESSION) clr from TBL_DATA1_EXT group by OPERATION_SYS, ads_id;
View created.

The user can look at views the get the summary information about the log file in real time.
SQL> select * from v_ctr_for_ads where num_impression>100 order by clr desc;

ADS_ID NUM_IMPRESSION NUM_CLICK CLR
32 4116 5 .001215

SQL> select * from v_ctr_for_ads_os where num_impression>100 order by clr desc;

ADS_ID OPERATION_SYS NUM_IMPRESSION NUM_CLICK CLR
32 Android 3360 5 .001488
32   756 0 0

Conclusions

With Oracle external tables that are just pointers to the location of text files and definition of file formats, we can perform SQL queries against text files without loading them into the database as physical database tables. It is a great solution when we want to repeatedly analyze text files that are constantly changing. Combining external tables and views, we can get query results that reflect the most recent content of text files. This solution is also very "clean" since there is no permanent database tables created.

2 comments:

Justinnen said...

Hi,

I got what I was looking for. As I'm new to this I couldn't understand completely. Please help me to understand why did you create a new directory.
( create directory dir_files as '/home/log/data';)

Unknown said...

Hi,

In my case I have to load a text file into database. But, situation is I have to create destination table and control file for sql loader dynamically since the column names are not same all the time. So, through the normal process I cant create the destination table in advance and also cant create control file mentioning the column names. Can you please advice on this?