Deep Data Mining Blog: May 2012

Thursday, May 31, 2012

Floating Point Representation: the Loss of Precision

On SAS PC, 11169568236203649 is actually represented as 11169568236203648 due to the limitation of 64-bit IEEE floating point representation.

In most of the applications, there is practically no difference between 11169568236203649 and 11169568236203648. Unfortunately, every digit matters in the case of function mod.

Wednesday, May 30, 2012

A Clarification on SAS mod function.

I am a long term SAS fan. Its statistical functions and powerful macro scripting capability are simply unprecedented. There is a little bit of misunderstanding concerning my post on MOD function. The following is a clarification from Rick Langston from SAS.

Rick's clarification follows.
--------------------------------------------------------
There is no such "bug" in the MOD function in SAS.
The problem is due to the limitation of IEEE floating point
representation with the number of digits used in the example.
The number 11169568236203649 cannot be precisely stored in a 64-bit
IEEE floating point value. It is indistinguishable from
11169568236203648, for example.
SAS uses floating point representation for all of its numeric values.
Note that if you apply the same constraints to databases, you
will see the same results as SAS. For example, on Teradata
if you submit this expression:

select cast(11169568236203649 as float) mod cast(30269 as float);
You get the response of
1.17310000000000E 004
Which is just what SAS provides.

And consider this C code:
main()
{
double x;
x = (long long)11169568236203649.0 % (long long)30269.0;
printf("x=%f.\n",x);
x = 11169568236203649LL % 30269LL;
printf("x=%f.\n",x);
}

The results are this:
x=11731.000000.
x=11732.000000.

It's the same issue. The double-precision value cannot
be represented sufficiently. If the value remains as a long
integer, the representation is maintained.

Note also that when you run the same MOD function example on
z/OS (i.e. IBM mainframe), you get 11732. This is because the
IBM floating point representation has 4 more bits of mantissa
(at the expense of 4 less bits of exponent) than does IEEE. So
more digits of precision will be noticed, as is the case with the
example.

It had been commented that 64-bit systems might make a difference,
such as SAS on 64-bit Windows as compared to 32-bit Windows.
That is not the case here. The IEEE floating point representation
used by SAS is 64-bit on all platforms. Only the z/OS implementation
of SAS uses the IBM representation.

-----------------------
End of Rick's clarification

It is not a bug in SAS mod function.

My original suspicion was an issue with SAS mod function itself. It is actually because the number 11,169,568,236,203,649 is larger than the maxim integer9,007,199,254,740,992 supported by SAS PC vesion. It is a not a bug since it is clearly stated in the document that there is a maximum integer without rounding supported by SAS.

I was building random generators on both SAS and Oracle. It is required that given the same initial number, the random generators on SAS and Oracle will produce the same series of pseudo random numbers. Since it is impossible to match the SAS and Oracle results when the initial number is longer than 16 digits (Oracle integer supports 38 digits of precision), in my code I will throw an exception, stop the process and print a message similar to ORA-01438: "value larger than specified precision allowed for this column".

Tuesday, May 29, 2012

A Bug in SAS MOD Function.

MOD( m, n) is one of the most commonly used functions. It returns the remainder of m divided by n and the result is deterministic. We would expect it to be calculated correctly by any data analysis software packages.

However, we have found an issue in SAS mod function.The calculation we tested is mod(11169568236203649, 30269). SAS returns different results than Oracle and R do. SAS returns 11731. Both Oracle and R return 11732.

SAS script (SAS Version 9.2) and the result.
data test; val=mod(11169568236203649, 30269) ; run;
11731
or
proc sql; select mod(11169568236203649, 30269) from dummy;
11731

Oracle SQL script and the result.
select mod(11169568236203649, 30269) from dual;
11732

R script and the result.
11169568236203649 %% 30269
11732

We have verified that 11732 is the correct answer using two independent approaches. You may try the above scripts on your own computer and let me know what you find out. ( I explain my approaches to verify that 11732 is the correct answer of mod(11169568236203649,30269) and why I bother to calculate mod(x, 30269) in other posts on this blog.)

Why Bother to Calculate Mod(x, 30269)?

I was trying to implement Wichmann-Hill random number generator in Oracle. The calculation of mod(x, 30269) is part of the algorithm. The algorithm is described in Algorithm AS 183: An Efficient and Portable Pseudo-Random Number Generator.

SAS Is Wrong! R and Oracle Are Right. MOD(11169568236203649, 30269)=11732

For function MOD(11169568236203649, 30269), SAS returns 11731 while both Oracle and R return 11732. I used the following two approaches to verify SAS Is Wrong. Oracle and R are right.
Approach 1. The following two Oracle Queries show that 11732 is the remainder
SQL> select ( 11169568236203649 -11732)/30269 vl from dual;
369010150193.000000
SQL> select ( 11169568236203649 -11731)/30269 vl from dual;
369010150193.000033

Approach 2. The following PL/SQL scripts.
At first, I wanted to run the following PL/SQL Script 1. It took too long and I had to kill the process. Instead, I ran Script 2 shown below and it returned 11732 at the end of the loop.
PL/SQL Script 1.
set serveroutput on;
declare
remainder number;
begin
remainder:=11169568236203649;
while (remainder >30269)
loop
remainder:=remainder-30269;
end loop;
dbms_output.put_line(remainder);
end;
/
PL/SQL Script 2.
set serveroutput on;
declare
remainder number;
begin
remainder:=11169568236203649;
while (remainder>302690000000)
loop
remainder:=remainder-302690000000; /* equivalent of 10,000,000 loops of -30269 */
end loop;
dbms_output.put_line(remainder);
while (remainder>30269)
loop
remainder:=remainder-30269;
end loop;
dbms_output.put_line(remainder);
end;
/
At the end of the loop, it returns 11732.

Monday, May 28, 2012

Randomly Sample a Precise Number of Records

With Oracle dbms_random.random and row_number function, we can easily randomly select a certain number of records from the data.

The following scripts select 50 records randomly from each group (strata). First, a number number is generate for each record. Then a rank is generated by row_number function based on the random number for each strata. Finally, only records ranked top 50 in each strata are selected. The rank of random number is also random.

create table tbl_id_w_rnd as select id, strata, dbms_random.random() rnd from tbl_data;

create materialized view mv_precise_50_samples as
With tbl as
(
Select a.*, row_number() over(partition by strata order by rnd) rnk from tbl_id_w_rnd a
)
Select * from tbl where rnk<=50;
--test
select strata, count(1) from mv_precise_50_samples group by strata order by strata;

Sunday, May 27, 2012

Calculate Gain Chart Using SQL in Oracle

The following script calculates the gain chart based on score and payment_ind (target variable, payment or non payment). The mod function is to reduce the number of data points in the gain chart

with tbl2 as
(select a.*, row_number() over(order by score desc ) risk_rank from tbl_testing_data a),
tbl3 as
( select a.*,
count(1) over(order by risk_rank) total_num,
sum(case when payment_ind='Y' then 1 else 0 end ) over(order by risk_rank) total_bad,
count(1) over(order by risk_rank)/count(1) over() pcnt_tot,
sum(case when payment_ind='Y' then 1 else 0 end ) over(order by risk_rank)/ sum(case when payment_ind='Y' then 1 else 0 end ) over()
pcnt_bad
from tbl2 a
)
select * from tbl3 where mod(risk_rank,100)=1 order by risk_rank;

Remove Duplicates From Data

It is quite common to have duplicates in the data. From example, an application is submitted or a claim is filed multiple times. With Oracle function row_number function, we can easily remove the duplicates.

The following SQL first calculates the rank of records by app_date for each app_id, then it only selects those records with ranks equals 1. The only earliest record for each app_id will be selected.

with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE ) rnk from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;

If we want to keep the latest record for each app_id, we simply generate rank for each app_id by the descending order of app_date as shown below.

with tbl as
(
select a.*,
row_number() over(partition by app_id order by APP_DATE desc ) rnk from CELL_PHONE_SERVICE_APPS a
)
select * from tbl where rnk=1;

Saturday, May 26, 2012

SAS vs SQL Database: Select an Enterprise Analytic Platform

Without doubt, SAS has the most sophisticated statistical models. However, as data analyitcs practitioners in real business world, we have realized that 80% or more of the time is spent on data preparation and model deployment. Thus, to decide which analytic platform to use in an company, we need to carefully assess factors including how the models will be used (online or offline), the staff's knowlege levels about SAS and SQL, data security, data size, costs, etc.

In my opinion, data should be stored, managed and analyzed in a SQL based database as much as possible. For example, with Oracle 11g, we can perform all the tasks including data collection, summary, manipulation, model building, model scoring, reporting, within a single environment. This in-database analytics approach greatly increases the productivity, manageability and security.

SAS and SQL (Oracle) Version of Univariate Statistics

The following SAS and SQL scripts produce exactly the same results.

SAS Version

proc univariate data = claims ;
by grp;
var payment_amt;
output out = claims_stats N = lines min = var_min
max = var_max mean = var_mean std = var_stnd_dev;
Run;

SQL Version

create materialized view claims
as select
grp,
count(1) lines,
minpayment_amt) var_min,
max(payment_amt) var_max,
avg(payment_amt) var_mean,
stddev(payment_amt) var_stnd_dec
from claim
group by grp;

More On Oracle External Table

Personally, I prefer external table over SQL Loader. When possible, I created a external table and run some queries to make sure the text file looks good. Then I loaded them as a table or materialized view. Basically I run the one of the following queries to load the data.
create table tbl_actual_tab as select * from tbl_external_table;
or
create materialized view tbl_actual_tab as select * from tbl_external_table;

How to Query a Text File In Oracle

With Oracle external tables, we can run SQL queries against text files without physically loading them into the database. To create a external table, simply specify the "ORGANIZATION EXTERNAL" parameters after "Create table". Once it is created, you can query the text file like a regular table.

CREATE TABLE "BDM"."TBL_TRADE_EXT" (
"SYMBOL" VARCHAR2(32),
"DT" VARCHAR2(32),
"OPEN" NUMBER, "HIGH" NUMBER,
"LOW" NUMBER, "CLOSE" NUMBER,
"ADJ_CLOSE" NUMBER )
ORGANIZATION EXTERNAL (
TYPE ORACLE_LOADER DEFAULT DIRECTORY "DIR_MKT_DATA"
ACCESS PARAMETERS ( records delimited by '\n' SKIP 1 fields terminated by "," )
LOCATION ( 'all.csv' ) )

Friday, May 25, 2012

Calculate Histogram Using Oracle Function

The following Oracle SQL script calculates the histogram. width_bucket() and ratio_to_report() are two useful functions.
Width_buck function divides the CREDIT_SCORE into 10 equal length bins based on the minimum and maximum values.
Ratio_to_report function calculate the percentage of number of records, i.e., count(*), for each bin created by with_bucket function in step2.

with
tbl as (
select min(CREDIT_SCORE) low, max(CREDIT_SCORE) high
from CELL_PHONE_SERVICE_APPS),
tbl2 as (
select width_bucket(CREDIT_SCORE, low, high, 10) s, CREDIT_SCORE
from CELL_PHONE_SERVICE_APPS, tbl
)
select s,
min(CREDIT_SCORe) lower, max(credit_score) upper,
count(1) num,
round((ratio_to_report(count(1)) over())*100,1) pcnt
from tbl2 group by s order by s;

Calculate Histogram for a Text File

Often the data are in text format before they are loaded into a database. It is a good practice to check the histogram of each data columns in the text file.
For example, we have csv file x.txt as shown below.

1,a
1,b
2,c
3,a
4,d
5,c
6,e
If the operating system is UNIX or Cygwin (a free unix simulator for Windows that can be downloaded here. we can use the follow commands to calculate histogram form column 2. Awk program defines "," as delimiter and print the second column ($2). Then the output data is sorted. Uniq counts the number of occurrences.

$ cat x.txt | awk -F"," '{print $2}' | sort | uniq -c
The following is the output.

2 a
1 b
2 c
1 d
1 e

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics