In most of the applications, there is practically no difference between 11169568236203649 and 11169568236203648. Unfortunately, every digit matters in the case of function mod.

## Popular Topics

### I am compiling a list of leading data providers. This is the current list. If your favorite data providers are not included here and you feel they should be, please contact me at jay.zhou@deep-data-mining.com. Thanks. Dr. Zhou.

#### Popular Topics

## Thursday, May 31, 2012

### Floating Point Representation: the Loss of Precision

On SAS PC, 11169568236203649 is actually represented as 11169568236203648 due to the limitation of 64-bit IEEE floating point representation.

In most of the applications, there is practically no difference between 11169568236203649 and 11169568236203648. Unfortunately, every digit matters in the case of function mod.

In most of the applications, there is practically no difference between 11169568236203649 and 11169568236203648. Unfortunately, every digit matters in the case of function mod.

## Wednesday, May 30, 2012

### A Clarification on SAS mod function.

I am a long term SAS fan. Its statistical functions and powerful macro scripting capability are simply unprecedented. There is a little bit of misunderstanding concerning my post on MOD function. The following is a clarification from Rick Langston from SAS.

Rick's clarification follows.

--------------------------------------------------------

There is no such "bug" in the MOD function in SAS.

The problem is due to the limitation of IEEE floating point

representation with the number of digits used in the example.

The number 11169568236203649 cannot be precisely stored in a 64-bit

IEEE floating point value. It is indistinguishable from

11169568236203648, for example.

SAS uses floating point representation for all of its numeric values.

Note that if you apply the same constraints to databases, you

will see the same results as SAS. For example, on Teradata

if you submit this expression:

select cast(11169568236203649 as float) mod cast(30269 as float);

You get the response of

1.17310000000000E 004

Which is just what SAS provides.

And consider this C code:

main()

{

double x;

x = (long long)11169568236203649.0 % (long long)30269.0;

printf("x=%f.\n",x);

x = 11169568236203649LL % 30269LL;

printf("x=%f.\n",x);

}

The results are this:

x=11731.000000.

x=11732.000000.

It's the same issue. The double-precision value cannot

be represented sufficiently. If the value remains as a long

integer, the representation is maintained.

Note also that when you run the same MOD function example on

z/OS (i.e. IBM mainframe), you get 11732. This is because the

IBM floating point representation has 4 more bits of mantissa

(at the expense of 4 less bits of exponent) than does IEEE. So

more digits of precision will be noticed, as is the case with the

example.

It had been commented that 64-bit systems might make a difference,

such as SAS on 64-bit Windows as compared to 32-bit Windows.

That is not the case here. The IEEE floating point representation

used by SAS is 64-bit on all platforms. Only the z/OS implementation

of SAS uses the IBM representation.

Rick's clarification follows.

--------------------------------------------------------

There is no such "bug" in the MOD function in SAS.

The problem is due to the limitation of IEEE floating point

representation with the number of digits used in the example.

The number 11169568236203649 cannot be precisely stored in a 64-bit

IEEE floating point value. It is indistinguishable from

11169568236203648, for example.

SAS uses floating point representation for all of its numeric values.

Note that if you apply the same constraints to databases, you

will see the same results as SAS. For example, on Teradata

if you submit this expression:

select cast(11169568236203649 as float) mod cast(30269 as float);

You get the response of

1.17310000000000E 004

Which is just what SAS provides.

And consider this C code:

main()

{

double x;

x = (long long)11169568236203649.0 % (long long)30269.0;

printf("x=%f.\n",x);

x = 11169568236203649LL % 30269LL;

printf("x=%f.\n",x);

}

The results are this:

x=11731.000000.

x=11732.000000.

It's the same issue. The double-precision value cannot

be represented sufficiently. If the value remains as a long

integer, the representation is maintained.

Note also that when you run the same MOD function example on

z/OS (i.e. IBM mainframe), you get 11732. This is because the

IBM floating point representation has 4 more bits of mantissa

(at the expense of 4 less bits of exponent) than does IEEE. So

more digits of precision will be noticed, as is the case with the

example.

It had been commented that 64-bit systems might make a difference,

such as SAS on 64-bit Windows as compared to 32-bit Windows.

That is not the case here. The IEEE floating point representation

used by SAS is 64-bit on all platforms. Only the z/OS implementation

of SAS uses the IBM representation.

-----------------------

End of Rick's clarification

End of Rick's clarification

### It is not a bug in SAS mod function.

My original suspicion was an issue with SAS mod function itself. It is actually because the number 11,169,568,236,203,649 is larger than the maxim integer9,007,199,254,740,992 supported by SAS PC vesion. It is a not a bug since it is clearly stated in the document that there is a maximum integer without rounding supported by SAS.

I was building random generators on both SAS and Oracle. It is required that given the same initial number, the random generators on SAS and Oracle will produce the same series of pseudo random numbers. Since it is impossible to match the SAS and Oracle results when the initial number is longer than 16 digits (Oracle integer supports 38 digits of precision), in my code I will throw an exception, stop the process and print a message similar to ORA-01438: "value larger than specified precision allowed for this column".

I was building random generators on both SAS and Oracle. It is required that given the same initial number, the random generators on SAS and Oracle will produce the same series of pseudo random numbers. Since it is impossible to match the SAS and Oracle results when the initial number is longer than 16 digits (Oracle integer supports 38 digits of precision), in my code I will throw an exception, stop the process and print a message similar to ORA-01438: "value larger than specified precision allowed for this column".

## Tuesday, May 29, 2012

### A Bug in SAS MOD Function.

MOD( m, n) is one of the most commonly used functions. It returns the remainder of m divided by n and the result is deterministic. We would expect it to be calculated correctly by any data analysis software packages.

However, we have found an issue in SAS mod function.The calculation we tested is mod(11169568236203649, 30269). SAS returns different results than Oracle and R do. SAS returns 11731. Both Oracle and R return 11732.

SAS script (SAS Version 9.2) and the result.

data test; val=mod(11169568236203649, 30269) ; run;

11731

or

proc sql; select mod(11169568236203649, 30269) from dummy;

11731

Oracle SQL script and the result.

select mod(11169568236203649, 30269) from dual;

11732

R script and the result.

11169568236203649 %% 30269

11732

We have verified that 11732 is the correct answer using two independent approaches. You may try the above scripts on your own computer and let me know what you find out. ( I explain my approaches to verify that 11732 is the correct answer of mod(11169568236203649,30269) and why I bother to calculate mod(x, 30269) in other posts on this blog.)

However, we have found an issue in SAS mod function.The calculation we tested is mod(11169568236203649, 30269). SAS returns different results than Oracle and R do. SAS returns 11731. Both Oracle and R return 11732.

SAS script (SAS Version 9.2) and the result.

data test; val=mod(11169568236203649, 30269) ; run;

11731

or

proc sql; select mod(11169568236203649, 30269) from dummy;

11731

Oracle SQL script and the result.

select mod(11169568236203649, 30269) from dual;

11732

R script and the result.

11169568236203649 %% 30269

11732

We have verified that 11732 is the correct answer using two independent approaches. You may try the above scripts on your own computer and let me know what you find out. ( I explain my approaches to verify that 11732 is the correct answer of mod(11169568236203649,30269) and why I bother to calculate mod(x, 30269) in other posts on this blog.)

### Why Bother to Calculate Mod(x, 30269)?

I was trying to implement Wichmann-Hill random number generator in Oracle. The calculation of mod(x, 30269) is part of the algorithm. The algorithm is described in Algorithm AS 183: An Efficient and Portable Pseudo-Random Number Generator.

### SAS Is Wrong! R and Oracle Are Right. MOD(11169568236203649, 30269)=11732

For function MOD(11169568236203649, 30269), SAS returns 11731 while both Oracle and R return 11732. I used the following two approaches to verify SAS Is Wrong. Oracle and R are right.

Approach 1. The following two Oracle Queries show that 11732 is the remainder

SQL> select ( 11169568236203649 -11732)/30269 vl from dual;

369010150193.000000

SQL> select ( 11169568236203649 -11731)/30269 vl from dual;

369010150193.000033

Approach 2. The following PL/SQL scripts.

At first, I wanted to run the following PL/SQL Script 1. It took too long and I had to kill the process. Instead, I ran Script 2 shown below and it returned 11732 at the end of the loop.

PL/SQL Script 1.

set serveroutput on;

declare

remainder number;

begin

remainder:=11169568236203649;

while (remainder >30269)

loop

remainder:=remainder-30269;

end loop;

dbms_output.put_line(remainder);

end;

/

PL/SQL Script 2.

set serveroutput on;

declare

remainder number;

begin

remainder:=11169568236203649;

while (remainder>302690000000)

loop

remainder:=remainder-302690000000; /* equivalent of 10,000,000 loops of -30269 */

end loop;

dbms_output.put_line(remainder);

while (remainder>30269)

loop

remainder:=remainder-30269;

end loop;

dbms_output.put_line(remainder);

end;

/

At the end of the loop, it returns 11732.

## Monday, May 28, 2012

### Randomly Sample a Precise Number of Records

With Oracle dbms_random.random and row_number function, we can easily randomly select a certain number of records from the data.

The following scripts select 50 records randomly from each group (strata). First, a number number is generate for each record. Then a rank is generated by row_number function based on the random number for each strata. Finally, only records ranked top 50 in each strata are selected. The rank of random number is also random.

create table tbl_id_w_rnd as select id, strata, dbms_random.random() rnd from tbl_data;

create materialized view mv_precise_50_samples as

With tbl as

(

Select a.*, row_number() over(partition by strata order by rnd) rnk from tbl_id_w_rnd a

)

Select * from tbl where rnk<=50;

--test

select strata, count(1) from mv_precise_50_samples group by strata order by strata;

## Sunday, May 27, 2012

### Calculate Gain Chart Using SQL in Oracle

The following script calculates the gain chart based on score and payment_ind (target variable, payment or non payment). The mod function is to reduce the number of data points in the gain chart

with tbl2 as

(select a.*, row_number() over(order by score desc ) risk_rank from tbl_testing_data a),

tbl3 as

( select a.*,

count(1) over(order by risk_rank) total_num,

sum(case when payment_ind='Y' then 1 else 0 end ) over(order by risk_rank) total_bad,

count(1) over(order by risk_rank)/count(1) over() pcnt_tot,

sum(case when payment_ind='Y' then 1 else 0 end ) over(order by risk_rank)/ sum(case when payment_ind='Y' then 1 else 0 end ) over()

pcnt_bad

from tbl2 a

)

select * from tbl3 where mod(risk_rank,100)=1 order by risk_rank;

### Remove Duplicates From Data

It is quite common to have duplicates in the data. From example, an application is submitted or a claim is filed multiple times. With Oracle function row_number function, we can easily remove the duplicates.

The following SQL first calculates the rank of records by app_date for each app_id, then it only selects those records with ranks equals 1. The only earliest record for each app_id will be selected.

with tbl as

(

select a.*,

row_number() over(partition by app_id order by APP_DATE ) rnk from CELL_PHONE_SERVICE_APPS a

)

select * from tbl where rnk=1;

If we want to keep the latest record for each app_id, we simply generate rank for each app_id by the descending order of app_date as shown below.

with tbl as

(

select a.*,

row_number() over(partition by app_id order by APP_DATE

**desc**) rnk from CELL_PHONE_SERVICE_APPS a

)

select * from tbl where rnk=1;

## Saturday, May 26, 2012

### SAS vs SQL Database: Select an Enterprise Analytic Platform

Without doubt, SAS has the most sophisticated statistical models. However, as data analyitcs practitioners in real business world, we have realized that 80% or more of the time is spent on data preparation and model deployment. Thus, to decide which analytic platform to use in an company, we need to carefully assess factors including how the models will be used (online or offline), the staff's knowlege levels about SAS and SQL, data security, data size, costs, etc.

In my opinion, data should be stored, managed and analyzed in a SQL based database as much as possible. For example, with Oracle 11g, we can perform all the tasks including data collection, summary, manipulation, model building, model scoring, reporting, within a single environment. This in-database analytics approach greatly increases the productivity, manageability and security.

In my opinion, data should be stored, managed and analyzed in a SQL based database as much as possible. For example, with Oracle 11g, we can perform all the tasks including data collection, summary, manipulation, model building, model scoring, reporting, within a single environment. This in-database analytics approach greatly increases the productivity, manageability and security.

### SAS and SQL (Oracle) Version of Univariate Statistics

The following SAS and SQL scripts produce exactly the same results.

SAS Version

proc univariate data = claims ;

by grp;

var payment_amt;

output out = claims_stats N = lines min = var_min

max = var_max mean = var_mean std = var_stnd_dev;

Run;

SQL Version

create materialized view claims

as select

grp,

count(1) lines,

minpayment_amt) var_min,

max(payment_amt) var_max,

avg(payment_amt) var_mean,

stddev(payment_amt) var_stnd_dec

from claim

group by grp;

### More On Oracle External Table

Personally, I prefer external table over SQL Loader. When possible, I created a external table and run some queries to make sure the text file looks good. Then I loaded them as a table or materialized view. Basically I run the one of the following queries to load the data.

create table tbl_actual_tab as select * from tbl_external_table;

or

create materialized view tbl_actual_tab as select * from tbl_external_table;

create table tbl_actual_tab as select * from tbl_external_table;

or

create materialized view tbl_actual_tab as select * from tbl_external_table;

### How to Query a Text File In Oracle

With Oracle external tables, we can run SQL queries against text files without physically loading them into the database. To create a external table, simply specify the "ORGANIZATION EXTERNAL" parameters after "Create table". Once it is created, you can query the text file like a regular table.

CREATE TABLE "BDM"."TBL_TRADE_EXT" (

"SYMBOL" VARCHAR2(32),

"DT" VARCHAR2(32),

"OPEN" NUMBER, "HIGH" NUMBER,

"LOW" NUMBER, "CLOSE" NUMBER,

"ADJ_CLOSE" NUMBER )

ORGANIZATION EXTERNAL (

TYPE ORACLE_LOADER DEFAULT DIRECTORY "DIR_MKT_DATA"

ACCESS PARAMETERS ( records delimited by '\n' SKIP 1 fields terminated by "," )

LOCATION ( 'all.csv' ) )

CREATE TABLE "BDM"."TBL_TRADE_EXT" (

"SYMBOL" VARCHAR2(32),

"DT" VARCHAR2(32),

"OPEN" NUMBER, "HIGH" NUMBER,

"LOW" NUMBER, "CLOSE" NUMBER,

"ADJ_CLOSE" NUMBER )

ORGANIZATION EXTERNAL (

TYPE ORACLE_LOADER DEFAULT DIRECTORY "DIR_MKT_DATA"

ACCESS PARAMETERS ( records delimited by '\n' SKIP 1 fields terminated by "," )

LOCATION ( 'all.csv' ) )

## Friday, May 25, 2012

### Calculate Histogram Using Oracle Function

The following Oracle SQL script calculates the histogram. width_bucket() and ratio_to_report() are two useful functions.

Width_buck function divides the CREDIT_SCORE into 10 equal length bins based on the minimum and maximum values.

Ratio_to_report function calculate the percentage of number of records, i.e., count(*), for each bin created by with_bucket function in step2.

with

tbl as (

select min(CREDIT_SCORE) low, max(CREDIT_SCORE) high

from CELL_PHONE_SERVICE_APPS),

tbl2 as (

select width_bucket(CREDIT_SCORE, low, high, 10) s, CREDIT_SCORE

from CELL_PHONE_SERVICE_APPS, tbl

)

select s,

min(CREDIT_SCORe) lower, max(credit_score) upper,

count(1) num,

round((ratio_to_report(count(1)) over())*100,1) pcnt

from tbl2 group by s order by s;

### Calculate Histogram for a Text File

Often the data are in text format before they are loaded into a database. It is a good practice to check the histogram of each data columns in the text file.

For example, we have csv file x.txt as shown below.

1,a

1,b

2,c

3,a

4,d

5,c

6,e

If the operating system is UNIX or Cygwin (a free unix simulator for Windows that can be downloaded here. we can use the follow commands to calculate histogram form column 2. Awk program defines "," as delimiter and print the second column ($2). Then the output data is sorted. Uniq counts the number of occurrences.

$ cat x.txt | awk -F"," '{print $2}' | sort | uniq -c

The following is the output.

2 a

1 b

2 c

1 d

1 e

For example, we have csv file x.txt as shown below.

1,a

1,b

2,c

3,a

4,d

5,c

6,e

If the operating system is UNIX or Cygwin (a free unix simulator for Windows that can be downloaded here. we can use the follow commands to calculate histogram form column 2. Awk program defines "," as delimiter and print the second column ($2). Then the output data is sorted. Uniq counts the number of occurrences.

$ cat x.txt | awk -F"," '{print $2}' | sort | uniq -c

The following is the output.

2 a

1 b

2 c

1 d

1 e

Subscribe to:
Posts (Atom)