Saturday, January 05, 2013

Watch out for invisible characters.

When text files (or csv files, when the field delimiter is a comma) are loaded into a table, the columns sometimes contain invisible characters. These cause unexpected results when we compare values or convert strings to numbers. The Oracle function utl_raw.CAST_TO_RAW is a great way to look inside text strings and uncover those invisible characters.

In one project, we used an ETL tool to move financial transactions into a data warehouse. In a very small number of cases the data showed discrepancies when we compared it before and after the move, even though the values looked identical on screen. Using the utl_raw.CAST_TO_RAW function, we were able to pinpoint the discrepancies, which were caused by invisible characters, and fix the problem.
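A minimal sketch of why such values fail to match even though they look the same. It runs against DUAL, so no table is needed; the literal 'APEI' is only a sample value, and the column aliases are made up for illustration.

```sql
-- Compare a clean string with the same string plus a trailing new line.
select case when 'APEI' = 'APEI' || chr(10)
            then 'same' else 'different' end as cmp,
       length('APEI')                        as len_clean,
       length('APEI' || chr(10))             as len_dirty
  from dual;
-- CMP is 'different'; the lengths are 4 and 5.
```

Comparing length() before and after a load is a cheap first check when two values look alike but do not match.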

For example, suppose a table contains text strings with stock prices. The following query shows one record containing Symbol, Date, Open, High, Low, Close, Volume, and Adj Close.
SQL> select quotes from TBL_SYMBOL_QUOTES_CUR a where rownum <2; 

QUOTES
---------------------------------------------------------------- 
APEI,2013-01-02,36.89,37.34,36.59,36.96,198700,36.96 
The above query result looks OK. But the text string actually ends with an invisible newline character (hex value 0A, decimal value 10). We can use utl_raw.CAST_TO_RAW to uncover this.
SQL> select utl_raw.CAST_TO_RAW(quotes) quotes_raw from TBL_SYMBOL_QUOTES_CUR a where rownum <2; 

QUOTES_RAW 
------------------------------------------------------------------------------------------------------------------------------------ 
415045492C323031332D30312D30322C33362E38392C33372E33342C33362E35392C33362E39362C3139383730302C33362E39360A 
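To read the output above, note that each pair of hex digits is one byte: 41 50 45 49 is APEI, 2C is a comma, and the trailing 0A is the newline. The sketch below shows the same thing on a short literal, then counts the affected rows. It assumes the table and column from the example; the aliases are made up for illustration.

```sql
-- 41 = 'A', 42 = 'B', 0A = new line.
select utl_raw.cast_to_raw('AB' || chr(10)) as raw_value from dual;
-- In a single-byte or UTF-8 database character set this returns 41420A.

-- Count the rows whose text contains a new line anywhere in the string.
select count(*) as rows_with_newline
  from TBL_SYMBOL_QUOTES_CUR
 where instr(quotes, chr(10)) > 0;
```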
An ASCII table shows the numerical representation of each character. We can remove the unwanted characters using the Oracle replace function. chr(10) is the character with decimal value 10 (hex value 0A), the newline character.
SQL> select replace(quotes,chr(10)) from TBL_SYMBOL_QUOTES_CUR a where rownum <2;
The following query shows that the newline character has been removed.
SQL> select utl_raw.CAST_TO_RAW(replace(quotes,chr(10))) from TBL_SYMBOL_QUOTES_CUR a where rownum <2; 

UTL_RAW.CAST_TO_RAW(REPLACE(QUOTES,CHR(10))) 
------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------- 
415045492C323031332D30312D30322C33362E38392C33372E33342C33362E35392C33362E39362C3139383730302C33362E3936 
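The queries above only strip the character in the result set. To clean the stored data, the same replace call could go into an update. This is a sketch under stated assumptions: the table and column are the ones from the example, and handling carriage return (chr(13)) as well is an added assumption, since files created on Windows often end lines with chr(13) followed by chr(10).

```sql
-- Strip carriage returns and new lines from the stored strings.
-- Only rows that actually contain one of the two characters are touched.
update TBL_SYMBOL_QUOTES_CUR
   set quotes = replace(replace(quotes, chr(13)), chr(10))
 where instr(quotes, chr(10)) > 0
    or instr(quotes, chr(13)) > 0;

-- Check the reported row count first, then make the change permanent:
-- commit;
```

Note that replace removes the character everywhere in the string, not only at the end, so this fits data where a newline is never legitimate inside a value.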
