Wednesday, November 13, 2013

"Fix" a Text File that Is in Unix Format

Problem

We expect a text file to be in a format similar to the following.
ADS_ID,DEVICE_OS,NUM_IMPRESSION,NUM_CONVERSION
 32, Android, 1, 0
 32, Android, 1, 0
 32, Android, 1, 0
 32, Android, 2, 0
 32, Android, 1, 0
 32, Android, 2, 0
 32, Android, 1, 0
 32, Android, 3, 0
 32, Android, 2, 0
However, when we open it in Microsoft Notepad, it looks strange: the records all appear on one line. How do we fix a text file that looks "broken" in Notepad?


Solution

This is most likely because the file is stored in Unix format instead of Windows format. The line break in a Unix text file is a single character, 10 ('\n', or '0A' in hexadecimal), while a Windows text file uses two characters, 13 and 10 ('\r\n', or '0D0A'). We can find out whether a text file is in Unix or Windows format using the Linux command "od" (on a Windows computer we can install the free Cygwin, which lets us use Linux commands like "od").
$ cat ads_log_small.txt | od -c
0000000   A   D   S   _   I   D   ,   D   E   V   I   C   E   _   O   S
0000020   ,   N   U   M   _   I   M   P   R   E   S   S   I   O   N   ,
0000040   N   U   M   _   C   O   N   V   E   R   S   I   O   N  \n
0000060   3   2   ,       A   n   d   r   o   i   d   ,       1   ,
0000100   0  \n       3   2   ,       A   n   d   r   o   i   d   ,
0000120   1   ,       0  \n       3   2   ,       A   n   d   r   o   i
0000140   d   ,       1   ,       0  \n       3   2   ,       A   n   d
0000160   r   o   i   d   ,       2   ,       0  \n       3   2   ,
0000200   A   n   d   r   o   i   d   ,       1   ,       0  \n       3
0000220   2   ,       A   n   d   r   o   i   d   ,       2   ,       0
0000240  \n       3   2   ,       A   n   d   r   o   i   d   ,       1
0000260   ,       0  \n       3   2   ,       A   n   d   r   o   i   d
0000300   ,       3   ,       0  \n       3   2   ,       A   n   d   r
0000320   o   i   d   ,       2   ,       0  \n
0000332 
We see that the line break is a single character, \n. We can fix the file by converting it to Windows format using the Linux (or Cygwin on Windows) command unix2dos.

$ unix2dos ads_log_small.txt
unix2dos: converting file ads_log_small.txt to DOS format ...
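
If unix2dos is not installed, the same conversion can be sketched in a few lines of Python. This is only a minimal sketch: the function names unix_to_dos and convert_file are made up for illustration, and it assumes the file is small enough to read into memory at once.

```python
def unix_to_dos(data: bytes) -> bytes:
    """Return data with every line break written as \\r\\n (hypothetical helper)."""
    # First collapse any existing \r\n pairs to \n so that lines that are
    # already in Windows format do not end up as \r\r\n, then expand
    # every \n to \r\n.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_file(path: str) -> None:
    """Rewrite the file at path in place with Windows line breaks (hypothetical helper)."""
    # Binary mode stops Python from translating line breaks on its own.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(unix_to_dos(data))
```

Calling convert_file("ads_log_small.txt") would then play the role of the unix2dos command above. Running it twice is harmless, because existing \r\n pairs are normalized before the conversion.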

We run the "od" command again. It shows that the line breaks are now two characters, \r\n. Now Notepad can display the file correctly. We can convert a Windows text file back into a Unix file using the dos2unix command.
$ cat ads_log_small.txt | od -c

0000000   A   D   S   _   I   D   ,   D   E   V   I   C   E   _   O   S
0000020   ,   N   U   M   _   I   M   P   R   E   S   S   I   O   N   ,
0000040   N   U   M   _   C   O   N   V   E   R   S   I   O   N  \r  \n
0000060       3   2   ,       A   n   d   r   o   i   d   ,       1   ,
0000100       0  \r  \n       3   2   ,       A   n   d   r   o   i   d
0000120   ,       1   ,       0  \r  \n       3   2   ,       A   n   d
0000140   r   o   i   d   ,       1   ,       0  \r  \n       3   2   ,
0000160       A   n   d   r   o   i   d   ,       2   ,       0  \r  \n
0000200       3   2   ,       A   n   d   r   o   i   d   ,       1   ,
0000220       0  \r  \n       3   2   ,       A   n   d   r   o   i   d
0000240   ,       2   ,       0  \r  \n       3   2   ,       A   n   d
0000260   r   o   i   d   ,       1   ,       0  \r  \n       3   2   ,
0000300       A   n   d   r   o   i   d   ,       3   ,       0  \r  \n
0000320       3   2   ,       A   n   d   r   o   i   d   ,       2   ,
0000340       0  \r  \n
0000344
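
As an alternative to reading the od dump by eye, the check for Unix or Windows line breaks can also be sketched in Python. The names detect_line_breaks and detect_file are hypothetical, and the sketch only counts \n and \r\n; it does not try to handle other conventions.

```python
def detect_line_breaks(data: bytes) -> str:
    """Classify line breaks as 'windows', 'unix', 'mixed', or 'none' (hypothetical helper)."""
    crlf = data.count(b"\r\n")
    # Every \r\n pair also contains a \n, so subtract the pairs
    # to get the number of bare \n line breaks.
    lf_only = data.count(b"\n") - crlf
    if crlf and lf_only:
        return "mixed"
    if crlf:
        return "windows"
    if lf_only:
        return "unix"
    return "none"


def detect_file(path: str) -> str:
    """Read the file in binary mode and classify its line breaks (hypothetical helper)."""
    with open(path, "rb") as f:
        return detect_line_breaks(f.read())
```

For the file in this post, detect_file("ads_log_small.txt") should report "unix" before the conversion and "windows" after it, matching the two od dumps.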

Unlike Notepad, Microsoft WordPad is able to display text files in both Unix and Windows formats correctly.
