Friday, May 25, 2012

Calculate Histogram for a Text File

Often the data are in text format before they are loaded into a database. It is good practice to check the histogram of each data column in the text file.
For example, suppose we have a CSV file x.txt as shown below.

1,a
1,b
2,c
3,a
4,d
5,c
6,e
If the operating system is UNIX or Cygwin (a free Unix-like environment for Windows), we can use the following command to calculate the histogram for column 2. The awk program defines "," as the delimiter and prints the second column ($2). The output is then sorted so that identical values end up next to each other, and uniq -c counts the number of occurrences of each value.

$ cat x.txt | awk -F"," '{print $2}' | sort | uniq -c
The following is the output.

     2 a
     1 b
     2 c
     1 d
     1 e
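
The same counts can also be produced by awk alone, using an associative array keyed on the column value. The following is a minimal sketch, not the only way to do it; the array name count and the output file hist.txt are made up for illustration, and the first line simply recreates the sample x.txt from above. Because awk does not guarantee the order in which array keys are visited, the result is piped through sort to put it in order by value.

```shell
# Recreate the sample file used in this post.
printf '1,a\n1,b\n2,c\n3,a\n4,d\n5,c\n6,e\n' > x.txt

# Tally each distinct value of column 2 in an associative array
# ("count" is an illustrative name), then print "occurrences value".
# sort -k2 orders the lines by the value rather than by the count.
awk -F, '{ count[$2]++ } END { for (v in count) print count[v], v }' x.txt \
  | sort -k2 > hist.txt

cat hist.txt
```

This avoids sorting the whole input before counting, which may help on large files with few distinct values; only the short summary is sorted at the end.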
