Deep Data Mining Blog: Find All Unique Characters In a Text File

Wednesday, November 28, 2018

Find All Unique Characters In a Text File

There is a file containing 7,288 lines of Chinese text (see below). I want to find all the characters that appear in the text.

$ head -2 id_baike2.txt
1
宁明县爱店镇成立于1992年3月，位于中越陆地边境公母山下，与越南谅山省禄平县接壤，
边境线长达25.5公里，东兴至那坡沿边公路贯穿其间。
爱店为国家一类口岸，是我国西南经济板块通往东南亚的陆路效能要道。
全镇下辖3个村委会19个自然屯，总面积65.79平方公里，耕地面积7225亩，
粮食播种面积2700亩，总人8055口人，其中流动人口2000余人。
爱店镇1995年被国家建设部定为全国500家小城镇建设试点之一，
2002年被定为全区小城镇建设重点镇。城镇各种基础设施日臻完善。
镇区主要街道全部实现硬化和街砖铺设。
2006年以来，爱店镇结合崇左市开展市容环境综合整治竞赛活动和城乡清洁工程，
与有关部门共同筹措资金共561.7万元投入小城镇建设，
不断扩大城镇规模和完善城镇功能，搞好环境卫生，提高口岸服务水平，
树立良好国门形象。著名的“金牛潭”风景区，
常年流水潺潺，怪石嶙峋，古木参天，景色迷人。
历代的文人名士曾在潭边题字留墨，常年吸引大批游客前来观光旅游。
镇内公母山海拔1358米，山奇水秀，景色宜人。主要有金牛潭生态旅游度假村、
公母山庄、爱店起义纪念碑、

I use the following Linux command to accomplish this.

awk -F "" '{for (i = 1; i <= NF; i++) print $i}' id_baike2.txt | sort | uniq -c > char_list.txt
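The command only gives per-character counts if awk splits each line into characters rather than bytes. As far as I know, that depends on the awk implementation and the locale (an empty field separator is not guaranteed to behave the same everywhere). The sketch below is not part of the pipeline. It only illustrates, in Python, how the two ways of splitting differ for UTF-8 Chinese text. The function name count_units and the sample string are made up for illustration.

```python
import collections

def count_units(text):
    """Return (character counts, byte counts) for the same text."""
    # Counting over the str counts whole characters.
    char_counts = collections.Counter(text)
    # Counting over the UTF-8 encoding counts individual bytes instead.
    byte_counts = collections.Counter(text.encode("utf8"))
    return char_counts, byte_counts

sample = "爱店镇"  # illustrative sample, three characters
chars, raw = count_units(sample)
print(sum(chars.values()))  # total characters
print(sum(raw.values()))    # total bytes
```

If char_list.txt contains unreadable fragments instead of whole characters, byte-wise splitting is a likely cause.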

In char_list.txt, the number on the left of each line is the number of times that character appears in the file.

I also use the following Python script to do the same calculation.

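import collections

# Read the whole file as UTF-8 so that every Chinese character is one element.
with open('id_baike2.txt', encoding='utf8') as f:
    text = f.read()

# Count every character. Joining the lines with ' ' would inflate the space count.
frq = collections.Counter(text)

# Print the 100 most frequent characters together with their counts.
print(frq.most_common(100))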
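To get output closer to char_list.txt, the counts can be listed one character per line with the count on the left. The following is a minimal sketch under a few assumptions: the helper names char_frequencies and format_like_uniq are hypothetical, skipping whitespace is an optional choice that the shell command does not make, and the column width only approximates what uniq -c prints. Python sorts by code point, so the order may differ from that of sort, which follows the locale.

```python
import collections

def char_frequencies(text, skip_whitespace=True):
    """Count each character in text, optionally ignoring whitespace."""
    if skip_whitespace:
        text = (ch for ch in text if not ch.isspace())
    return collections.Counter(text)

def format_like_uniq(counter):
    """Render counts as 'count character' lines, sorted by character."""
    return "\n".join(f"{n:7d} {ch}" for ch, n in sorted(counter.items()))

sample = "爱店镇 爱店\n"  # illustrative sample, not taken from the file
freq = char_frequencies(sample)
print(sorted(freq))            # the unique characters
print(format_like_uniq(freq))  # one 'count character' line per character
```

For the real file, pass the string read from id_baike2.txt instead of sample. sorted(freq) then gives the list of unique characters directly.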

