Friday, November 30, 2018

Load Text Into Oracle Table CLOB Column Using Python Part 2

The texts in some records are too long, e.g., more than 22k characters, which is inconvenient when we try to edit them online. I use chunkstring(line2, 900) to divide the text into multiple strings of a fixed length of 900 characters and store them in multiple records. uuid.uuid1().hex is used to generate a universally unique ID that ties the chunks of one text together.
import cx_Oracle
import uuid
def chunkstring(string, length):
    return (string[0+i:length+i] for i in range(0, len(string), length))
connection = cx_Oracle.connect("xxxxx","xxxx",\
                               "golden.maxapex.net:1521/xxxx",\
                               encoding = "UTF-8", nencoding = "UTF-8")
cur = connection.cursor()
# bind T1 as a CLOB (one setinputsizes call; a second call would override the first)
cur.setinputsizes(T1 = cx_Oracle.CLOB)
f = open('id_baike2.txt', encoding="utf8")
line1 = f.readline()
line2 = f.readline()
while line2:
    vuid = uuid.uuid1().hex
    data_list = list(chunkstring(line2,900))
    for idx, v in enumerate(data_list):
        cur.execute("INSERT INTO t_text_clob_m (ID, SID, SEQ, T1)  \
               VALUES (:ID, :SID, :SEQ, :T1)",\
                    ID = int(line1), SID = vuid, SEQ = idx+1, T1 = v)
    # create table t_text_clob_m (id number, sid raw(32), seq number, t1 clob);
    line1 = f.readline()
    line2 = f.readline()
f.close()
connection.commit()
cur.close()
connection.close()
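To put a chunked text back together, the pieces can be fetched in SEQ order and concatenated. The following is only a minimal sketch using the same placeholder connection details; reading each CLOB with the LOB object's read() method is one way to get its string value.
import collections
import cx_Oracle
connection = cx_Oracle.connect("xxxxx", "xxxx",
                               "golden.maxapex.net:1521/xxxx",
                               encoding="UTF-8", nencoding="UTF-8")
cur = connection.cursor()
# fetch the chunks of every text in order and stitch them together per ID
cur.execute("SELECT id, t1 FROM t_text_clob_m ORDER BY id, seq")
parts = collections.defaultdict(list)
for town_id, chunk in cur:
    parts[town_id].append(chunk.read())   # chunk is a cx_Oracle LOB object
full_texts = {town_id: "".join(pieces) for town_id, pieces in parts.items()}
cur.close()
connection.close()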

Thursday, November 29, 2018

Load Text Into Oracle Table CLOB Column Using Python

A text file contains Chinese texts that describe close to four thousand small towns. Each record consists of a town's ID on one line, followed by the text on the next line. The text can be as long as 22k characters, so we have to use a CLOB column to store these texts.
1
宁明县爱店镇成立于1992年3月,位于中越陆地边境公母山下,与越南谅山省禄平县接壤,
边境线长达25.5公里,...
2
林逢镇位居田东县城东部,地处右江河谷中心地带、大西南出海通道咽喉要地,东经南宁可达北海、
防城及广东沿海各地,...
From the SQL Developer client, I connect to the Oracle database and create a table with a CLOB column to store the text.
-- create a table to hold the texts
create table t_text (id number, val clob);
I load the file into the table on an Amazon RDS Oracle database using the following script.
import cx_Oracle
connection = cx_Oracle.connect("xxxxx","xxx",\
                               "orcl.xxxxxx.us-east-1.rds.amazonaws.com:1521/orcl",\
                               encoding = "UTF-8", nencoding = "UTF-8")
cur = connection.cursor()
cur.setinputsizes(VAL = cx_Oracle.CLOB)
f = open('id_baike2.txt', encoding="utf8")
# each record is a pair of lines: the town ID, then the text
line1 = f.readline()
line2 = f.readline()
while line2:
    cur.execute("INSERT INTO t_text (ID, VAL) VALUES (:ID, :VAL)", \
                ID=int(line1),VAL=line2)
    line1 = f.readline()
    line2 = f.readline()
f.close()
connection.commit()
cur.close()
connection.close()
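As a quick sanity check, the stored character lengths can be queried with DBMS_LOB.GETLENGTH and compared against the source file. This is just a sketch using the same placeholder connection details.
import cx_Oracle
connection = cx_Oracle.connect("xxxxx", "xxx",
                               "orcl.xxxxxx.us-east-1.rds.amazonaws.com:1521/orcl",
                               encoding="UTF-8", nencoding="UTF-8")
cur = connection.cursor()
# print the ID and stored character length of the first few texts
cur.execute("SELECT id, DBMS_LOB.GETLENGTH(val) FROM t_text ORDER BY id")
for town_id, text_len in cur.fetchmany(5):
    print(town_id, text_len)
cur.close()
connection.close()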

Connect to Amazon RDS Oracle Database Using Python

The following script connects to an Oracle database and executes a query.
import cx_Oracle
connection = cx_Oracle.connect("aisxxx","xxxxxxx",\
              "orcl.ctbxxxxxxx.us-east-1.rds.amazonaws.com:1521/orcl")
# username, password, host:port number/service name
cur = connection.cursor()
# query table
cur.execute("select table_name, tablespace_name from user_tables order by 1")
# print each column's name and type
for each in cur.description:
    print(each[0:2])
# print the rows
for col_1, col_2 in cur.fetchall():
    print(col_1 + " " + col_2)
cur.close()
connection.close()
The output is the following.
('TABLE_NAME', )
('TABLESPACE_NAME', )
T_TEST USERS
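The connect string can also be built with cx_Oracle.makedsn, which takes the host, port, and service name as separate arguments. A minimal sketch with the same placeholder values:
import cx_Oracle
# build the DSN from host, port, and service name instead of a single string
dsn = cx_Oracle.makedsn("orcl.ctbxxxxxxx.us-east-1.rds.amazonaws.com", 1521,
                        service_name="orcl")
connection = cx_Oracle.connect("aisxxx", "xxxxxxx", dsn)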

Wednesday, November 28, 2018

Find All Unique Characters In a Text File

There is a file containing 7,288 lines of Chinese text (see below). I want to find out all the characters that appear in the text.
$ head -2 id_baike2.txt
1
宁明县爱店镇成立于1992年3月,位于中越陆地边境公母山下,与越南谅山省禄平县接壤,
边境线长达25.5公里,东兴至那坡沿边公路贯穿其间。
爱店为国家一类口岸,是我国西南经济板块通往东南亚的陆路效能要道。
全镇下辖3个村委会19个自然屯,总面积65.79平方公里,耕地面积7225亩,
粮食播种面积2700亩,总人8055口人,其中流动人口2000余人。
爱店镇1995年被国家建设部定为全国500家小城镇建设试点之一,
2002年被定为全区小城镇建设重点镇。城镇各种基础设施日臻完善。
镇区主要街道全部实现硬化和街砖铺设。
2006年以来,爱店镇结合崇左市开展市容环境综合整治竞赛活动和城乡清洁工程,
与有关部门共同筹措资金共561.7万元投入小城镇建设,
不断扩大城镇规模和完善城镇功能,搞好环境卫生,提高口岸服务水平,
树立良好国门形象。著名的“金牛潭”风景区,
常年流水潺潺,怪石嶙峋,古木参天,景色迷人。
历代的文人名士曾在潭边题字留墨,常年吸引大批游客前来观光旅游。
镇内公母山海拔1358米,山奇水秀,景色宜人。主要有金牛潭生态旅游度假村、
公母山庄、爱店起义纪念碑、

I use the following Linux command to accomplish this. With an empty field separator (-F ""), GNU awk treats every character of a line as a separate field, so the pipeline prints one character per line, sorts them, and counts each distinct character with uniq -c.
cat  id_baike2.txt | awk -F "" '{for (i=1;i <= NF;i++) print $i;}' | sort | 
uniq -c > char_list.txt

The following shows a partial result. The number on the left is the frequency with which the character appears in the file.
      3  
  50033  
     94 !
   3773 "
     47 #
    124 $
  23974 %
     56 &
    422 '
   7333 (
   7245 )
    101 *
    672 +
   7558 ,
   9437 -
  64329 .
   1700 /
 267266 0
 239610 1
 201818 2
 108849 3
  91833 4
 111514 5
  86085 6
  69366 7
  83533 8
 100306 9
   1309 :
    758 ;
     25 <
I also use the following Python script to do the same calculation.
import collections
f = open('id_baike2.txt', encoding="utf8")
# read the whole file and count every character
text = f.read()
frq = collections.Counter(text)
print(frq.most_common(100))
f.close()
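To keep the Python result in the same shape as char_list.txt, the counts can also be written out one character per line. A small sketch follows; the output file name char_list_py.txt is just an example.
import collections
with open('id_baike2.txt', encoding="utf8") as f:
    frq = collections.Counter(f.read())
# write "count character" lines, similar to the uniq -c output
with open('char_list_py.txt', 'w', encoding="utf8") as out:
    for ch, count in sorted(frq.items()):
        out.write("%7d %s\n" % (count, ch))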