Deep Data Mining Blog: Calculate the similarity between two strings in Oracle

Friday, August 24, 2012

Calculate the similarity between two strings in Oracle

Oracle function utl_match.edit_distance_similarity (s1 IN VARCHAR2, s2 IN VARCHAR2) returns 100 for perfect match between s1 and s2, 0 for no similarity. It is case sensitive. I have used this functions to match the merchant names in debit card authorizations and claims. I noticed that this function behave slightly differently in 10g and 11g. In 10g, it is possible that it return number larger than 100 when null string is involved and I had to deal with them carefully. This function is particularly useful when performing fuzzing matching between two strings such as merchant names in two tables.

select utl_match.edit_distance_similarity('Hello', 'Hello') dis from dual;
100
/*case sensitive */
select utl_match.edit_distance_similarity('Hello', 'hello') dis from dual;
80
/*case sensitive */
select utl_match.edit_distance_similarity('Hello', 'HELLO') dis from dual;
20
select utl_match.edit_distance_similarity('Hello', 'HLO') dis from dual;
20
select utl_match.edit_distance_similarity('Hello', 'X') dis from dual;
0
select utl_match.edit_distance_similarity('Hello', ' ') dis from dual;
0
select utl_match.edit_distance_similarity('Hello', NULL) dis from dual;

0

3 comments:

Jay Zhou, PhD. said...: Somebody mentioned Smith-Waterman algorithm in his comment to check whether a substring in one sequence aligns well with a substring in the other. I deleted it by accident. I am sorry about that. If you are the person who posted the comment, please repost it. Thanks. Jay; 7:08 AM
Cristian Mesiano said...: if I'm not wrong it is based on the Jaro-Winkler distance (not a metric!!).
Around the topic: I would mention the Smith Waterman algorithm to find the best sub string alignment.
cheers
http://textanddatamining.blogspot.it/; 7:12 AM
Jay Zhou, PhD. said...: Cristian, Thank you for your comment.; 7:16 AM

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics

Popular Topics

Friday, August 24, 2012

Calculate the similarity between two strings in Oracle

3 comments: