Friday, August 24, 2012

Calculate the similarity between two strings in Oracle


Oracle function utl_match.edit_distance_similarity (s1 IN VARCHAR2, s2 IN VARCHAR2) returns 100 for perfect match between s1 and s2, 0 for no similarity. It is case sensitive. I have used this functions to match the merchant names in debit card authorizations and claims. I noticed that this function behave slightly differently in 10g and 11g. In 10g, it is possible that it return number larger than 100 when null string is involved and I had to deal with them carefully. This function is particularly useful when performing fuzzing matching between two strings such as merchant names in two tables.

select utl_match.edit_distance_similarity('Hello', 'Hello') dis from dual;
100
/*case sensitive */
select utl_match.edit_distance_similarity('Hello', 'hello') dis from dual;
 80
/*case sensitive */
select utl_match.edit_distance_similarity('Hello', 'HELLO') dis from dual;
 20
select utl_match.edit_distance_similarity('Hello', 'HLO') dis from dual;
 20
select utl_match.edit_distance_similarity('Hello', 'X') dis from dual;
0
select utl_match.edit_distance_similarity('Hello', ' ') dis from dual;
0
select utl_match.edit_distance_similarity('Hello', NULL) dis from dual;

0

3 comments:

Jay Zhou, PhD. said...

Somebody mentioned Smith-Waterman algorithm in his comment to check whether a substring in one sequence aligns well with a substring in the other. I deleted it by accident. I am sorry about that. If you are the person who posted the comment, please repost it. Thanks. Jay

Cristian Mesiano said...

if I'm not wrong it is based on the Jaro-Winkler distance (not a metric!!).
Around the topic: I would mention the Smith Waterman algorithm to find the best sub string alignment.
cheers
http://textanddatamining.blogspot.it/

Jay Zhou, PhD. said...

Cristian, Thank you for your comment.