Thursday, September 13, 2012

Fuzzy Match Data from Different Sources without Exact Keys

Merging data from difference sources is probably one of the most challenging data preparation work. For example, debit card authorizations are captured in real time at the point of sales. However, the actual settlement of the transactions may take a few days. The settlement date, amount and merchant name could all be different from those in the authorization data. It is very challenging to match transactions in these two tables. The following two tables illustrate this issue.
Bank card authorizations.
----------- -------------- ------------------------ ----------
     123457 20120304:11:00 WALMART                       85.35
     123457 20120304:17:30 ABC Restaurant                 56.2
     123457 20120305:08:25 ABC Restaurant                   23
     123457 20120305:08:36 WAL_MART                      135.2

Bank card settlements.
----------- -------------- ------------------------ ----------
     123457 20120305:00:00 WAL**                         85.35
     123457 20120305:00:00 ABC                              70
     123457 20120306:00:00 WAL-M                         135.2
     123457 20120315:00:00 A** Restaurant                   29
To match authorizations and their settlements, simple table joining will obviously not work because there are not exact keys to link them. Probably, the only way to do this is to use a fuzzy, iterative matching approach. We can measure three distances between an authorization and settlement, i.e., merchant name distance, transaction amount difference and transaction date difference.
1. Merchant name distance can be measured using the following function.
100-utl_match.EDIT_DISTANCE_SIMILARITY(upper(a.merchant_name), upper( b.merchant_name))
For exact string matches, utl_match.EDIT_DISTANCE_SIMILARITY returns 100. For no matches, it returns zero.
2. Transaction amount difference is  abs(a.amt-b.amt) .
3. Transaction date difference is trunc(b.txn_date,'DDD')-a.txn_date
Once we calculate the above three distances for each pair of records in authorization and settlement data sets, we can assign a weight to each of them and combine them into a single distance that measures the similarity between two records, i.e., record_distance= w1*merchant_name_distance + w2*amount_distance +w3*date_distance. We can identify those pairs having the least distances for each card.Those pairs are considered successful matches and are removed from the data. We repeat this matching process for the remaining data until we are happy with the results. This process can be fully automated. I have used the approach to merge about 50 millions authorizations from 3 millions debit cards. It only took about an hour.

No comments: