Thursday, September 13, 2012

Fuzzy Match Data from Different Sources without Exact Keys

Merging data from different sources is probably one of the most challenging data preparation tasks. For example, debit card authorizations are captured in real time at the point of sale, but the actual settlement of those transactions may take a few days. The settlement date, amount and merchant name can all differ from those in the authorization data, which makes matching transactions across the two tables very challenging. The following two tables illustrate the issue.
Bank card authorizations.
CARD_NUMBER TXN_DATE       MERCHANT_NAME                   AMT
----------- -------------- ------------------------ ----------
     123457 20120304:11:00 WALMART                       85.35
     123457 20120304:17:30 ABC Restaurant                 56.2
     123457 20120305:08:25 ABC Restaurant                   23
     123457 20120305:08:36 WAL_MART                      135.2

Bank card settlements.
CARD_NUMBER TXN_DATE       MERCHANT_NAME                   AMT
----------- -------------- ------------------------ ----------
     123457 20120305:00:00 WAL**                         85.35
     123457 20120305:00:00 ABC                              70
     123457 20120306:00:00 WAL-M                         135.2
     123457 20120315:00:00 A** Restaurant                   29
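For reference, the sample rows above could be loaded into Oracle with a setup along these lines (the table and column names are my own assumptions for illustration; the original data model is not shown here):

-- Hypothetical setup for the examples in this post.
create table card_auth (
  card_number   number,
  txn_date      date,           -- authorization timestamp, includes time of day
  merchant_name varchar2(50),
  amt           number(10,2)
);

create table card_settle (
  card_number   number,
  txn_date      date,           -- settlement date, time component is midnight
  merchant_name varchar2(50),
  amt           number(10,2)
);

-- e.g. the first authorization row:
insert into card_auth values
  (123457, to_date('20120304:11:00','YYYYMMDD:HH24:MI'), 'WALMART', 85.35);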
To match authorizations with their settlements, a simple table join obviously will not work because there are no exact keys to link them. Probably the only practical way is a fuzzy, iterative matching approach. We can measure three distances between an authorization and a settlement: merchant name distance, transaction amount difference and transaction date difference (a query that computes all three is sketched after the list below).
1. Merchant name distance can be measured using the following function.
100 - utl_match.EDIT_DISTANCE_SIMILARITY(upper(a.merchant_name), upper(b.merchant_name))
For an exact string match, utl_match.EDIT_DISTANCE_SIMILARITY returns 100; the poorer the match, the closer it gets to zero. Subtracting it from 100 turns the similarity into a distance that is 0 for identical names.
2. Transaction amount difference is abs(a.amt - b.amt).
3. Transaction date difference is trunc(b.txn_date,'DDD') - trunc(a.txn_date,'DDD'), i.e., the number of whole days between the settlement and the authorization.
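Putting the three measures together, a candidate-pair query might look like the sketch below. It assumes the card_auth and card_settle tables defined earlier and, to keep the join manageable, only pairs a settlement with authorizations from the preceding 14 days (the window is my own cutoff, not a rule from the post).

-- Compute the three distances for every authorization/settlement pair on the same card.
select a.rowid  as auth_rid,
       b.rowid  as settle_rid,
       a.card_number,
       100 - utl_match.edit_distance_similarity(upper(a.merchant_name),
                                                upper(b.merchant_name)) as name_dist,
       abs(a.amt - b.amt)                                               as amt_dist,
       trunc(b.txn_date,'DDD') - trunc(a.txn_date,'DDD')                as date_dist
  from card_auth   a
  join card_settle b
    on b.card_number = a.card_number
   and b.txn_date between trunc(a.txn_date,'DDD')
                       and trunc(a.txn_date,'DDD') + 14;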
Once we calculate the above three distances for each pair of records in the authorization and settlement data sets, we can assign a weight to each of them and combine them into a single distance that measures the similarity between two records, i.e., record_distance = w1*merchant_name_distance + w2*amount_distance + w3*date_distance. For each card we then identify the pairs with the smallest distances; those pairs are considered successful matches and are removed from the data. We repeat this matching process on the remaining records until we are satisfied with the results. The process can be fully automated. I have used this approach to merge about 50 million authorizations from 3 million debit cards, and it only took about an hour.
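As a sketch of one pass of this process (the weights below are placeholders I picked for illustration; in practice they would be tuned, since the three distances are on very different scales), the combined distance can be computed and the closest settlement kept for each authorization with an analytic ranking:

-- One matching pass: rank candidate pairs by the weighted distance and keep the
-- closest settlement per authorization. Weights 1.0 / 2.0 / 5.0 are illustrative only.
with pair_dist as (
  select a.rowid  as auth_rid,
         b.rowid  as settle_rid,
         1.0 * (100 - utl_match.edit_distance_similarity(upper(a.merchant_name),
                                                         upper(b.merchant_name)))
       + 2.0 * abs(a.amt - b.amt)
       + 5.0 * (trunc(b.txn_date,'DDD') - trunc(a.txn_date,'DDD')) as record_distance
    from card_auth   a
    join card_settle b
      on b.card_number = a.card_number
     and b.txn_date between trunc(a.txn_date,'DDD')
                         and trunc(a.txn_date,'DDD') + 14
)
select auth_rid, settle_rid, record_distance
  from (select p.*,
               row_number() over (partition by auth_rid
                                  order by record_distance) as rn
          from pair_dist p)
 where rn = 1;

The rowids returned here let the matched authorizations and settlements be flagged or deleted before the query is rerun on whatever remains unmatched. Note that a single pass like this can assign the same settlement to more than one authorization; resolving such collisions (for example, also keeping only the closest pair per settlement) is part of the iteration.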
