RFM analysis is usually performed based on a "transaction" data set. Each record represents a transaction at specific point in time. For example, the following are a sequence of credit card transactions.
We talk about recency factor first. In the following table, transaction 4 as the highlighted is most suspicious. Not only it is an expensive electronics purchase but also the immediate transaction before it is a $1 gas station purchase that happens 11 minutes ago. Fraudsters often use gas station to test a credit cards to make sure they work before they make big purchase at a store.
CARD_NUMBER TXN_ID TXN_DATE TXN_TYPE TXN_AMT
----------- ---------- ---------- ------------ ----------
123457 1 0304:11:00 Grocery 85.35
123457 2 0304:17:30 Electronic 2100.75
123457 3 0305:08:25 Gas Station 1
123457 4 0305:08:36 Electronic 1435.2
123457 5 0305:11:23 Grocery 55.3
Thus, the recency is an important consideration. It provides the context information about a transaction. When we build predictive models or write ad hoc fraud detection rules, including recency factors can greatly improve detection accuracy.
Another example is to predict if a prospect responds to a direct mail offering. The following table shows a prospect's responses in 5 weeks to 5 products. The reason that she responds to offering of product C in week 3 could be related to the offering of B one week before. For example, a person will more likely accept a credit card with 6.99% APR if the she is offered 17.99% APR earlier. Again, the recency factors are important.
User_ID User_Category State Week Product_ID Response
1001 A MA 1 A No
1001 A MA 2 B No
1001 A MA 3 C Yes
1001 A MA 4 D No
1001 A MA 5 E No
One of the convenient ways to calculate recency factors is SQL analytic function lag. For example, the following query calculates three recency factors for credit transactions, i.e., time since recent transaction, recent transaction type and recent transaction amount. The "Partition by card_number" clause means the calculation is done independently for each card number.
( txn_date-lag(txn_date ) over(partition by card_number order by txn_date)) time_since_recent_txn,
lag(txn_type) over(partition by card_number order by txn_date) last_recent_type ,
lag(txn_amt) over(partition by card_number order by txn_date) last_recent_amt
from TBL_CARD_TXN a order by txn_date;
When there are time/date variables in the data, it is extremely important to take advantage of them by using RFM analysis to create derived variables. Those variables can then be used as inputs to predictive models. Please see another post Build Best Direct Marketing Predictive Models.