A number of years ago, I was applying for a PhD statistician position at an online advertising company. As part of the screening process, I was given a project. The following is a simple description of the problems.

Let's take a look at the following records. Over 26 weeks, ACME emails subscriber 1 one campaign per week, drawn from 10 campaigns. The records are ordered by WEEK_NUMBER. USER_CAT is a demographic code, STATE_ID is the subscriber's home state, and CAMPAIGN_ID is a number from 1 to 10. A RESPONSE of 1 means the subscriber responded to the campaign.

SQL> select * from TBL_CAMPAIGN where subscriber_id=1 order by week_number;

WEEK_NUMBER SUBSCRIBER_ID USER_CAT   STATE_ID GENDER CAMPAIGN_ID   RESPONSE
----------- ------------- -------- ---------- ------ ----------- ----------
          1             1 B                 2 M                1          1
          2             1 B                 2 M                2          0
          3             1 B                 2 M                3          0
          4             1 B                 2 M                4          0
          5             1 B                 2 M                5          0
          6             1 B                 2 M                6          1
          7             1 B                 2 M                7          0
          8             1 B                 2 M                8          0
          9             1 B                 2 M                9          0
         10             1 B                 2 M               10          0
         11             1 B                 2 M                1          0
         12             1 B                 2 M                2          0
         13             1 B                 2 M                3          0
         14             1 B                 2 M                4          0
         15             1 B                 2 M                5          0
         16             1 B                 2 M                6          1
         17             1 B                 2 M                7          0
         18             1 B                 2 M                8          0
         19             1 B                 2 M                9          0
         20             1 B                 2 M               10          1
         21             1 B                 2 M                1          0
         22             1 B                 2 M                2          0
         23             1 B                 2 M                3          0
         24             1 B                 2 M                4          1
         25             1 B                 2 M                5          0
         26             1 B                 2 M                6          1

26 rows selected.

There are 26 weeks of historical records like the above for 100,000 subscribers. Based on the data, can you answer the following questions?

For week 27, suppose ACME sends emails to only 25% of its subscriber base:

(1) Which subscribers would you send email to?

(2) Which campaign(s) would you deliver to them?

(3) What do you expect the response rate to be?
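To make the link between the three questions concrete, here is a minimal Python sketch. It assumes we already have, from some model, an estimated response probability for every subscriber and campaign in week 27. The function name `plan_week`, the input layout, and the policy itself (send each subscriber their single most promising campaign, then keep the top 25%) are illustrative assumptions, not the method I used in the project.

```python
import math


def plan_week(probs, fraction=0.25):
    """Pick who to email, which campaign to send, and the expected response rate.

    probs: {subscriber_id: {campaign_id: estimated response probability}}
           (hypothetical input; the values are assumed to be real probabilities)
    fraction: share of the subscriber base to email
    """
    # Question 2: for each subscriber, keep the campaign with the highest probability.
    best = {
        sid: max(by_campaign.items(), key=lambda kv: kv[1])
        for sid, by_campaign in probs.items()
    }
    # Question 1: rank subscribers by that best probability and keep the top fraction.
    ranked = sorted(best.items(), key=lambda kv: kv[1][1], reverse=True)
    n_send = math.floor(len(ranked) * fraction)
    chosen = ranked[:n_send]
    # Question 3: the expected response rate is the mean probability of the chosen group.
    expected_rate = sum(p for _, (_, p) in chosen) / n_send if n_send else 0.0
    plan = {sid: cid for sid, (cid, _) in chosen}
    return plan, expected_rate


# Toy example with four subscribers and two campaigns (made-up numbers).
toy_probs = {
    1: {1: 0.10, 2: 0.30},
    2: {1: 0.05, 2: 0.02},
    3: {1: 0.50, 2: 0.40},
    4: {1: 0.20, 2: 0.25},
}
plan, rate = plan_week(toy_probs, fraction=0.25)
print(plan, rate)
```

The expected rate is only as good as the probabilities fed in, so everything hinges on how those estimates are produced.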

Solving these problems poses several challenges:

1. It is not simply building predictive models based on static variables such as gender, home state, etc. We need to consider variables that capture the dynamic nature of a subscriber's past responses to campaigns. Things to consider include:

a). Is a subscriber who responded recently more likely to respond?

b). Does the sequence of offers affect the response rate? For example, if a subscriber is first sent campaign 3, say a new credit card with an APR of 19%, and then one week later campaign 4, a card with an APR of 6%, we would expect him to be more likely to respond to campaign 4.

2. We need to find out which of the ten campaigns a subscriber is most likely to respond to.

3. We need to accurately estimate the response rate. The scores returned by some predictive models are not necessarily probabilities.
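As an illustration of the dynamic variables mentioned in challenge 1, the following Python sketch derives a few recency and sequence features from one subscriber's history. The function `derive_features`, the feature names, and the 4-week window are hypothetical choices made for this example; they are not the variables from the original project.

```python
def derive_features(history, target_week):
    """Derive recency and sequence features for one subscriber.

    history: list of (week_number, campaign_id, response) tuples, in any order
    target_week: the week to predict; only earlier weeks are used
    """
    past = sorted(row for row in history if row[0] < target_week)
    responded_weeks = [week for week, _, resp in past if resp == 1]
    last_response_week = responded_weeks[-1] if responded_weeks else None
    return {
        # Recency (question a): how long ago, and how often, did the subscriber respond?
        "weeks_since_last_response": (
            target_week - last_response_week
            if last_response_week is not None
            else None
        ),
        "responses_last_4_weeks": sum(
            resp for week, _, resp in past if week >= target_week - 4
        ),
        "response_rate_to_date": len(responded_weeks) / len(past) if past else None,
        # Sequence (question b): what was offered immediately before, and was it taken?
        "previous_campaign_id": past[-1][1] if past else None,
        "previous_response": past[-1][2] if past else None,
    }


# Subscriber 1 from the table above: campaigns cycle 1..10, responses in six weeks.
responded = {1, 6, 16, 20, 24, 26}
subscriber_1 = [
    (week, (week - 1) % 10 + 1, 1 if week in responded else 0)
    for week in range(1, 27)
]
print(derive_features(subscriber_1, target_week=27))
```

Features like these can be computed for every subscriber and week, then joined with the static variables (gender, home state, and so on) before fitting a model.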

After extensive study, I worked out approaches to these problems and drew a number of conclusions:

1. Finding the best derived variables is the most important step for building a successful model.

2. All models, from simple logistic regression to sophisticated gradient boosting trees, perform reasonably well.

3. Pre-modeling tasks, e.g., data loading, merging, and calculating derived variables, took more than 85% of the effort.