Saturday, April 05, 2014

More on Zip Code and Predictive Models - Variables of High Cardinality

In the post Zip Code and Predictive Models, we talked about how to use zip codes in predictive models. Zip codes have many distinct values. Many other variables, including MCC (Merchant Category Code), credit card transaction terminal ID, and IP address, have similar characteristics. There is a term for the number of distinct values a variable takes: its cardinality. High cardinality variables have many unique values. In the extreme case, a high cardinality variable has a different value for each data record and practically becomes a unique identifier. Such extremely high cardinality variables are not useful as inputs to a predictive model. For example, customer names are nearly unique. If we include names as one of the input variables, the model will likely perform extremely well on the training data set simply by memorizing the association between each customer name and the target variable. However, the model will perform poorly on new data that contain unseen names.
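To make the idea of cardinality concrete, here is a minimal sketch that counts the distinct values of each variable and compares that count with the number of records. The function name `cardinality_report` and the sample records are made up for illustration. A ratio close to 1.0 suggests the variable behaves like a unique identifier.

```python
def cardinality_report(records, columns):
    """Return {column: (distinct_count, distinct_ratio)} for a list of dict records."""
    n = len(records)
    report = {}
    for col in columns:
        # A set keeps one copy of each value, so its size is the cardinality.
        distinct = len({row[col] for row in records})
        report[col] = (distinct, distinct / n if n else 0.0)
    return report


# Hypothetical sample data, for illustration only.
records = [
    {"customer_name": "Alice Smith", "mcc": "5411", "zip": "02139"},
    {"customer_name": "Bob Jones", "mcc": "5411", "zip": "02139"},
    {"customer_name": "Carol Lee", "mcc": "5812", "zip": "10001"},
    {"customer_name": "Dan Brown", "mcc": "5812", "zip": "94105"},
]

report = cardinality_report(records, ["customer_name", "mcc", "zip"])
for column, (distinct, ratio) in report.items():
    print(f"{column}: {distinct} distinct values, ratio {ratio:.2f}")
```

In this toy data set, customer_name has a different value on every record (ratio 1.00), so it acts as an identifier, while mcc repeats across records.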

Relatively high cardinality variables such as MCC, credit card transaction terminal ID, and IP address can be handled using the same methodology described in Zip Code and Predictive Models, which categorizes their values into a smaller number of groups. For tree-based models, it is not necessary to do this for high cardinality variables.
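As a rough illustration of reducing a variable to a smaller number of groups, the sketch below keeps values that occur often enough in the training data and maps everything else, including values never seen in training, to a single "OTHER" group. This frequency-based grouping is one common approach and is an assumption here; it is not necessarily the method used in the earlier zip code post. The function names, the threshold `min_count`, and the sample MCC values are hypothetical.

```python
from collections import Counter


def fit_frequent_values(train_values, min_count=2):
    """Learn which values are frequent enough in the training data to keep."""
    counts = Counter(train_values)
    return {value for value, count in counts.items() if count >= min_count}


def apply_grouping(values, frequent_values, other_label="OTHER"):
    """Replace rare or unseen values with a single catch-all label."""
    return [v if v in frequent_values else other_label for v in values]


# Hypothetical MCC values, for illustration only.
train_mcc = ["5411", "5411", "5812", "5411", "7995", "5812", "4121"]
new_mcc = ["5411", "9999", "5812"]  # "9999" never appears in training

frequent = fit_frequent_values(train_mcc, min_count=2)
train_grouped = apply_grouping(train_mcc, frequent)
new_grouped = apply_grouping(new_mcc, frequent)

print(train_grouped)
print(new_grouped)
```

Learning the set of frequent values on the training data only, and then applying it unchanged to new data, means an unseen value falls into the same "OTHER" group instead of breaking the model.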
