Deep Data Mining Blog: More On Zip Code and Predictive Models

Saturday, April 05, 2014

More On Zip Code and Predictive Models - Variable of High Cardinality

In the post Zip Code and Predictive Models, we talked about how to use zip codes in predictive models. Zip codes have many distinct values. Many other variables, including MCC (Merchant Category Code), credit card transaction terminal ID, and IP address, have similar characteristics. There is a term for the number of distinct values a variable takes: cardinality. High cardinality variables have many unique values. In the extreme case, a high cardinality variable takes a different value for each data record and practically becomes a unique identifier. Such extremely high cardinality variables are not useful as inputs to a predictive model. For example, customer names are nearly unique. If we include names as one of the input variables, the model will likely perform extremely well on the training data set by simply memorizing the association between customer name and target variable. However, the model will perform poorly on new data that contain unseen names.
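To make the idea of cardinality concrete, here is a minimal Python sketch that counts distinct values per variable and reports the ratio of distinct values to records. The function name cardinality_report and the sample records are made up for illustration. A ratio close to 1.0 suggests the variable behaves like a unique identifier.

```python
def cardinality_report(rows, columns):
    """Return {column: (distinct_count, distinct_ratio)} for a list of dict records."""
    n = len(rows)
    report = {}
    for col in columns:
        # A set keeps one copy of each value, so its size is the cardinality.
        distinct = len({row[col] for row in rows})
        report[col] = (distinct, distinct / n if n else 0.0)
    return report


# Hypothetical sample data, invented for this example.
records = [
    {"name": "Alice Wong", "zip": "02139", "mcc": "5411"},
    {"name": "Bob Smith", "zip": "02139", "mcc": "5411"},
    {"name": "Carol Diaz", "zip": "10001", "mcc": "5812"},
    {"name": "Dan Lee", "zip": "10001", "mcc": "5411"},
    {"name": "Eve Patel", "zip": "94105", "mcc": "5812"},
    {"name": "Frank Ito", "zip": "94105", "mcc": "5411"},
]

report = cardinality_report(records, ["name", "zip", "mcc"])
for column, (distinct, ratio) in report.items():
    print(f"{column}: {distinct} distinct values, ratio {ratio:.2f}")
```

In this toy data set every name is different, so the name column has a ratio of 1.0 and would only let a model memorize the training records.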

Relatively high cardinality variables such as MCC, credit card transaction terminal ID, and IP address can be handled using the same methodology described in Zip Code and Predictive Models, which categorizes their values into a smaller number of groups. For tree-based models, it is not necessary to do this for high cardinality variables.
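As a rough illustration of categorizing values into a smaller number of groups, the Python sketch below ranks each value by its observed target rate and splits the ranked values into a few groups. This is only one possible approach and may differ from the method in the earlier post. The function names, the min_count threshold, the "OTHER" fallback group, and the sample MCC data are all assumptions made for this example.

```python
from collections import defaultdict


def build_group_map(values, targets, n_groups=3, min_count=2):
    """Map each frequent value to a group label based on its target rate."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for value, target in zip(values, targets):
        counts[value] += 1
        positives[value] += target

    # Values seen fewer than min_count times are left out of the map;
    # their target rate is too noisy to rank reliably.
    frequent = [v for v in counts if counts[v] >= min_count]

    # Sort by target rate (ties broken by value so the result is deterministic).
    frequent.sort(key=lambda v: (positives[v] / counts[v], v))

    group_map = {}
    for rank, value in enumerate(frequent):
        # Spread the ranked values evenly across n_groups labels.
        group_map[value] = f"G{rank * n_groups // len(frequent)}"
    return group_map


def assign_group(value, group_map, default="OTHER"):
    """Look up a value's group; rare or unseen values fall back to the default."""
    return group_map.get(value, default)


# Hypothetical MCC values with a binary target, invented for this example.
mcc = ["5411", "5411", "5411", "5812", "5812", "7995", "7995", "4829"]
target = [0, 0, 0, 0, 1, 1, 1, 1]

group_map = build_group_map(mcc, target)
print(group_map)
print(assign_group("9999", group_map))
```

The group label, rather than the raw value, is then used as the model input. Because unseen values map to the fallback group, the model can still score new records whose values never appeared in the training data.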

