Sunday, September 02, 2012

Dealing with missing values in data mining

We have found that when analyzing large amount of data such as financial transactions, medical claims,  cell phone calls, and credit information, missing values are common. Simply ignoring data points with missing values is recommended in most cases. There are a number of ways to handle missing values for categorical and continuous variables.

The following examples illustrate how to replace missing values in categorical variables.
Original data:
Record_ID   Variable_X
 1                     A
 2                     B
 3                     A
........................
95                  Missing

Method 1. Replace missing value with a special value (e.g. X)
Record_ID   Variable_X
 1                     A
 2                     B
 3                     A
........................
95                   X
In Oracle, this can be done easily using function nvl(). The following is the SQL scripts to perform the replacement.
select nvl(Variable_X,'X')  as Variable_X, ..... from original_table;
Method 2. Replace missing value with the most frequent value.
Record_ID   Variable_X
 1                     A
 2                     B
 3                     A
........................
95                   A
The following is the SQL scripts to perform the replacement.
select nvl(Variable_X,'A')  as Variable_X, ..... from original_table;

The following examples illustrate how to replace missing values in continuous variables.
Original data:
Record_ID   Variable_X
 1                     1.3
 2                     0.5
 3                     2.9
........................
95                  Missing

Method 3. Replace missing value with mean or median value.
Record_ID   Variable_X
 1                     1.3
 2                     0.5
 3                     2.9
........................
95                  1.56
The following is the SQL scripts to perform the replacement.
select  nvl(Variable_X, 1.56) as Variable_X, ..... from original_table;

Method 4. We first convert continuous variable into categorical variable. This is also called binning. I will discuss binning in another post. We define missing as a special category.
The following is the SQL scripts to perform the above logic.
case when Variable_X <=1.2 then 'A'
        when Variable_X <=1.8 then 'B'
        when Variable_X  > 1.8 then 'C'
       else 'Z' end  Variable_X_BIN

New Variable Variable_X_BIN.
Record_ID   Variable_X  --> Variable_X_BIN
 1                     1.3           ---> B
 2                     0.5           ---> A
 3                     2.9           --->B
........................
95                  missing      ---->Z

There are more sophisticated methods to impute missing values using multiple variable models. But they are rarely used in real world applications.

No comments: