Saturday, January 18, 2014

Impute Missing Values Using Clustering Algorithms

Problem

Sometimes, there are missing values in independent variables. Simply dropping those records with missing values is usually not a good idea. How do we replace the missing values with something meaningful?

Solution

A common way to replace missing values is to use mean or median values. Here we present a more sophisticated method to data imputation based clustering. The idea is to first find clusters in the data sets based on variables that do not have missing values. Then for the missing values, we can find out which cluster the records belongs to. We replace the missing values with the means or medians of all non-missing value points falling into the same cluster. The diagram below describe this idea. The missing values in Z for record 4 will be replaced by the mean or median value of Z for records 1, 2, 3, etc., that fall into the same cluster.
The advantage of this approach is that clusters are built based on multiple variables. We can do this using SQL within a database.

The following table shows that income is missing sometimes.

```select * from TBL_CUSTOMER3 where rownum <=10;
CUST_ID     INCOME  YEARS_RES HOME_VALUE SEX        MARRIED
---------- ---------- ---------- ---------- ---------- ----------
7422   87499.50          2        117 "M"        "M"
3356                    15         80 "M"        "U"
4782   62499.50          3         91 "M"        "M"
7333   87499.50          7         85 "M"        "M"
890   42999.50          1         58 "F"        "U"
6401   87499.50          5        128 "M"        "M"
2356   87499.50          4         96 "M"        "M"
1638   87499.50         13        152 "M"        "M"
6713   62499.50          6         49 "M"        "U"
3674   87499.50          3        119 "M"        "M"

10 rows selected.
```
Step 1. Prepare a data set by dropping income.
```SQL> create view v_training_km as select cust_id, years_res,
home_value, sex, married from TBL_CUSTOMER3;
```
Step 2. Build a k-means clustering model based on years_res, home_value, sex and married.
```BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name          => 'KM_MODEL',
mining_function     => dbms_data_mining.clustering,
data_table_name     => 'v_training_km',
case_id_column_name => 'cust_id');
END;
/
```
We can apply the k-mean clustering model to the data and generate the clusters.
```SQL> select a.*, cluster_id(KM_MODEL using *) cls_id
from TBL_CUSTOMER3 a where rownum<=10;

CUST_ID     INCOME  YEARS_RES HOME_VALUE SEX        MARRIED        CLS_ID
---------- ---------- ---------- ---------- ---------- ---------- ----------
7422   87499.50          2        117 "M"        "M"                12
3356                    15         80 "M"        "U"                18
4782   62499.50          3         91 "M"        "M"                18
7333   87499.50          7         85 "M"        "M"                18
890   42999.50          1         58 "F"        "U"                 8
6401   87499.50          5        128 "M"        "M"                12
2356   87499.50          4         96 "M"        "M"                18
1638   87499.50         13        152 "M"        "M"                17
6713   62499.50          6         49 "M"        "U"                 8
3674   87499.50          3        119 "M"        "M"                12

10 rows selected.
```
Step 3. Generate imputed values for missing values based on cluster means.
```SQL> create view v_imputed_data as with tbl_cls as
(select a.*, cluster_id(KM_MODEL using *) as cls_id from TBL_CUSTOMER3 a )
select a.*, case when income is not null then income
else avg(income) over(partition by cls_id) end as imputed_income
from tbl_cls a ;

View created.
```
We can take a look at the missing values and their imputed values.
```SQL> select CUST_ID,INCOME,IMPUTED_INCOME,CLS_ID
from v_imputed_data where INCOME is null;

CUST_ID     INCOME IMPUTED_INCOME     CLS_ID
---------- ---------- -------------- ----------
6601                54704.0455          8
4898                64394.2368          9
6237                78894.6613         12
6111                78894.6613         12
6397                78894.6613         12
4288                72316.5732         13
2162                72316.5732         13
5296                72316.5732         13
3584                72316.5732         13
2114                     79687         16
3356                70261.2188         18
462                70261.2188         18
4474                70261.2188         18
73                70261.2188         18
3484                70261.2188         18
5820                57488.1364         19

16 rows selected.
```