Saturday, January 18, 2014

Impute Missing Values Using Clustering Algorithms

Problem

Sometimes, there are missing values in independent variables. Simply dropping those records with missing values is usually not a good idea. How do we replace the missing values with something meaningful?

Solution

A common way to replace missing values is to use mean or median values. Here we present a more sophisticated method to data imputation based clustering. The idea is to first find clusters in the data sets based on variables that do not have missing values. Then for the missing values, we can find out which cluster the records belongs to. We replace the missing values with the means or medians of all non-missing value points falling into the same cluster. The diagram below describe this idea. The missing values in Z for record 4 will be replaced by the mean or median value of Z for records 1, 2, 3, etc., that fall into the same cluster.
The advantage of this approach is that clusters are built based on multiple variables. We can do this using SQL within a database.

The following table shows that income is missing sometimes.

select * from TBL_CUSTOMER3 where rownum <=10;
   CUST_ID     INCOME  YEARS_RES HOME_VALUE SEX        MARRIED
---------- ---------- ---------- ---------- ---------- ----------
      7422   87499.50          2        117 "M"        "M"
      3356                    15         80 "M"        "U"
      4782   62499.50          3         91 "M"        "M"
      7333   87499.50          7         85 "M"        "M"
       890   42999.50          1         58 "F"        "U"
      6401   87499.50          5        128 "M"        "M"
      2356   87499.50          4         96 "M"        "M"
      1638   87499.50         13        152 "M"        "M"
      6713   62499.50          6         49 "M"        "U"
      3674   87499.50          3        119 "M"        "M"

10 rows selected.
Step 1. Prepare a data set by dropping income.
SQL> create view v_training_km as select cust_id, years_res, 
home_value, sex, married from TBL_CUSTOMER3;
Step 2. Build a k-means clustering model based on years_res, home_value, sex and married.
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'KM_MODEL',
    mining_function     => dbms_data_mining.clustering,
    data_table_name     => 'v_training_km',
    case_id_column_name => 'cust_id');
END;
/
We can apply the k-mean clustering model to the data and generate the clusters.
SQL> select a.*, cluster_id(KM_MODEL using *) cls_id 
from TBL_CUSTOMER3 a where rownum<=10;

   CUST_ID     INCOME  YEARS_RES HOME_VALUE SEX        MARRIED        CLS_ID
---------- ---------- ---------- ---------- ---------- ---------- ----------
      7422   87499.50          2        117 "M"        "M"                12
      3356                    15         80 "M"        "U"                18
      4782   62499.50          3         91 "M"        "M"                18
      7333   87499.50          7         85 "M"        "M"                18
       890   42999.50          1         58 "F"        "U"                 8
      6401   87499.50          5        128 "M"        "M"                12
      2356   87499.50          4         96 "M"        "M"                18
      1638   87499.50         13        152 "M"        "M"                17
      6713   62499.50          6         49 "M"        "U"                 8
      3674   87499.50          3        119 "M"        "M"                12

10 rows selected.
Step 3. Generate imputed values for missing values based on cluster means.
SQL> create view v_imputed_data as with tbl_cls as 
(select a.*, cluster_id(KM_MODEL using *) as cls_id from TBL_CUSTOMER3 a ) 
select a.*, case when income is not null then income 
else avg(income) over(partition by cls_id) end as imputed_income 
from tbl_cls a ;

View created.
We can take a look at the missing values and their imputed values.
SQL> select CUST_ID,INCOME,IMPUTED_INCOME,CLS_ID 
from v_imputed_data where INCOME is null;

   CUST_ID     INCOME IMPUTED_INCOME     CLS_ID
---------- ---------- -------------- ----------
      6601                54704.0455          8
      4898                64394.2368          9
      6237                78894.6613         12
      6111                78894.6613         12
      6397                78894.6613         12
      4288                72316.5732         13
      2162                72316.5732         13
      5296                72316.5732         13
      3584                72316.5732         13
      2114                     79687         16
      3356                70261.2188         18
       462                70261.2188         18
      4474                70261.2188         18
        73                70261.2188         18
      3484                70261.2188         18
      5820                57488.1364         19

16 rows selected.

2 comments:

Logesh Waran said...
This comment has been removed by the author.
Logesh Waran said...

Thanks for sharing this informative blog on data mining! Not only for learners or beginners, It will be more helpful blog for professionals too.