Problem
Sometimes, there are missing values in independent variables. Simply dropping those records with missing values is usually not a good idea. How do we replace the missing values with something meaningful?
Solution
A common way to replace missing values is to use mean or median values. Here we present a more sophisticated method to data imputation based clustering. The idea is to first find clusters in the data sets based on variables that do not have missing values. Then for the missing values, we can find out which cluster the records belongs to. We replace the missing values with the means or medians of all non-missing value points falling into the same cluster. The diagram below describe this idea. The missing values in Z for record 4 will be replaced by the mean or median value of Z for records 1, 2, 3, etc., that fall into the same cluster. The advantage of this approach is that clusters are built based on multiple variables. We can do this using SQL within a database. The following table shows that income is missing sometimes.
select * from TBL_CUSTOMER3 where rownum <=10; CUST_ID INCOME YEARS_RES HOME_VALUE SEX MARRIED ---------- ---------- ---------- ---------- ---------- ---------- 7422 87499.50 2 117 "M" "M" 3356 15 80 "M" "U" 4782 62499.50 3 91 "M" "M" 7333 87499.50 7 85 "M" "M" 890 42999.50 1 58 "F" "U" 6401 87499.50 5 128 "M" "M" 2356 87499.50 4 96 "M" "M" 1638 87499.50 13 152 "M" "M" 6713 62499.50 6 49 "M" "U" 3674 87499.50 3 119 "M" "M" 10 rows selected.Step 1. Prepare a data set by dropping income.
SQL> create view v_training_km as select cust_id, years_res, home_value, sex, married from TBL_CUSTOMER3;Step 2. Build a k-means clustering model based on years_res, home_value, sex and married.
BEGIN DBMS_DATA_MINING.CREATE_MODEL( model_name => 'KM_MODEL', mining_function => dbms_data_mining.clustering, data_table_name => 'v_training_km', case_id_column_name => 'cust_id'); END; /We can apply the k-mean clustering model to the data and generate the clusters.
SQL> select a.*, cluster_id(KM_MODEL using *) cls_id from TBL_CUSTOMER3 a where rownum<=10; CUST_ID INCOME YEARS_RES HOME_VALUE SEX MARRIED CLS_ID ---------- ---------- ---------- ---------- ---------- ---------- ---------- 7422 87499.50 2 117 "M" "M" 12 3356 15 80 "M" "U" 18 4782 62499.50 3 91 "M" "M" 18 7333 87499.50 7 85 "M" "M" 18 890 42999.50 1 58 "F" "U" 8 6401 87499.50 5 128 "M" "M" 12 2356 87499.50 4 96 "M" "M" 18 1638 87499.50 13 152 "M" "M" 17 6713 62499.50 6 49 "M" "U" 8 3674 87499.50 3 119 "M" "M" 12 10 rows selected.Step 3. Generate imputed values for missing values based on cluster means.
SQL> create view v_imputed_data as with tbl_cls as (select a.*, cluster_id(KM_MODEL using *) as cls_id from TBL_CUSTOMER3 a ) select a.*, case when income is not null then income else avg(income) over(partition by cls_id) end as imputed_income from tbl_cls a ; View created.We can take a look at the missing values and their imputed values.
SQL> select CUST_ID,INCOME,IMPUTED_INCOME,CLS_ID from v_imputed_data where INCOME is null; CUST_ID INCOME IMPUTED_INCOME CLS_ID ---------- ---------- -------------- ---------- 6601 54704.0455 8 4898 64394.2368 9 6237 78894.6613 12 6111 78894.6613 12 6397 78894.6613 12 4288 72316.5732 13 2162 72316.5732 13 5296 72316.5732 13 3584 72316.5732 13 2114 79687 16 3356 70261.2188 18 462 70261.2188 18 4474 70261.2188 18 73 70261.2188 18 3484 70261.2188 18 5820 57488.1364 19 16 rows selected.