Sunday, March 30, 2014

Oracle On The Fly Model- Clustering

One of the features in Oracle 12c is the ability of building and running models on the fly ( as opposed to the conventional two steps of building a persistent model first and then applying it to the data). On the fly predictive models provides data miners powerful tools for accomplishing sophisticated tasks with easy.
I will use on the fly k-means clustering as an example. Take a look at the following simple consumer data. Cust_ID is the unique identifier for a consumer.

SQL> select * from TBL_CUSTOMER3 where rownum <=10;

   CUST_ID     INCOME  YEARS_RES HOME_VALUE SEX        MARRIED
---------- ---------- ---------- ---------- ---------- ----------
      7422    87499.5          2        117 "M"        "M"
      3356                    15         80 "M"        "U"
      4782    62499.5          3         91 "M"        "M"
      7333    87499.5          7         85 "M"        "M"
       890    42999.5          1         58 "F"        "U"
      6401    87499.5          5        128 "M"        "M"
      2356    87499.5          4         96 "M"        "M"
      1638    87499.5         13        152 "M"        "M"
      6713    62499.5          6         49 "M"        "U"
      3674    87499.5          3        119 "M"        "M"

10 rows selected.

To understand consumers better, usually we want to categorize consumers into small number of groups, instead of dealing with them individually. We can using Oracle predictive function cluster_id() over() to find clusters on the fly as shown below. Here, cluster_id(into 5 using income, YEARS_RES, HOME_VALUE) returns 5 cluster identifiers using the default clustering algorithms (K-means) based on variables income, YEARS_RES, HOME_VALUE. Missing input variables are OK as they will be replaced by the means. Please notice that the numbering of cluster identifier is not important at all,i.e., as long as clusters are different we can number cluster id arbitrarily.

SQL> select * from (select cust_id, 
   cluster_id(into 5 using income, YEARS_RES, HOME_VALUE ) over() cid 
from TBL_CUSTOMER3) where rownum <=10;

   CUST_ID        CID
---------- ----------
      7422          9
      3356          6
      4782          7
      7333          8
       890          5
      6401          9
      2356          9
      1638          8
      6713          7
      3674          9

10 rows selected.

We can calculate some summary information about each group using the following query to gain insight into the data.
SQL> select cid, count(*), avg(income) avg_income, 
avg(years_res) avg_years_res, 
avg(home_value) avg_home_value from 
(select cust_id, income, years_res, home_value, 
cluster_id(into 5 using income, YEARS_RES, HOME_VALUE ) over() cid 
from TBL_CUSTOMER3) group by cid order by count(*) desc;

       CID   COUNT(*)  AVG_INCOME AVG_YEARS_RES AVG_HOME_VALUE
---------- ---------- ----------- ------------- --------------
         7        152    62499.50          3.88          92.89
         9        126    87499.50          2.75         117.98
         5        110    47254.05          2.58          67.37
         8         65    87499.50          9.15         105.58
         6         31    55749.50         13.45          82.39
The clustering models are generated on the fly. When the queries finish, the models are gone. Thus it is very "clean" to run queries involving on the fly models.

Please notice that in the above query, we combine predictive model with standard SQL (group by cid). This is extremely powerful in that we seamlessly raise conventional SQL queries up to a higher level of predictive modeling. We instantly turn a SQL developer into a data miner. Bingo! (If you are interested in this point, please also see my early posts From Oracle SQL Developer to Data Miner and How easy is it for a SQL developer/analyst to become a data miner?

No comments: