Tuesday, March 5, 2013

Clustering in Oracle Data Miner–Part 4

This is a the fourth part of a five (5) part blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.
  1. The first part we looked at what clustering features exist in ODM and how to setup the data that we will be using in the examples
  2. The second part will focus on how to building Clusters in ODM .
  3. The third post will focus on examining the clusters produced by ODM and how to use the Clusters to apply to new data using ODM.
  4. The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
  5. The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.
With Clustering we are trying to find hidden patterns in our data. Unlike classification we a not directing the algorithms on what areas/attributes to focus on.
In my preceding post on Clustering in Oracle Data Miner I showed how you can use the Oracle Data Miner (ODM) tool, that is part of SQL Developer, to build a cluster model and to apply it to new data.
As I have shown in previous blog posts we can do all of the same steps using SQL and PL/SQL. In this blog post I will show you how to to these SQL and PL/SQL functions to generate a Clustering model and how you can evaluate the model produced.
Step 1 – What Cluster Models to we already have
Oracle Data Miner comes with a number of data dictionary tables. We can use these to see what already exists in our schema. In the example below we will have a look to see the Cluster models that were produced in the Part 2 blog post.
format model_name format a20
format mining_function format a20
format algorithm format a20

SELECT model_name,
       mining_function, 
       algorithm, 
       build_duration, 
       model_size
FROM ALL_MINING_MODELS
WHERE mining_function = 'CLUSTERING';

image
We can also look at the model settings that ODM produced. We can look at the settings of the K-Means model. This model (in my case) is called CLUS_KM_1_25.
column setting_value format a20
select setting_name,
         setting_value,
         setting_type
from  all_mining_model_settings
where model_name = 'CLUS_KM_1_25'

image
We can also look to see what attributes are used in the K-Mean model.
SELECT attribute_name, 
       attribute_type, 
       usage_type, 
       target
from  all_mining_model_attributes
where model_name = 'CLUS_KM_1_25'

image
I will show you how to use this Clustering model or the one that we are about to generate in my next blog post.
Step 2 – Creating the Setting table
When creating an Oracle Data Mining model in SQL you need to create Setting table. This will contain a record for each setting they you need to set for the algorithm you want to run. The settings table has a very basic structure and only consists of two columns. It is a good idea to create a separate Settings table for each algorithm or type of data mining you want to do. To create the settings table we run the following:
CREATE TABLE CLUSTER_SETTINGS (
Setting_Name  VARCHAR2(30),
Setting_Value VARCHAR2(4000));

The next step is to define the setting we want to use. The first of these will be the actual algorithm you want to run. Oracle has two Clustering algorithms: algo_kmeans and algo_o_cluster. In the examples we are working through we are using the K-Mean algorithm and we want to set the number of clusters to 10. We are also going to set the Automatic Data Preparation (ADP) on. Here is the code.
BEGIN
   INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.ALGO_KMEANS);
   INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.prep_auto, dbms_data_mining.PREP_AUTO_ON);
   INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.clus_num_clusters, 10);
  
   COMMIT;
END;

We can check to see that these records were inserted.
column setting_value format a25
select * from CLUSTER_SETTINGS;
image
The other settings will be dependent on what clustering algorithm is being used. Each one has their own setting. If you do not define any additional setting Oracle will use the in-build default settings.
To see what the default settings:
column setting_value format a35
select *
from table(dbms_data_mining.get_default_settings)
where setting_name like 'KM%';

image
If you want to use a different value to the default, just create a new record in the CLUSTER_SETTINGS table with the new value.

Step 3 – Create the new Cluster Model
We have the algorithm settings ready, so the next steps is to create the Cluster model. We will be using the DBMS_DATA_MINING.CREATE_MODEL function and pointing it to our CLUSTER_SETTINGS table.
BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'CLUSTER_KMEANS_MODEL',
      mining_function     => dbms_data_mining.clustering,
      data_table_name     => 'INSURANCE_CUST_LTV',
      case_id_column_name => 'CUSTOMER_ID',
      target_column_name  => null,
      settings_table_name => 'CLUSTER_SETTINGS');
END;

This takes a few seconds to run on my laptop. When finished we can uses queries like we used in Step 1 above to find out the details of what was produced.
Step 4 – Examining the Cluster model
The first thing that we will need to check is that the model was created.
format model_name format a20
format mining_function format a15
format algorithm format a15

SELECT model_name, 
       mining_function,
       algorithm,
       build_duration,
       model_size
FROM ALL_MINING_MODELS
WHERE mining_function = 'CLUSTERING';

image
One thing that is a bit curious is the mode size. We have basically created two K-means models using the same settings, but the final models have different sizes. See below for an explanation.
Now lets look at the algorithms settings for our new Cluster model (Cluster KMeans Model)
select setting_name,
       setting_value,
       setting_type
from  all_mining_model_settings
where model_name = 'CLUSTER_KMEANS_MODEL'

image
We can also look at the attributes used in the clusters.
SELECT attribute_name, 
       attribute_type, 
       usage_type, 
       target
from  all_mining_model_attributes
where model_name = 'CLUSTER_KMEANS_MODEL'
image
If we compare this list back to the list of attributes that were part of the model that we developed in the ODM tool, you will see that we have some extra attributes listed for the new K-means model.
The reason for these extra attributes and the bigger size (although still small) will be due to us having applied a sampling in the ODM tool. We took a subset of the data in the ODM tool and built the model based on this data. For the model we just created in SQL we took all the data (no sampling). That is why it would have looked at including more attributes because it was looking at a much bigger data set, etc.
Step 5 – Examining the Cluster Details
In this step we are going to look at some of the clusters produced by our new Cluster model.
We can produce the information relating to the clusters that were produced and what level in the hierarchy each cluster belongs. We can use this information to create/draw our own cluster hierarchy diagram.
set lines 90
column child format a40

select id, 
       record_count, 
       parent, 
       tree_level, 
       child
from table(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL'))

image
To look at the Centroid details for a particular cluster e.g. Cluster ID = 7, we can run:
column attribute_name format a25
column mode_value format a25

select t.id,
       c.attribute_name, 
       c.mean,
       c.mode_value,
       c.variance
from table (dbms_data_mining.get_model_details_KM('CLUSTER_KMEANS_MODEL')) t,
     table(t.centroid) c
where t.id = 7
order by c.attribute_name

The results will contain details of the centroid for cluster number 7. The centroid details will include the attributes that were used to form the centroid and the values for each attribute.
image
The last thing we can look at is the rule for a cluster. The following query will give you the full details of the cluster rule for Cluster 7. This will give for each attribute what the values or ranges of values that is used for that cluster. I’ll let you run this as it returns a large number of records. Run it and have a look at some of the output.
select t.id, 
       a.attribute_name, 
       a.conditional_operator,
       nvl(a.attribute_str_value,
       a.attribute_num_value) AS value,
       a.attribute_support,
       a.attribute_confidence
from TABLE(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL'))  t, 
     TABLE(t.rule.antecedent)  a
where t.id = 7
ORDER BY t.id, a.attribute_name, attribute_support, attribute_confidence desc, value;


My next blog post on Clustering in Oracle, will be the final one in the series. It will show you how to apply a Clustering model to new data in your database.

No comments:

Post a Comment