SPSS flow and K-means

As we mentioned earlier in this chapter, K-means is a popular clustering algorithm. Because it works without a labeled or target field, K-means does not try to predict an outcome. Instead, it tries to uncover patterns and structure in the data by grouping data points into clusters based on the values of the input fields.
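To make the idea concrete, here is a minimal sketch of the classic K-means loop (assign each point to its nearest center, then move each center to the mean of its points) in plain Python. It is only an illustration: the function name `kmeans`, its parameters, and the six two-dimensional points are made up for this example, and nothing here is meant to describe how the SPSS K-Means node is implemented internally.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (a list of equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Start from k data points chosen at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Made-up sample: two visibly separate groups of points, no target field.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that the cluster numbers themselves carry no meaning: the algorithm only says which records belong together, and it is up to us to interpret what each group represents.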

Using the sample data that we have been working with in this chapter, let's say that we don't know whether a person has chronic kidney disease or not and would like to use the K-means algorithm to build an unsupervised model to see whether we can identify any pattern for chronic kidney disease.

We'll choose the K-Means node in our flow to accomplish this task.

The K-Means node offers a method of cluster analysis. For details, refer to Chapter 11 of the IBM SPSS Modeler 15 documentation at the following link: http://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/15.0/en/ModelingNodes.pdf.

Let's work through the following steps to apply the K-means algorithm:

  1. From the left, under Modeling, we can select the K-Means node and drop it onto the canvas.
  2. Next, connect the node to the Type node as shown in the following screenshot:

Note that I have disconnected the Partition node that we used earlier.
  3. Once we have added the K-Means node, right-click it and open it to change its settings (on the right-hand side of the canvas). Specifically, under BUILD OPTIONS, we'll set Number of clusters to 2, based on the idea that we want to organize our data into two groups (or clusters): those with chronic kidney disease and those without. All of the other settings can keep their defaults. Finally, click on Save.
  4. Now, after you run the flow, a golden K-Means node will appear (shown in the following screenshot), on which you can right-click and select View Model:

  5. SPSS visualizations offer interactive tables and charts to help evaluate a predictive model. These visualizations provide a single all-inclusive set of output so that you don't need to create multiple charts and tables to determine the model’s performance. Depending on the algorithm, you'll see a set of visualizations that are related to your specific data set and model. The following is the output from our K-means model:

The output includes information on Cluster Quality (shown in the preceding screenshot) as well as Predictor Importance (shown in the following screenshot):

Cluster quality evaluation is a complex subject and is beyond the scope of this chapter. However, IBM Watson Studio provides the typical cluster quality details, such as the Cluster Sizes chart, a horizontal bar chart that displays the relative sizes of the clusters in descending order. Hovering over a bar shows the precise percentage of the total number of instances that fall in that cluster based on the K-Means model. All of the clustering information should be reviewed and evaluated with respect to various project options and outcomes.
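The numbers behind a cluster sizes chart are simple to reproduce. The following sketch, which assumes a hypothetical list of cluster assignments (one label per record) and a helper named `cluster_sizes` invented for this example, computes each cluster's share of the records and lists the clusters from largest to smallest:

```python
from collections import Counter


def cluster_sizes(labels):
    """Return (cluster, percent of all records) pairs, largest cluster first."""
    counts = Counter(labels)
    total = len(labels)
    return [(cluster, 100.0 * n / total) for cluster, n in counts.most_common()]


# Hypothetical cluster assignments for eight records.
labels = [0, 0, 0, 1, 0, 1, 1, 0]
for cluster, pct in cluster_sizes(labels):
    bar = "#" * round(pct / 5)  # crude text bar, roughly one '#' per 5 percent
    print(f"cluster {cluster}: {bar} {pct:.1f}%")
```

A very lopsided split (for example, one cluster holding nearly all of the records) is often a hint that the chosen number of clusters or the input fields deserve a second look.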

Finally (although other informational visualizations are also generated), the output shows the basic Model Information (as shown below):

An awesome feature of the SPSS modeler flow is that you can build multiple, different models within the same canvas!

It is so easy to make changes to the nodes, rerun the flow, and then re-evaluate the results to determine the best algorithm and parameters that you should simply assume multiple iterations as part of the process.
