As we mentioned earlier in this chapter, K-means is a popular clustering algorithm. Because it is unsupervised, it does not use a label or target field. Rather than trying to predict an outcome, K-means tries to uncover patterns and structure in the data by grouping data points into clusters based on the values of the input fields.
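To make the grouping idea concrete, here is a minimal sketch of the classic K-means loop in plain Python. It is an illustration only, not the implementation behind the SPSS K-Means node. The function name `kmeans`, the toy points, and the simplistic, evenly spaced choice of starting centroids are all assumptions made for this example (real implementations usually start from random or more carefully chosen centroids).

```python
def kmeans(points, k, iterations=100):
    """Group points into k clusters by alternating assignment and update steps."""
    # Assumption for this sketch: take k evenly spaced data points as the
    # starting centroids so that the example is deterministic.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance is enough for comparing distances).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # an empty cluster keeps its centroid
        if updated == centroids:
            break  # centroids stopped moving, so the clustering has converged
        centroids = updated
    return labels, centroids


# Hypothetical toy data: two visibly separate groups of two-dimensional points.
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that no target field appears anywhere in the code: the cluster labels come purely from how close the points are to one another.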
Using the sample data that we have been working with in this chapter, let's say that we don't know whether a person has chronic kidney disease and would like to use the K-means algorithm to build an unsupervised model to see whether it reveals any pattern related to chronic kidney disease.
We'll choose the K-Means node in our flow to accomplish this task.
Let's take a look at the following steps to learn how to use the K-means algorithm:
- From the left, under Modeling, we can select the K-Means node and drop it onto the canvas.
- Next, connect the node to the Type node as shown in the following screenshot:
- Once we have added the K-Means node, right-click and open it to change its settings (on the right-hand side of the canvas). Specifically, under BUILD OPTIONS, we'll set Number of clusters to 2, based on the idea that we want to organize our data into two groups (or clusters): those with chronic kidney disease and those without. All of the other settings can remain at their defaults. Finally, click on Save.
- Now, after you run the flow, a golden K-Means node will appear (shown in the following screenshot) on which you can right-click and select View Model:
- SPSS visualizations offer interactive tables and charts to help evaluate a predictive model. These visualizations provide a single all-inclusive set of output so that you don't need to create multiple charts and tables to determine the model’s performance. Depending on the algorithm, you'll see a set of visualizations that are related to your specific data set and model. The following is the output from our K-means model:
The output includes information on Cluster Quality (shown in the preceding screenshot) as well as Predictor Importance (shown in the following screenshot):
And finally (although there are other informational visualizations generated), it shows the basic Model Information (as shown below):
It is so easy to make changes to the nodes, rerun the flow, and re-evaluate the results to determine the best algorithm and parameters that you should expect multiple iterations to be part of the process.
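The same change, rerun, and re-evaluate loop can be sketched outside the flow editor. The following plain-Python example tries several values for the number of clusters and compares them using the mean silhouette score, a common cluster-quality measure in which values near 1 indicate tight, well-separated clusters. Everything here is an assumption for illustration: the helper names `kmeans` and `silhouette`, the toy points, and the choice of the silhouette score itself, which may not match exactly what the SPSS output reports.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, k, iterations=100):
    """Minimal K-means returning one cluster label per point."""
    # Assumption: deterministic, evenly spaced starting centroids.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            updated.append(
                tuple(sum(dim) / len(members) for dim in zip(*members))
                if members else centroids[c]
            )
        if updated == centroids:
            break
        centroids = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette score over all points (singleton clusters score 0)."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for q, label in zip(points, labels) if label == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data with two natural groups.
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]

# Iterate: change the setting, rerun, and re-evaluate.
scores = {k: silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data, the two-cluster setting scores highest, which matches the way the points were constructed. On real data the comparison is rarely this clear-cut, which is why several iterations are usually needed.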