Relationships between attributes

Understanding how attributes relate to one another is important, and visualization can help. Attributes may be correlated with one another, and seeing this can shed light on the data and suggest further processing that helps to understand it and make progress towards the overall objective.

There are many ways to show how attributes relate to one another. These include scatter plots, 3D scatter plots, parallel plots, deviation plots, and quartile plots, which are described in the following sections.

Scatter plots

To start answering questions about the relationships between attributes and examples, the scatter plot, which was mentioned earlier, is a quick summary method. A good next step is the scatter matrix plotter, which plots every pairwise combination of the attributes in a grid. One attribute is chosen to determine the color of the points.

An example using the Iris dataset is shown in the next screenshot. The idea is to spot patterns in the data and see if it is possible to explain them. A general rule for classification is to see if groups of colored points representing labeled data can be separated by simple lines; in effect the observer is becoming a support vector classifier. By doing this, we can gain a better understanding of the data.

For example, the upper-right corner of the screenshot shows a graph of a4 on the x axis and a1 on the y axis. The points are colored based on the class of the example. The graph shows that there is an approximate correlation between a1 and a4, and that low values favor one class very clearly and higher values favor the others with a reasonably clear threshold. Understanding this and deciding what it means for the data mining task is an important step.
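A correlation that is picked out by eye in this way can also be checked numerically. The following is a minimal Python sketch rather than anything RapidMiner provides; the function name pearson and the sample values are illustrative assumptions and are not measurements from the Iris dataset.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_x * var_y)


# Made-up values standing in for two attributes (not real Iris measurements).
a1 = [5.0, 5.2, 6.1, 6.5, 7.0]
a4 = [0.2, 0.3, 1.4, 1.8, 2.2]
r = pearson(a1, a4)
print(f"correlation between a1 and a4: {r:.3f}")
```

A value close to 1 or -1 supports what the plot suggests, while a value near 0 suggests that the apparent pattern deserves a second look.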

[Screenshot: scatter matrix plot of the Iris dataset]

The jitter parameter allows each point to be given a random nudge. This allows points that are very close together to be seen more easily, and this gives a sense of the density of the points in space. The way to understand how jitter works is to imagine that all the points that share the same x and y attribute are stacked one on top of another and you are looking down on them from a great height so only a single point is seen. Applying a small jitter causes the co-located points to become visible.
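The effect of jitter can be illustrated outside RapidMiner with a short Python sketch. The function add_jitter, the amount of 0.05, and the sample points are all assumptions made for illustration; the way RapidMiner scales its own jitter parameter is not described here.

```python
import random


def add_jitter(points, amount, seed=None):
    """Return a copy of points with each coordinate nudged by up to +/- amount."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Three examples that share the same x and y values are drawn as one point.
stacked = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0)]
jittered = add_jitter(stacked, amount=0.05, seed=42)

distinct_before = len(set(stacked))   # the stacked points hide one another
distinct_after = len(set(jittered))   # after the nudge they separate
print(distinct_before, distinct_after)
```

The nudge is kept small relative to the spread of the data so that the points separate without changing the overall shape of the plot.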

Of course, real data is never this easy, and the detail quickly becomes impossible to see when there are too many attributes; 20 attributes plotted this way would be difficult to read. In these situations, it may be appropriate to select groups of attributes using the Select Attributes operator to see how the attributes within these groups interact with one another. For example, if there are 20 attributes, selecting the first 10 with this operator and plotting them using the scatter matrix plotter will show how these 10 attributes relate to one another. The next 10 could then be selected and plotted. The interactions between attributes in different groups would not be seen, however, so care should be taken, or else the number of permutations to plot would quickly get out of hand.
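The arithmetic behind this trade-off can be sketched in a few lines of Python. The attribute names are hypothetical, and the counts are of unordered pairs; a matrix that draws each pair in both orientations shows twice as many panels.

```python
from itertools import combinations
from math import comb

attributes = [f"att{i}" for i in range(1, 21)]   # 20 hypothetical attribute names

all_pairs = comb(len(attributes), 2)             # every attribute against every other

first_group, second_group = attributes[:10], attributes[10:]
pairs_within_groups = (
    len(list(combinations(first_group, 2)))
    + len(list(combinations(second_group, 2)))
)
pairs_across_groups = len(first_group) * len(second_group)

print(all_pairs, pairs_within_groups, pairs_across_groups)  # 190 90 100
```

Splitting 20 attributes into two groups of 10 therefore covers 90 of the 190 pairs and leaves the 100 cross-group pairs unseen.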

Sometimes, however, two dimensions are not enough and so RapidMiner provides us with the scatter 3D color plotter.

Scatter 3D color

Again, using the Iris dataset, a 3D representation is shown in the following screenshot:

[Screenshot: Scatter 3D color plot of the Iris dataset]

Large datasets can be difficult to view using this plotter because there is a lot to plot and this can be beyond the capabilities of the computer running the GUI. Sampling using the Sample operator is one possibility in this situation. A process named scatterPlotAndSample.xml is included with the files that accompany this book. This shows the Iris dataset and also a 100,000 point dataset which has been sampled. Comparing the sampled and unsampled data on a plot gives a sense of whether the sampled data is still representative of the unsampled data, and therefore, whether it is useful to help understand the data or not.
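As a complement to comparing the plots by eye, summary statistics of the sampled and unsampled data can be compared. The sketch below is a Python illustration of the idea and is not the scatterPlotAndSample.xml process; the synthetic data, the sample size of 1,000, and the choice of mean and standard deviation as the comparison are assumptions.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic stand-in for a large dataset: 100,000 values of one attribute.
full = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Keep 1,000 examples chosen at random without replacement.
sample = rng.sample(full, 1_000)

full_mean, sample_mean = statistics.fmean(full), statistics.fmean(sample)
full_sd, sample_sd = statistics.pstdev(full), statistics.pstdev(sample)

print(f"mean: full={full_mean:.2f} sample={sample_mean:.2f}")
print(f"sd:   full={full_sd:.2f} sample={sample_sd:.2f}")
```

If the two sets of figures are close, the sample is more likely to give a plot that is representative of the full data.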

An alternative is to use the parallel and deviation plotters described in the next section.

Parallel and deviation

The parallel plotter is used to see the relationships between attributes when there are many attributes and examples. The plotter lists each attribute on the x axis and then plots the value of each attribute for each example, joining the values for one example into a line. One of the attributes is chosen as the color, and each line is colored based on the value of that attribute for the example.

Some data is provided along with this book to allow the following illustrations to be recreated. The data is contained in a file called DataToVisualize.csv and it can be imported using the Read CSV operator. Be sure to set the role for the attribute att16 to label, the attribute id to have the role id, and the attribute date to have the role date_time using the Set Role operator or by getting the parameters correctly set when importing the CSV file. A sample process to read this CSV file with the correct parameters is provided. It is called readDataToVisualize.xml. This process is very straightforward and the resulting example set contains 3,848 examples with 15 regular attributes called att1 to att15, one label attribute called att16, an ID attribute, and a date attribute. The label attribute is nominal and has three values based on an original value of att16; these are range1, range2, and range3.

An illustration of a parallel plot using this real but obfuscated data is shown in the following screenshot:

[Screenshot: parallel plot of the data from DataToVisualize.csv]

There are 16 attributes and 3,848 examples in the example set. This means that there are 3,848 different lines on the graphic, one for each example. att16 is chosen as the color so that when it has a high value, the line is colored red and when it has a low value it is colored blue. The graph has been locally normalized by checking the local normalization check box on the plotter, so that the range of the attributes is between zero and one. In monochrome, the colors may not show up very well, so the area to focus on is the right-hand side of the graph, where att13 and att14 are shown as having a relatively higher value at the same time as att16 has a higher value (as indicated by the color of the lines).

This indicates correlation between these attributes. The data can be systematically investigated to determine relationships and as before, questions will be raised about the data that will give a greater understanding when answered.
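The local normalization applied to the parallel plot can be sketched in Python as follows. This assumes that local normalization means min-max scaling each attribute independently to the range zero to one; the function name normalize_locally and the sample values are hypothetical.

```python
def normalize_locally(column):
    """Rescale one attribute's values to the range 0..1 (min-max scaling)."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]   # constant attribute: no spread to show
    return [(value - low) / (high - low) for value in column]


# Two hypothetical attributes on very different scales.
examples = {
    "att13": [120.0, 340.0, 560.0, 900.0],
    "att14": [0.01, 0.02, 0.05, 0.04],
}
normalized = {name: normalize_locally(values) for name, values in examples.items()}

for name, values in normalized.items():
    print(name, [round(v, 2) for v in values])
```

Scaling each attribute on its own stops attributes with large raw values from flattening the others, so the lines for every attribute use the full height of the plot.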

This plotter works well with larger numbers of attributes (that is, more points on the x axis). For large numbers of examples, however, the number of lines can make the display look cluttered. One way to reduce this is to use the Deviation plotter. This plots an average value for each attribute, with a separate line for each value of another chosen attribute. It also includes an upper and a lower bound of one standard deviation. The attribute chosen as the color must be nominal in order to get multiple colors and different averages, and hence different lines. An example plot is shown in the following screenshot:

[Screenshot: deviation plot of the data from DataToVisualize.csv]

This is the same data that we presented earlier, and it clearly shows the relationship between att13, att14, and att16. The removal of the clutter makes it easier now to see for the first time that there is perhaps a negative correlation for most of the other attributes against att16.
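The quantities drawn by a deviation plot can be approximated with a short Python sketch: a mean for each attribute within each value of the nominal color attribute, with bounds one standard deviation either side. The function name deviation_summary, the sample rows, and the use of the population standard deviation are assumptions; which form of standard deviation RapidMiner uses is not stated here.

```python
import statistics
from collections import defaultdict


def deviation_summary(rows, attributes, label):
    """Return {label value: {attribute: (mean - sd, mean, mean + sd)}}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label]].append(row)

    summary = {}
    for label_value, members in groups.items():
        summary[label_value] = {}
        for name in attributes:
            values = [row[name] for row in members]
            mean = statistics.fmean(values)
            sd = statistics.pstdev(values)   # population SD; an assumption
            summary[label_value][name] = (mean - sd, mean, mean + sd)
    return summary


# Hypothetical rows; the real data has 3,848 examples and 15 regular attributes.
rows = [
    {"att13": 1.0, "att14": 2.0, "att16": "range1"},
    {"att13": 3.0, "att14": 4.0, "att16": "range1"},
    {"att13": 7.0, "att14": 8.0, "att16": "range3"},
    {"att13": 9.0, "att14": 8.0, "att16": "range3"},
]
summary = deviation_summary(rows, ["att13", "att14"], "att16")
```

Each label value produces one line across the attributes, which is why thousands of individual lines collapse into a handful.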

Quartile color

To get a sense of the range of data points for attributes as a function of another attribute, the quartile color plot can be used. This is similar to the quartile plotter described earlier, except that the color is set by another attribute, which must be nominal. Focusing on att13 and att16, a quartile color plot is shown in the following screenshot (a small number of outlying points have been removed to allow the image to display more clearly in this book):

[Screenshot: quartile color plot of att13 colored by att16]

att16 dictates the number and colors of the bars that are drawn. The left y axis shows the range for att13. The graph shows that higher values of att16 correspond to higher values of att13. The outliers are significant, and many of them overlap with the middle range value of att16. They are worth investigating in order to determine the root cause.
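The figures behind one bar of a quartile color plot can be sketched in Python by grouping one attribute by the nominal attribute and taking quartiles within each group. The function name quartiles_by_label, the sample rows, and the "inclusive" quartile method are assumptions; RapidMiner may compute quartiles differently.

```python
import statistics
from collections import defaultdict


def quartiles_by_label(rows, attribute, label):
    """Return {label value: (minimum, Q1, median, Q3, maximum)} for one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label]].append(row[attribute])

    result = {}
    for label_value, values in groups.items():
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        result[label_value] = (min(values), q1, median, q3, max(values))
    return result


# Hypothetical values of att13 for two of the three label values.
rows = (
    [{"att13": float(v), "att16": "range1"} for v in range(1, 6)]
    + [{"att13": float(v), "att16": "range3"} for v in range(6, 11)]
)
result = quartiles_by_label(rows, "att13", "att16")
print(result["range1"])  # (1.0, 2.0, 3.0, 4.0, 5.0)
```

Comparing the tuples for different label values shows, in numbers, the same shift in range that the colored bars show in the plot.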

When exploring real data, a systematic investigation would be done with all the attributes to get a sense of how the attributes depend on one another.
