Relations between examples

Understanding how examples relate to each other is important. Examples that lie close to one another may be duplicates, so it is worth understanding how they arise and what, if anything, needs to be done about them.

Closeness in this context is quantified by a measure such as Euclidean distance or cosine similarity. RapidMiner can calculate many such measures; a brief explanation of Euclidean distance follows.
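To see how the two measures named above can disagree, here is a minimal Python sketch. It is not part of the RapidMiner processes supplied with the book; the function names and the vectors `a` and `b` are hypothetical and chosen only for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: b points in the same direction as a but is twice as long.
a = [1.0, 2.0]
b = [2.0, 4.0]

dist = euclidean_distance(a, b)
sim = cosine_similarity(a, b)
print(f"Euclidean distance: {dist:.3f}")
print(f"Cosine similarity:  {sim:.3f}")
```

The two vectors are a nonzero distance apart, yet their cosine similarity is 1 because they point in the same direction. Which notion of closeness is appropriate depends on whether magnitude matters for the data at hand.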

The following screenshot shows three data points in two dimensions:

[Screenshot: three data points in two dimensions]

The points are labeled 1, 2, and 3, and the Euclidean distances between them are shown in the inset table. The Euclidean distance between points 1 and 2 is given by the following equation:

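$$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

Here $(x_1, y_1)$ and $(x_2, y_2)$ denote the coordinates of points 1 and 2.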

Intuitively, we can see that points 1 and 2 are closer to each other than either is to point 3. This suggests that these two points may be more closely related to each other than to the third, and this information helps us understand the data.
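The same comparison can be sketched in a few lines of Python. The coordinates below are hypothetical, since the exact values in the figure are not reproduced here; they are chosen only so that points 1 and 2 lie close together and point 3 lies further away.

```python
import math
from itertools import combinations

# Hypothetical coordinates, not the values from the book's figure.
points = {1: (1.0, 1.0), 2: (2.0, 2.0), 3: (6.0, 5.0)}

# Euclidean distance for every unordered pair of points.
distances = {
    (i, j): math.dist(points[i], points[j])
    for i, j in combinations(sorted(points), 2)
}

for (i, j), d in distances.items():
    print(f"d({i}, {j}) = {d:.3f}")

# The pair with the smallest distance is the most closely related one.
closest_pair = min(distances, key=distances.get)
print("Closest pair:", closest_pair)
```

With these assumed coordinates, the pair (1, 2) has the smallest distance, which matches the intuition described above.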

This approach extends to higher dimensions, but it quickly becomes impossible to visualize when there is a lot of data. There are two approaches described here that can help us with this. The first of these involves plotting a histogram of the distances.

Using histograms

As an example, the following screenshot shows the distribution of all the pairwise distances for the DataToVisualize.csv data provided with this book. To reproduce it, run the DistancesPlotter.xml process provided, which uses the Data to Similarity operator to create the data for the histogram. With the resulting example set open in the results view, select the histogram plotter and plot the distance:

[Screenshot: histogram of the pairwise distances]

This is a large dataset containing nearly 15 million pairs, which may be close to the limit of what can realistically be displayed in the RapidMiner GUI. Nonetheless, the histogram shows no extreme outliers, and the peaks at distances of 0.2 and 0.35, together with the small peak at 1.0, point to structure in the data that is worth investigating.
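Outside RapidMiner, the idea behind this histogram can be sketched in plain Python: compute one distance per pair of examples, then count the distances into bins. The data below is randomly generated stand-in data, not DataToVisualize.csv, and the bin width and scaling are arbitrary assumptions.

```python
import math
import random
from itertools import combinations

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in data: 200 examples with 3 numeric attributes each.
examples = [[random.random() for _ in range(3)] for _ in range(200)]

# One distance per unordered pair: n * (n - 1) / 2 values in total.
pair_distances = [math.dist(a, b) for a, b in combinations(examples, 2)]

# Count the distances into fixed-width bins.
bin_width = 0.1
counts = {}
for d in pair_distances:
    bin_index = int(d / bin_width)
    counts[bin_index] = counts.get(bin_index, 0) + 1

# Crude text histogram: one '#' per 200 pairs in the bin.
for bin_index in sorted(counts):
    low = bin_index * bin_width
    bar = "#" * (counts[bin_index] // 200)
    print(f"{low:4.1f}-{low + bin_width:4.1f} {bar}")
```

Because the number of unordered pairs grows as n(n - 1)/2, even a modest number of examples produces a very large number of distances, which is why display limits are reached quickly.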

Using block plots

An alternative way to display relations between examples is to display them in a grid, with one set of examples represented along the x axis and the other set along the y axis. The intersection is then colored to represent the distance between the examples.
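The grid described above can be sketched as a small distance matrix. This is only an illustration of the idea, not of RapidMiner's block plotter; the two example sets are hypothetical, and text characters stand in for colors.

```python
import math

# Hypothetical example sets: one runs along the x axis, the other along the y axis.
x_examples = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
y_examples = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]

# grid[row][col] holds the distance between y_examples[row] and x_examples[col].
grid = [[math.dist(y, x) for x in x_examples] for y in y_examples]

# Map each distance to a character "shade" as a stand-in for color:
# a blank is the smallest distance, '#' the largest.
shades = " .:*#"
largest = max(max(row) for row in grid)
for row in grid:
    line = ""
    for d in row:
        level = round(d / largest * (len(shades) - 1))
        line += shades[level]
    print(line)
```

Each cell corresponds to one pair of examples, so patterns such as lines or blocks in the grid reflect groups of pairs with similar distances.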

Distances can be calculated with the Data to Similarity operator (as was done for the histogram above), but a better alternative is the Cross Distances operator. This operator provides a way to select only the nearest or furthest distances, which can be vital if the number of pairs of examples is very large, because too many pairs cannot be displayed in the RapidMiner GUI.
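The benefit of keeping only the nearest distances can be sketched in Python. This is not how the Cross Distances operator is implemented; the function `nearest_pairs`, its parameter `k`, and the data are hypothetical and serve only to show how the selection limits the number of pairs.

```python
import heapq
import math

def nearest_pairs(requests, references, k):
    """For each request example, keep only its k nearest reference examples.

    Returns (request_index, reference_index, distance) rows, so the output
    has at most len(requests) * k rows instead of every possible pair.
    """
    rows = []
    for r_idx, request in enumerate(requests):
        scored = [
            (math.dist(request, reference), ref_idx)
            for ref_idx, reference in enumerate(references)
        ]
        for distance, ref_idx in heapq.nsmallest(k, scored):
            rows.append((r_idx, ref_idx, distance))
    return rows

# Hypothetical data for illustration only.
requests = [(0.0, 0.0), (5.0, 5.0)]
references = [(0.0, 1.0), (4.0, 5.0), (10.0, 10.0), (0.5, 0.0)]

rows = nearest_pairs(requests, references, k=2)
for row in rows:
    print(row)
```

Here the full comparison would produce eight pairs, while the selection keeps four. Swapping `heapq.nsmallest` for `heapq.nlargest` would keep the furthest pairs instead.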

The following screenshot shows such a plot. It is the block plotter view of the result of the Cross Distances operator in the DistancesPlotter.xml process. The x axis is set to request, the y axis is set to document, and the color represents the distance.

[Screenshot: block plot of request against document, colored by distance]

There is considerable structure in the data. Given that this data has a time series element, the graphic shows how examples change as a function of time. The most striking features are the diagonal lines, which are evidence of a repeating pattern (every 26 minutes, interestingly). The horizontal lines at 550, 1,300, 2,100, and 2,900 are also interesting and need to be understood.
