Statistical summaries

Having promised pictures, we nevertheless start with a reminder never to underestimate a simple statistical summary. The Statistics View in the RapidMiner Studio GUI gives such a summary, which is very useful for getting a sense of how big the data is and what its range is. The view is available for example sets when the Results view is selected. It is always worthwhile to take a careful look at it to check that the attributes are of the correct type. Numerical attributes should have an average and a standard deviation that look sensible, nominal attributes should have a full set of valid values, and dates should fall within an expected range. The Statistics View also shows which attributes have missing values.

Sensibly, the GUI does not attempt to calculate statistics when there is too much data. In this situation, it is possible to calculate statistics in a process by using the Extract Macro operator. This operator is used to set a macro from some aspect of an example set that is being processed. By selecting the macro type to be statistics, and selecting one of the possible calculations such as average, count, sum, and so on, a macro value for an attribute can be calculated.
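The idea of computing one named statistic for one attribute and storing it under a macro name can be sketched outside RapidMiner. The following Python fragment is only an illustration, not the operator's implementation: the `examples` data, the `extract_statistic` helper, the `min` and `max` entries, and the choice to skip missing values are all assumptions made for the sketch.

```python
import statistics

# Hypothetical stand-in for an example set: one dict per example.
examples = [
    {"a1": 5.1, "a2": 3.5},
    {"a1": 4.9, "a2": 3.0},
    {"a1": 6.3, "a2": 3.3},
]

# "average", "count" and "sum" are named in the text; "min" and "max"
# are assumed here as plausible further calculations.
STATISTICS = {
    "average": statistics.mean,
    "count": len,
    "sum": sum,
    "min": min,
    "max": max,
}


def extract_statistic(rows, attribute, statistic):
    """Return one statistic for one attribute, skipping missing (None) values."""
    values = [row[attribute] for row in rows if row.get(attribute) is not None]
    return STATISTICS[statistic](values)


# A plain dict plays the role of the macro store (an assumption of this sketch).
macros = {}
macros["a1_average"] = extract_statistic(examples, "a1", "average")
macros["a1_count"] = extract_statistic(examples, "a1", "count")
```

Each call corresponds loosely to one configured operator: one attribute, one calculation, one macro name.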

For large numbers of attributes this can be laborious, but it is perfectly possible to create a process using the Loop Attributes operator that loops over all attributes and generates a set of macros for each type of statistical measure. These macros can then be logged using the Provide Macro as Log Value operator and converted to an example set using the Log to Data operator. An example process named logAttributeDetails.xml is provided with the files that accompany this book. It shows how these operators can be combined to produce a statistical summary that replicates the Statistics View.
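The loop-and-log pattern described above can be approximated in a few lines of Python. This is a rough sketch of the idea, not a translation of logAttributeDetails.xml: the `summarize` function, the sample data, and the particular statistics collected are assumptions, and the standard deviation shown is the sample standard deviation, which may differ from what the Statistics View reports.

```python
import statistics

# Hypothetical example set; None marks a missing value.
examples = [
    {"a1": 5.1, "a2": 3.5, "label": "Iris-setosa"},
    {"a1": 4.9, "a2": None, "label": "Iris-setosa"},
    {"a1": 6.3, "a2": 3.3, "label": "Iris-virginica"},
]


def summarize(rows):
    """Loop over all attributes and return one summary row per attribute."""
    summary_rows = []
    for name in rows[0].keys():
        values = [row[name] for row in rows if row[name] is not None]
        summary = {
            "attribute": name,
            "count": len(values),
            "missing": len(rows) - len(values),
        }
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) for v in values
        )
        if is_numeric:
            summary["average"] = statistics.mean(values)
            summary["min"] = min(values)
            summary["max"] = max(values)
            if len(values) > 1:
                # Sample standard deviation (an assumption of this sketch).
                summary["deviation"] = statistics.stdev(values)
        else:
            # For nominal attributes, record the distinct values seen.
            summary["values"] = sorted(set(values))
        summary_rows.append(summary)
    return summary_rows


summary_table = summarize(examples)
for row in summary_table:
    print(row)
```

The list of dictionaries plays the role of the example set produced from the log: one row per attribute, one column per statistic.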

To identify examples with missing attribute values, the Filter Examples operator can be used. Set the condition class for this operator to missing_attributes and only examples with at least one missing attribute value will be shown. The logAttributeDetails.xml process includes an example of this operator in use.
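The filtering rule (keep an example if at least one of its attribute values is missing) can be expressed as a short Python sketch. The data and the `has_missing` helper are hypothetical, and treating both `None` and NaN as missing is an assumption of the sketch rather than a statement about how RapidMiner represents missing values.

```python
import math

# Hypothetical example set; None and NaN both stand for a missing value here.
examples = [
    {"a1": 5.1, "a2": 3.5, "label": "Iris-setosa"},
    {"a1": 4.9, "a2": None, "label": "Iris-setosa"},
    {"a1": float("nan"), "a2": 3.3, "label": "Iris-virginica"},
]


def is_missing(value):
    """Treat None and floating-point NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def has_missing(row):
    """True if at least one attribute value in the example is missing."""
    return any(is_missing(value) for value in row.values())


# Keep only the examples with at least one missing attribute value.
with_missing = [row for row in examples if has_missing(row)]
```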

This chapter is about visualization, and one plotter that usefully supplements the Statistics View is the quartile plotter, which is a compact way to summarize numerical attributes in particular.

For example, the following screenshot shows the Statistics for the Iris dataset:

[Screenshot: Statistics View for the Iris dataset]

The same data plotted using the quartile plotter is shown in the following screenshot:

[Screenshot: quartile plot of the Iris dataset]

This plot can be obtained by holding down the Ctrl key and selecting multiple attributes. The mean is shown by the dot within the colored area, and the range of the standard deviation is shown by the vertical line offset toward the right, inside the colored area. The colored area itself spans the 25th to 75th percentiles (the first to third quartiles), and the 10th and 90th percentiles are the horizontal lines above and below (outside) the colored area. Finally, the range is represented by the dots at the extremes. Comparing the two screenshots confirms that the plot matches the numbers in the Statistics View. For example, the maximum for attribute a3 (the third from the left) is 6.9, and this is shown on the graph as the highest point for that attribute. Similarly, the average for the same attribute is 3.79, and this is represented by the dot within the shaded area.
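To make the elements of the plot concrete, the following Python sketch computes the same quantities for a single numerical attribute: mean, standard deviation, the 10th, 25th, 75th, and 90th percentiles, and the extremes. It is only an illustration. The `quartile_plot_summary` function and its sample data are hypothetical, and the percentile interpolation and the use of the sample standard deviation are assumptions that may differ slightly from what the plotter computes.

```python
import statistics


def quartile_plot_summary(values):
    """Return the quantities a quartile-style plot displays for one attribute."""
    data = sorted(values)
    # n=4 yields the three quartile cut points; n=10 yields nine decile cut points.
    quartiles = statistics.quantiles(data, n=4, method="inclusive")
    deciles = statistics.quantiles(data, n=10, method="inclusive")
    return {
        "min": data[0],                       # lower extreme dot
        "p10": deciles[0],                    # lower horizontal line
        "p25": quartiles[0],                  # bottom of the colored area
        "mean": statistics.mean(data),        # dot inside the colored area
        "stdev": statistics.stdev(data),      # length basis of the vertical line
        "p75": quartiles[2],                  # top of the colored area
        "p90": deciles[8],                    # upper horizontal line
        "max": data[-1],                      # upper extreme dot
    }


# Hypothetical attribute values, not taken from the Iris dataset.
summary = quartile_plot_summary([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Running the same function over each numerical attribute of an example set gives a table that can be checked against both the Statistics View and the plot.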

The metadata gives information about individual attributes in isolation. Another important aspect is to understand how attributes relate to one another. This aspect is covered in the following section.
