Chapter 8. Reducing Data Size

One definition of Big Data states that data is big if it is at or beyond the edge of an organization's capability to process it. There is always too much data, and human nature being what it is, the boundaries of what is possible will inevitably be reached. The main problem is that more data takes more time to process. To a certain extent, more money, more computing power, and a more sophisticated parallel approach can help, but some data mining processes scale as the second, third, or worse power of the number of examples and attributes. For a process that scales quadratically, doubling the data size quadruples the runtime, and there comes a point where there is not enough money or time to finish the job. Techniques that scale linearly exist for certain types of problems, but it still costs money and effort to get there.

An important activity, therefore, is to recognize this and find ways to reduce both the number of examples and the number of attributes, while balancing the reduction against the accuracy of predictions or other modeling requirements. This chapter discusses methods for reducing data size.

The chapter starts with methods for removing examples using the Sample operator and its variants. From there, it progresses to methods for removing attributes, including the removal of useless attributes and attribute weighting, and it also illustrates model-based approaches.

Removing examples using sampling

The Sample operators allow example subsets to be chosen, and there are a number of different techniques to use depending on what is required and the characteristics of the data.

A very common use of the operator is simply to reduce the size of the data in order to test the flow of a complex process. If the full data consists of millions of rows, the execution time may be immense, and it is annoying to find a bug right at the end of a long run. Reducing the data to a few percent of the total allows bugs to be found early.
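
The same debugging trick can be sketched outside RapidMiner. The following Python snippet uses pandas rather than the Sample operator, and the data frame it builds is purely synthetic; it simply illustrates the idea of running a process on a small random fraction of the data first.

    import numpy as np
    import pandas as pd

    # Stand-in for a large example set: one million synthetic rows.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"value": rng.normal(size=1_000_000)})

    # While debugging, run the process on roughly 2% of the data so a
    # bug near the end of the flow shows up in seconds rather than hours.
    debug_df = df.sample(frac=0.02, random_state=42)
    print(len(df), len(debug_df))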

The Sample operator allows a proportion of an example set to be selected. There are three possible options: first, an absolute size for the result; second, a proportion of the example set; and third, a probability that each example will appear. The absolute option is useful when a fixed number of examples is required, while the proportion option is useful when a fixed percentage of the whole is required. The probability option is similar to the proportion option, but it considers each example independently and filters it with the given probability, which can lead to a different number of examples from run to run when compared with the proportion case. If the data contains a label, it is possible to balance the proportions of the label values in the generated data and specifically choose the proportion of examples within the filtered example set for each value of the label. This is useful if you have data where one label dominates, leading to class imbalance; the sampling can even up the class distribution. The Sample (Stratified) operator can also be used to create a sample where the proportion of label values in the sampled data matches the original data.
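
By way of illustration only, and not as a description of the operator's internals, the following pandas sketch mimics the three options and a stratified sample; the data frame df, the column name label, and the 10 percent figure are assumptions made for the example.

    import numpy as np
    import pandas as pd

    # Hypothetical example set with a skewed label distribution.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "x": rng.normal(size=1000),
        "label": rng.choice(["yes", "no"], size=1000, p=[0.9, 0.1]),
    })

    # Absolute: exactly 100 examples in the result.
    absolute = df.sample(n=100, random_state=1)

    # Relative: a fixed 10% proportion of the example set.
    relative = df.sample(frac=0.1, random_state=1)

    # Probability: each example is kept independently with p = 0.1,
    # so the result size varies slightly from run to run.
    probability = df[rng.random(len(df)) < 0.1]

    # Stratified: sample 10% within each label value so that the label
    # proportions in the sample match those of the original data.
    stratified = df.groupby("label", group_keys=False).apply(
        lambda g: g.sample(frac=0.1, random_state=1)
    )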

The Sample (Bootstrapping) operator is used to build datasets that are larger than the original dataset. It does this by sampling with replacement. At first sight, this may seem pointless, but when faced with a dataset with a large class imbalance, it is often important to build training sets that have an equal class balance. This is done by bootstrapping the original data to increase its size until the desired number of examples of one class is present. From there, the example set is sampled so that different label proportions appear in the result.
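
As a rough sketch of the same idea, again in pandas rather than RapidMiner and reusing the invented df and label names from the previous snippet, the minority class can be sampled with replacement until the classes are balanced.

    import numpy as np
    import pandas as pd

    # Hypothetical imbalanced example set: roughly 90% "yes", 10% "no".
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "x": rng.normal(size=1000),
        "label": rng.choice(["yes", "no"], size=1000, p=[0.9, 0.1]),
    })

    majority = df[df["label"] == "yes"]
    minority = df[df["label"] == "no"]

    # Bootstrap the minority class: sample with replacement until it is
    # the same size as the majority class.
    boosted = minority.sample(n=len(majority), replace=True, random_state=1)

    # Combine and shuffle to obtain a class-balanced training set.
    balanced = pd.concat([majority, boosted]).sample(frac=1, random_state=1)
    print(balanced["label"].value_counts())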

A process called sampleExamples.xml is provided with the files that accompany this book. It contains examples of all the Sample operators described in the preceding paragraphs.

Sampling inevitably introduces errors. The size of the error depends entirely on the data exploration and mining processes being performed, and investigation and analysis will be needed to estimate or measure it.
