Options for handling missing data

Data exploration identifies missing data; the wider process, outside the scope of the exploration itself, must then consider the options for handling it and how these options are affected by the type of missing data. The following sections give some guidelines to help you make your decisions.

Returning to the root cause

Missing data is clearly a bad thing, so when it occurs it is always worth stepping back and determining why it is missing in the first place. Time spent fixing the root cause of missing data will save time later and improve the quality of the data exploration and mining process in general.

Ignore it

Some learning algorithms cope with missing values; others do not. An example of one that is extremely sensitive to missing data is the support vector machine, which can produce very poor results even with a single missing attribute value. For example, using the LibSVM operator with the 10,000 examples from an earlier section in this chapter, it is possible to achieve a 99.6 percent classification performance. Adding a few missing values immediately reduces this to 50 percent, and no warning message is given. It could easily happen that, when processing unseen data, model accuracy is completely compromised because a few rogue missing values creep in.

Ignoring missing data, in the sense of allowing it to remain without understanding that it is there and what effect it could have, is unwise. The key point is that if missing data is to be allowed, an active decision must be taken to allow it.
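The active decision described above starts with knowing what is missing. A minimal sketch of such a check using pandas (the function name report_missing and the example data are mine, not from the chapter):

```python
import numpy as np
import pandas as pd

def report_missing(df: pd.DataFrame) -> pd.Series:
    """Count missing values per attribute, largest first, omitting clean attributes."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Hypothetical example set with two missing values in att1.
df = pd.DataFrame({
    "att1": [1.0, np.nan, 3.0, np.nan],
    "att2": [0.5, 0.6, 0.7, 0.8],
})

print(report_missing(df))  # att1 is reported with a count of 2; att2 is omitted
```

Running a check like this before modeling ensures that a sensitive learner such as an SVM is never handed missing values unknowingly.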

Manual editing

Manual editing has several drawbacks. It does not scale as the amount of missing data increases; it is error prone and can introduce bias; and it does not address the deployment problem when unseen test data is presented to a model. In that situation, the person doing the manual editing has to be available, has to remember the rules used to edit the data, and may have to cope with missing values that do not fit the manual rule.

Generally, it is not wise to perform manual editing. If you find yourself doing it, ensure that whatever is done is turned into an automated rule that can be applied later.
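Turning a manual edit into an automated rule can be as simple as capturing it in a small function. A sketch, with an entirely hypothetical rule and lookup table (none of these names come from the chapter):

```python
def fill_age(record: dict) -> dict:
    """Hypothetical manual rule captured as code: if 'age' is missing,
    fill it with an assumed typical age for the record's department."""
    typical_age = {"sales": 34, "engineering": 29}
    default_age = 30  # fallback when the department is unknown
    if record.get("age") is None:
        record = {**record, "age": typical_age.get(record.get("dept"), default_age)}
    return record

print(fill_age({"dept": "sales", "age": None}))  # age filled with 34
```

Once the rule is code, it can be applied identically to training data and to unseen deployment data, removing the dependence on the person who invented it.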

Deletion of examples

This is a common approach, also known as case deletion or list-wise deletion. Most people do this, and it is acceptable if the number of examples to be deleted is small, and more importantly, if the missing data has been determined to be MCAR.

If the missing data is NMAR, deleting examples risks introducing bias. Deletion of an example also discards all of its other attribute values, which are not missing. This loss of data is generally to be avoided, since the data is precious.
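List-wise deletion is a one-line operation in pandas, and a small sketch makes the cost visible: the dropped row takes its perfectly good values with it (the example frame here is mine, for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "att1": [1.0, np.nan, 3.0, 4.0],
    "att2": [10.0, 20.0, 30.0, 40.0],
    "label": ["a", "b", "a", "b"],
})

# List-wise (case) deletion: drop any example with at least one missing value.
complete_cases = df.dropna()

# One example is deleted, and its non-missing att2 and label values are lost too.
print(len(df) - len(complete_cases), "example(s) deleted")
```

This is defensible when few examples are affected and the missingness is MCAR; otherwise the discarded rows may differ systematically from the retained ones.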

Deletion of attributes

This too is a common approach and is acceptable if the number of missing values represents a large proportion of the whole. It is also acceptable if the attribute does not have much influence on the final result. Clearly, such an attribute would be a candidate for removal anyway. If it additionally turns out to have missing values with an MCAR profile, this would be enough reason to remove it.

If the missing values are MAR or NMAR, deleting the attribute is likely to affect model accuracy and more careful consideration would need to be given to deletion.

Imputation with single values

A simple approach to replacing missing values is to replace them with a single value, for example the value 0, or the mean of the attribute's non-missing values. If the missing data is MCAR, this is acceptable, and the mean is usually the best choice since it has the smallest impact on the characteristics of the data as a whole.

If the missing data is MAR, the missing values depend on other attributes, and it is better in this case to use a modeling technique to work out a value for them. This is discussed in the next sections.

If the data is NMAR, there is no easy way to choose a single value for a replacement and case-by-case consideration is required. As an example, the NMAR data from an earlier section in this chapter looked as though att1 was missing if its own value was between 3 and 7. In this case, the average for the non-missing values of att1 is approximately -1.3. Using this to replace the missing values of att1 may have an adverse effect since we know that the value should be between 3 and 7. In this case, it would be more sensible to use the value 5 for a single replacement. Of course, in this case, we have the luxury of knowing how the data was generated. This will not normally be the case.
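The contrast between a default mean replacement and a domain-informed one can be sketched briefly. The numbers below are illustrative only (they do not reproduce the chapter's att1 data, whose non-missing mean was roughly -1.3); the point is that the two strategies fill the same gaps with very different values:

```python
import numpy as np
import pandas as pd

# Illustrative attribute with two missing values; non-missing mean is -2.0.
att1 = pd.Series([-6.0, -4.0, 0.0, np.nan, 2.0, np.nan])

# MCAR-style default: replace with the mean of the non-missing values.
mean_filled = att1.fillna(att1.mean())

# NMAR case as in the text: if we know the missing values lay between
# 3 and 7, a fixed mid-range replacement of 5 is the more sensible choice.
domain_filled = att1.fillna(5.0)

print(mean_filled.tolist())
print(domain_filled.tolist())
```

In the mean-filled series, the gaps become -2.0, well outside the range the values are known to occupy; the domain-informed fill respects that knowledge.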

Modeling

If the data is MAR, it is sometimes possible to use modeling techniques to determine a value for the missing data based on other attributes. In effect, the missing values become labels to be predicted based on the values of the other attributes.

Using the 10,000 example data used earlier in this chapter, the missing values of att1 can be predicted from the att2 and label attributes. Clearly, att2 by itself can tell you nothing about att1, only the chance that it is missing, but the addition of the label allows some prediction to be made about what att1 would have been. Using the label to predict a missing attribute value is a form of leakage: at deployment time the label is precisely what is unknown, so the imputation cannot be reproduced on unseen data, and this approach should be avoided. In practice, real data usually contains many attributes, and the values of the missing attributes will depend in some way on them. This means that it should usually be possible to construct a model that predicts missing attributes from the others without needing the label.
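The idea of treating missing values as labels to be predicted can be sketched with a least-squares model that uses only label-free predictors. The data below is synthetic and mine (att1 is generated from att2 and att3); it stands in for the chapter's data rather than reproducing it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic attributes: att1 depends linearly on att2 and att3 plus noise.
att2 = rng.normal(size=n)
att3 = rng.normal(size=n)
att1 = 2.0 * att2 - 1.0 * att3 + rng.normal(scale=0.1, size=n)

# Mark the first 20 att1 values as missing (a MAR-style scenario for this sketch).
missing = np.zeros(n, dtype=bool)
missing[:20] = True
att1_obs = att1.copy()
att1_obs[missing] = np.nan

# Fit least squares on the complete cases, using only the other attributes
# (never the label), then predict the missing att1 values.
X = np.column_stack([att2, att3, np.ones(n)])
coef, *_ = np.linalg.lstsq(X[~missing], att1_obs[~missing], rcond=None)
att1_obs[missing] = X[missing] @ coef

print("remaining missing:", int(np.isnan(att1_obs).sum()))
```

Because the predictors are ordinary attributes available at deployment time, exactly the same fitted model can impute missing values in unseen data.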
