Types of missing data

Little and Rubin, the authors of the book Statistical Analysis with Missing Data (first edition 1987, second edition 2002), categorized missing data in three ways. Each category corresponds to a different mechanism by which missing data arises, and the three are described in more detail in the following sections. Understanding which mechanism applies helps us make an informed decision about how to deal with missing values.

Missing completely at random

This is the situation where whether a value is missing depends neither on the values of the available data nor on the missing values themselves. Another way to think of this is to imagine how missing completely at random (MCAR) data could be synthesized. Imagine a dataset consisting of 100 attributes and 10,000 examples, starting with no missing values. Randomly select an attribute from the 100 available and then randomly select an example from the 10,000 available. Set the value in that cell to missing, and repeat this process until the desired amount of missing data is obtained. The missing values in this case are MCAR.
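The procedure above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function name `make_mcar`, the normally distributed data, and the 5% missing rate are all hypothetical choices, not something prescribed by the text. The sketch also draws cells without replacement so that the number of missing values is exact, whereas repeating the random pick could occasionally hit the same cell twice.

```python
import numpy as np

def make_mcar(data, n_missing, seed=0):
    """Return a copy of `data` with `n_missing` cells set to NaN,
    each cell being equally likely to be chosen (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    out = np.array(data, dtype=float)  # np.array copies, so `data` is untouched
    # Pick distinct flat cell positions uniformly at random, then map
    # them back to (example, attribute) coordinates.
    flat_idx = rng.choice(out.size, size=n_missing, replace=False)
    rows, cols = np.unravel_index(flat_idx, out.shape)
    out[rows, cols] = np.nan
    return out

# Assumed stand-in for the dataset: 10,000 examples by 100 attributes
rng = np.random.default_rng(42)
data = rng.normal(size=(10_000, 100))

mcar = make_mcar(data, n_missing=50_000)   # 5% of the 1,000,000 cells
missing_fraction = np.isnan(mcar).mean()
```

Because the choice of cell never looks at any value in the dataset, the missingness cannot depend on observed or unobserved data, which is what makes the result MCAR.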

Missing at random

Missing at random (MAR) is the situation where the missing values are dependent in some way on the values of the other attributes (including potentially the label if it is present) but not on the values of the missing data itself. To synthesize data like this, return to the consideration of the 100 attributes and 10,000 examples dataset mentioned in the previous section. Choose one of the attributes to potentially be missing, then for each example use the values of the other attributes to decide if it really should be missing. For example, if we consider attribute1 to be missing, a simple rule could be to look at the value of attribute2, and if this is greater than a threshold, set attribute1 as missing.
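A minimal sketch of the threshold rule just described, again in Python with NumPy. The column positions chosen for attribute1 and attribute2, the threshold value, and the normally distributed data are assumptions made purely for illustration.

```python
import numpy as np

# Assumed stand-in for the dataset: 10,000 examples by 100 attributes
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 100))

ATTRIBUTE1 = 0     # hypothetical position of the attribute that may go missing
ATTRIBUTE2 = 1     # hypothetical position of the attribute the rule inspects
THRESHOLD = 1.0    # hypothetical cut-off

mar = data.copy()
# The rule reads only attribute2, which stays fully observed.
rule_fires = data[:, ATTRIBUTE2] > THRESHOLD
mar[rule_fires, ATTRIBUTE1] = np.nan
```

The decision to blank attribute1 is made without ever reading attribute1, so the missingness depends only on data that remains available.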

An example of this might be a situation where some test equipment fails to record a value. A closer examination of the data shows that the failure coincides with the equipment being powered off for routine maintenance, and this is indicated by some other attribute in the data. Whether a value is missing is therefore explained by that attribute, not by the value that would have been recorded.

Not missing at random

This is the situation where the missing values depend on the value of the missing data itself. An example of this might be a computer that measures, once a minute, the average CPU load that it experiences. When the computer is busy and the CPU is heavily loaded, values are more likely to be missing, because the computer is too busy to keep up and record a measurement. As a result, the data contains fewer values for the CPU load at the high end. This illustrates the importance of understanding the mechanism that leads to missing values. Ignoring the missing values in this case will bias any investigation, so that the computer will appear less loaded than it really is, and this may lead to a failure to spot a major problem.
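The bias in the CPU load example can be illustrated with a small simulation. Everything here is an assumption made for the sake of the sketch: the load is drawn uniformly between 0 and 100 percent, and the probability of losing a measurement is taken to rise linearly from 0 at idle to 0.9 at full load.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-minute CPU load, in percent
true_load = rng.uniform(0, 100, size=10_000)

# Assumed mechanism: the busier the machine, the more likely the
# measurement is lost (probability 0 at idle, 0.9 at 100% load).
p_missing = 0.9 * true_load / 100
recorded = true_load.copy()
recorded[rng.random(true_load.size) < p_missing] = np.nan

true_mean = true_load.mean()
observed_mean = np.nanmean(recorded)   # mean of the surviving values only
```

Under these assumptions, `observed_mean` should come out clearly below `true_mean`, because the high readings are the ones most often lost.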

To synthesize not missing at random (NMAR) data using the previous 100 attributes and 10,000 examples dataset, it is necessary to choose an attribute to be missing, observe its value and then apply a rule to decide if it should be marked as missing.
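As a rough sketch of this step, the rule below inspects the chosen attribute's own value before blanking it. The column position, the threshold, and the normally distributed data are hypothetical choices for illustration only.

```python
import numpy as np

# Assumed stand-in for the dataset: 10,000 examples by 100 attributes
rng = np.random.default_rng(2)
data = rng.normal(size=(10_000, 100))

TARGET = 0         # hypothetical position of the attribute chosen to be missing
THRESHOLD = 1.0    # hypothetical cut-off

nmar = data.copy()
# The rule reads the target attribute itself, then hides what it read.
own_value_high = data[:, TARGET] > THRESHOLD
nmar[own_value_high, TARGET] = np.nan
```

With this particular rule, no surviving value of the target attribute exceeds the threshold, and nothing else in the dataset records why the values are gone.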

Note that we are using the value itself to decide if it should be missing and this, of course, raises an interesting issue. In real life, the data is missing and you do not have the value before it was missing. You can speculate about the mechanism that leads to the missing values but fundamentally there is no way to be sure. Discussing this further is out of the scope of this book, so we can settle on the mechanism for creating the data and from there, see if there are ways to detect it.
