Categorizing missing data

Having settled on the types of missing data, the question that arises is how to categorize the missing data in a given dataset.

This section works through a detailed set of examples using synthetic data and a RapidMiner Studio process that is available with the files that accompany this book. The examples are intended to be followed alongside the text. The process is called MCARDetection.xml.

The first step is to make some synthetic data containing missing data of each type. To illustrate the key points, the synthetic data is kept small and simple so that it can be easily displayed and understood. Of course, real data will not be like this, but the techniques are usable with high-dimensional data.

The RapidMiner Studio process to be used is shown in the following screenshot:

[Screenshot: the RapidMiner Studio process MCARDetection.xml]

This generates a simple example set with 10,000 examples and two attributes named att1 and att2 (the operators labeled 1 and 2 in the screenshot). The label is generated from the values of att1 and att2: the two attributes are added together and the sign of the sum dictates the label. The following screenshot shows a plot from RapidMiner Studio in which the two attributes are placed on the x and y axes and the color is dictated by the label.
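For readers who want to reproduce the setup outside RapidMiner Studio, the following is a minimal Python sketch of the same idea. It is not the book's process: the uniform distribution over -10 to +10, the seed, and the label values "positive" and "negative" are assumptions made for illustration.

```python
import random

def make_examples(n=10_000, seed=42):
    """Return n rows with att1 and att2 drawn uniformly from [-10, 10].

    The label is the sign of att1 + att2. The label names and the
    uniform distribution are assumptions for this sketch.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        att1 = rng.uniform(-10, 10)
        att2 = rng.uniform(-10, 10)
        label = "positive" if att1 + att2 >= 0 else "negative"
        rows.append({"att1": att1, "att2": att2, "label": label})
    return rows

rows = make_examples()
print(len(rows), "examples, first row:", rows[0])
```

Plotting att1 against att2 and coloring by label would show the two classes separated by the diagonal line att1 + att2 = 0.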

[Screenshot: scatter plot of att1 against att2, colored by label]

The RapidMiner process then generates three new attributes from att1, one each for the MCAR, MAR, and NMAR rules. In other words, some of the values of att1 are set to missing based on these rules, but rather than change att1 itself, a new attribute is created to hold the result. In addition, for each of the three cases, another attribute is generated that contains a simple true or false flag to indicate whether the new attribute is missing or not (operators 3 to 10 in the process screenshot).

Simple rules are given here for the three cases:

  • Firstly, for MCAR, the attribute att1MCAR is missing based on the following expression in the RapidMiner Generate Attributes operator:

    if (rand() > 0.1,false,true)

  • For the MAR case, the attribute att1MAR is missing as seen from the following expression:

    if (att2 > 2 && att2 < 6,true,false)

  • Finally, for the NMAR case, the following expression is used to generate att1NMAR:

    if (att1 > 3 && att1 < 7,true,false)
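The three rules above can be sketched in plain Python as follows. This is an illustration rather than the RapidMiner process itself: the uniform data, the seed, and the use of None to stand for a missing value are assumptions, while the attribute names and thresholds are taken from the expressions above.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

rows = []
for _ in range(10_000):
    att1 = rng.uniform(-10, 10)
    att2 = rng.uniform(-10, 10)
    # Each flag is True when the corresponding copy of att1 is missing.
    status = {
        "MCAR": not (rng.random() > 0.1),  # if(rand() > 0.1, false, true)
        "MAR": 2 < att2 < 6,               # depends only on the observed att2
        "NMAR": 3 < att1 < 7,              # depends on the value that goes missing
    }
    row = {"att1": att1, "att2": att2}
    for name, missing in status.items():
        row[f"att1{name}"] = None if missing else att1
        row[f"att1{name}Status"] = missing
    rows.append(row)

for name in ("MCAR", "MAR", "NMAR"):
    share = sum(r[f"att1{name}Status"] for r in rows) / len(rows)
    print(f"att1{name}: {share:.1%} missing")
```

Under the uniform assumption, roughly 10 percent of att1MCAR and roughly 20 percent of att1MAR and att1NMAR should come out missing, because each of the two intervals covers 4 of the 20 units in the range.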

A small fragment of the data in tabular form is shown in the following screenshot:

[Screenshot: table view of a fragment of the example set]

In real life, the complete att1 column will not be available. All that is available is one of the att1MCAR, att1MAR, or att1NMAR columns. Furthermore, the underlying mechanism that generates the missing data is not known either. The point of this exercise is to see whether various techniques can determine the mechanism that leads to the missing data, which in turn will drive the choice of the best method to handle it.

In the table view shown earlier, the status columns have been generated from the missing values and true means the value is missing. The status columns are useful when plotting the data.

Given this data, the next step is to see whether the mechanism that generated the missing data can be determined. Two approaches will be taken: first, correlation is used to determine whether the attributes depend on one another, and second, the data is inspected manually. Operators 11 to 16 perform the correlation calculations for the three missing attribute generation regimes. Operators 12, 14, and 16 are the Correlation Matrix operators, which use squared correlation.

Finding MCAR data

The MCAR correlation matrix, which is the output of operator 12 in the process screenshot, is shown in the following table:

[Table: squared correlation matrix for the MCAR case]

Bear in mind that the attributes att2, att1MCAR, and label are all that will be available, and the att1MCARStatus attribute is derived from att1MCAR. The correlation operator measures how strongly two attributes depend on one another. A squared correlation is used in this case, so a value of 0 indicates no correlation, while a value of 1 indicates perfect correlation or anticorrelation.

As shown in the table, att2 is partially correlated with the label (the value is 0.328), and for the non-missing values, att1MCAR is also partially correlated with the label (the value is 0.344). The missing state of att1MCAR is shown by the att1MCARStatus attribute, and this shows no correlation with either att2 or the label attribute (both values are very close to 0). This is evidence that the missing values of att1MCAR are MCAR but, as we shall see, the attribute could still be NMAR.
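To make the squared correlation concrete, here is a small Python sketch that computes the square of the Pearson correlation for the MCAR case. It is an approximation of the idea, not the Correlation Matrix operator: encoding the label as +1 or -1, encoding the status flag as 1 for missing and 0 for present, and the uniform synthetic data are all assumptions for this sketch.

```python
import random

def squared_correlation(xs, ys):
    """Square of the Pearson correlation coefficient of two numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

rng = random.Random(7)
att1 = [rng.uniform(-10, 10) for _ in range(10_000)]
att2 = [rng.uniform(-10, 10) for _ in range(10_000)]
label = [1.0 if a + b >= 0 else -1.0 for a, b in zip(att1, att2)]
# MCAR flag: 1.0 means att1MCAR is missing for that example.
status = [1.0 if rng.random() <= 0.1 else 0.0 for _ in att1]

att2_vs_label = squared_correlation(att2, label)
status_vs_att2 = squared_correlation(status, att2)
status_vs_label = squared_correlation(status, label)

# att1MCAR against the label, using only the examples where it is present.
observed = [(a, l) for a, l, s in zip(att1, label, status) if s == 0.0]
att1mcar_vs_label = squared_correlation([a for a, _ in observed],
                                        [l for _, l in observed])

print("att2 vs label:           ", round(att2_vs_label, 3))
print("att1MCAR vs label:       ", round(att1mcar_vs_label, 3))
print("att1MCARStatus vs att2:  ", round(status_vs_att2, 4))
print("att1MCARStatus vs label: ", round(status_vs_label, 4))
```

The two status values should come out very close to 0, and the two attribute values should land near one third, which is in line with the pattern described for the table.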

Manual inspection of the data gives the opportunity to spot patterns. The following screenshot shows a histogram of the distribution of att1MCAR and att1 values, generated using the plot capabilities within RapidMiner Studio on the example set output from the Correlation Matrix operator:

[Screenshot: histogram of att1 and att1MCAR values]

There are 10,000 examples and the values for the attributes range between -10 and +10. The histogram has been set to have 10 bins and shows the count of values in each bin; as can be seen, the values are spread approximately equally across all the bins for both attributes. If, as domain experts, we know that att1 follows a certain distribution, such as the one shown, and we see that att1MCAR follows the same distribution, then this is evidence that the missing values for att1 have been generated completely at random. It is also evidence that the att1MCAR attribute has not been generated with an NMAR mechanism. In real life, att1 has missing values, so the att1 histogram seen previously will not be available. In this situation, we have to rely on domain knowledge of the expected distribution.
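The histogram comparison can be approximated without a plot by counting values per bin. The sketch below assumes uniform synthetic data and uses None for a missing value; the helper name bin_counts is hypothetical and not part of RapidMiner.

```python
import random

def bin_counts(values, low=-10.0, high=10.0, bins=10):
    """Count values per equal-width bin over [low, high], skipping None."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        if v is None:
            continue
        index = min(int((v - low) / width), bins - 1)  # v == high goes in the last bin
        counts[index] += 1
    return counts

rng = random.Random(7)
att1 = [rng.uniform(-10, 10) for _ in range(10_000)]
att1_mcar = [None if rng.random() <= 0.1 else v for v in att1]

full_counts = bin_counts(att1)
mcar_counts = bin_counts(att1_mcar)
ratios = [observed / full for full, observed in zip(full_counts, mcar_counts)]

print(" att1  att1MCAR  ratio")
for full, observed, ratio in zip(full_counts, mcar_counts, ratios):
    print(f"{full:5d} {observed:9d} {ratio:6.2f}")
```

If the missing values are MCAR, the ratio column should be roughly constant across the bins, which is the numeric counterpart of the two histograms having the same shape.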

Finding MAR data

To find the MAR data, repeat the MCAR investigation. In the MAR case, this results in the correlation matrix shown in the following screenshot:

[Screenshot: squared correlation matrix for the MAR case]

This differs from the MCAR case because there is now a correlation between att2 and the missing or present status of att1, as indicated by att1MARStatus. Note that there is no correlation between the non-missing values of att1MAR and att2.

There is, however, a relationship between att2 and whether att1MAR is missing, and this can clearly be seen by plotting att1MAR and att2 together on a scatter plot, as shown in the following screenshot:

[Screenshot: scatter plot of att1MAR against att2]

The graph clearly shows that att1MAR is missing if att2 is between 2 and 6 (in accordance with the formula in the Generate Attributes operator). This means that the missing values of att1MAR follow the MAR mechanism.
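The pattern visible in the scatter plot can also be checked numerically. The following sketch, which assumes uniform synthetic data and None for missing values, finds the range of att2 over which att1MAR is missing and confirms that no present value of att1MAR falls inside that range.

```python
import random

rng = random.Random(7)
rows = []
for _ in range(10_000):
    att1, att2 = rng.uniform(-10, 10), rng.uniform(-10, 10)
    att1_mar = None if 2 < att2 < 6 else att1  # the MAR rule from the text
    rows.append((att2, att1_mar))

# Range of att2 among the examples where att1MAR is missing.
missing_att2 = [att2 for att2, att1_mar in rows if att1_mar is None]
low, high = min(missing_att2), max(missing_att2)

# Examples where att1MAR is present even though att2 lies in that range.
observed_inside = sum(
    1 for att2, att1_mar in rows
    if att1_mar is not None and low <= att2 <= high
)

print(f"att1MAR is missing for att2 between {low:.2f} and {high:.2f}")
print("present values inside that range:", observed_inside)
```

A clean band like this, in which an observed attribute fully explains the missing status, is what the MAR mechanism looks like in data; with real data the boundary would rarely be this sharp.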

Finding NMAR data

For the NMAR case, the correlation between the available attributes is shown in the following figure:

[Figure: squared correlation matrix for the NMAR case]

Recall that NMAR means the missing status of att1NMAR depends on the value of the attribute before it was missing. By definition, therefore, there is no value to see because it is missing, so there is no way to prove that the underlying mechanism is NMAR. The correlation matrix is also very similar to the MCAR case, and this can lead to a mistaken conclusion that the data is MCAR, leading to an incorrect method for handling the missing data.

There is one subtle difference, however, and it is the slight correlation between the att1NMARStatus and label attributes. If we believe that att1 itself has predictive power for the label attribute, and we can see that the corresponding missing status attribute, which is either true or false, also has predictive power, we may hypothesize that the missing status depends somehow on the value before it was missing.

A histogram of att1NMAR is shown in the following screenshot:
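The subtle difference can be illustrated by comparing the squared correlation between the missing status and the label in the MCAR and NMAR cases. As before, this is a sketch under assumptions: uniform synthetic data, the label encoded as +1 or -1, and the status flag encoded as 1 for missing and 0 for present.

```python
import random

def squared_correlation(xs, ys):
    """Square of the Pearson correlation coefficient of two numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

rng = random.Random(7)
att1 = [rng.uniform(-10, 10) for _ in range(10_000)]
att2 = [rng.uniform(-10, 10) for _ in range(10_000)]
label = [1.0 if a + b >= 0 else -1.0 for a, b in zip(att1, att2)]

mcar_status = [1.0 if rng.random() <= 0.1 else 0.0 for _ in att1]
nmar_status = [1.0 if 3 < a < 7 else 0.0 for a in att1]

mcar_vs_label = squared_correlation(mcar_status, label)
nmar_vs_label = squared_correlation(nmar_status, label)
nmar_vs_att2 = squared_correlation(nmar_status, att2)

print("att1MCARStatus vs label:", round(mcar_vs_label, 4))
print("att1NMARStatus vs label:", round(nmar_vs_label, 4))
print("att1NMARStatus vs att2: ", round(nmar_vs_att2, 4))
```

The NMAR status should show a small but clearly non-zero value against the label while remaining uncorrelated with att2, whereas the MCAR status stays close to 0 against both. The signal is weak, so it supports a hypothesis rather than proving the mechanism.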

[Screenshot: histogram of att1NMAR values]

Compare this to the histogram for the MCAR case. If we expected an even distribution for att1NMAR, we can see that values between 3 and 7 are missing. Of course, if we do not know, as domain experts, what the distribution of att1NMAR should be, it is entirely possible that the observed distribution is the true one and that the missing values are MCAR.
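The gap in the NMAR histogram can be reproduced with a simple text histogram. This sketch assumes uniform synthetic data, None for missing values, and 10 bins of width 2 over -10 to +10, matching the bin count used for the MCAR histogram.

```python
import random

rng = random.Random(7)
att1 = [rng.uniform(-10, 10) for _ in range(10_000)]
att1_nmar = [None if 3 < v < 7 else v for v in att1]  # the NMAR rule from the text

# Count the present values in 10 bins of width 2 covering [-10, 10].
counts = [0] * 10
for v in att1_nmar:
    if v is not None:
        counts[min(int((v + 10) / 2), 9)] += 1

for i, count in enumerate(counts):
    low = -10 + 2 * i
    print(f"[{low:+3d}, {low + 2:+3d}) {count:5d} {'#' * (count // 50)}")
```

The bin from 4 to 6 is empty and its two neighbors are roughly half full, because the missing interval from 3 to 7 only partly overlaps them. Without knowing that att1 should be evenly distributed, the same output could equally be read as the true shape of the data.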

A cautionary note

Real data will never be as easy to interpret, and you are very likely to find that missing attributes exhibit MCAR, MAR, and NMAR behavior at once. As very often happens, each attribute has to be dealt with on a case-by-case basis.
