Chapter 6. Missing Values

Very often, some attributes within examples do not have a value. This is missing data. It can arise in many ways, and it is important to deal with because some algorithms suffer profoundly even when only a small percentage of the data is missing. There are different types of missing data, and the type can affect the approach used to deal with it.

Deleting the examples with missing data is rarely a good strategy. Not only is all the data potentially valuable, but it is also entirely possible that the fact that a value is missing is correlated with the label being predicted, and finding such relationships might be the whole point of the data mining process. It is also a bad idea to fill in the missing values manually. Not only is this not scalable, but it also risks introducing a bias that can ruin subsequent modeling activities. A systematic approach, based on an understanding of how the missing data arises, is better.
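The point about missing data being correlated with the prediction can be checked with a simple count. The following is a minimal Python sketch, not a RapidMiner process; the attribute name `income`, the label `churn`, and the toy examples are all invented for illustration. It compares how often an attribute is missing within each label value.

```python
from collections import defaultdict

def missing_rate_by_label(examples, attribute, label):
    """Return {label value: fraction of examples whose attribute is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for example in examples:
        totals[example[label]] += 1
        if example.get(attribute) is None:
            missing[example[label]] += 1
    return {value: missing[value] / totals[value] for value in totals}

# Hypothetical toy data: None marks a missing value.
examples = [
    {"income": 42000, "churn": "no"},
    {"income": None, "churn": "yes"},
    {"income": 51000, "churn": "no"},
    {"income": None, "churn": "yes"},
    {"income": 38000, "churn": "yes"},
    {"income": None, "churn": "no"},
]

rates = missing_rate_by_label(examples, "income", "churn")
for value, rate in sorted(rates.items()):
    print(f"churn={value}: {rate:.0%} of income values missing")
```

If the rates differ markedly between label values, deleting the examples with missing data would discard exactly the examples that carry this signal. A toy sample of this size proves nothing by itself; the sketch only shows the kind of comparison to make.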

RapidMiner allows such investigations to be performed quickly, and this chapter explains in detail the exploratory processes that can be built using various looping operators.

Missing or empty?

Before starting, it is important to be clear on the distinction between missing and empty data. They are very different. Missing data has a real value that was not recorded, whereas empty data has no value to record. It is unfortunate that they both look the same, and sometimes only a domain expert can tell them apart. A good analogy is to compare the journeys of a commuter train and an express train, where the attributes are the times at which they stop at stations along the route. The commuter train stops at many intermediate stations, whereas the express train stops at far fewer stations over the same length of track. The absence of stopping times at the intermediate stations for the express train is not missing data. Domain knowledge of trains is needed to know this, but it is quite clear to all train users that inventing stopping times at the intermediate stations for the express train would be wrong.
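The train analogy can be written down as a small rule. The following Python sketch assumes the domain knowledge is available as a table of scheduled stops; the station names, the `SCHEDULED_STOPS` table, and the `classify_gap` helper are hypothetical and exist only to illustrate the distinction.

```python
# Hypothetical domain knowledge: the stations at which each service stops.
SCHEDULED_STOPS = {
    "commuter": {"A", "B", "C", "D"},
    "express": {"A", "D"},
}

def classify_gap(service, station, stop_time):
    """Label a stopping-time attribute as present, missing, or empty."""
    if stop_time is not None:
        return "present"
    if station in SCHEDULED_STOPS[service]:
        # The train does stop here, so a real time exists but was not recorded.
        return "missing"
    # The train never stops here, so there is no value to record.
    return "empty"

print(classify_gap("commuter", "B", None))   # the commuter train stops at B
print(classify_gap("express", "B", None))    # the express train passes through B
print(classify_gap("express", "A", "08:15"))
```

Only the gaps classified as missing are candidates for replacement. The empty ones should be left alone, which is the same conclusion the train users reach.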

Having got this distinction clear, the next step is to understand what the different types of missing data are. This is important because it dictates how the missing values should be handled.
