Chapter 7. Using Transformation to Clean Data

The third type of data transformation cleans a dataset to fix quality and consistency issues. Cleaning predominantly involves manipulating individual field values within records. The most common variants of cleaning address missing (or NULL) values and invalid values.

Addressing Missing/NULL Values

There are two basic approaches to addressing missing/NULL values. On the one hand, you can filter out records with missing or NULL fields. On the other hand, you can replace the missing or NULL values, a process often referred to as data imputation, which can draw on many different strategies. In some cases, the best approach is to insert the average or median value of the field. In other cases, it is better to generate values from similar records; for example, similar customers or similar transactions. Alternatively, if your data has strong ordering (because it is a time-series dataset, for example), you might be able to fill in missing values by carrying forward the last valid value. The sketch below illustrates these options.
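As a concrete sketch of these options, the following pandas snippet shows filtering, median imputation, imputation from similar records, and a time-series forward fill. The file name and column names here are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical transactions dataset; the file and column names
# are assumptions for illustration.
df = pd.read_csv("transactions.csv")

# Filtering: drop records whose amount is missing.
filtered = df.dropna(subset=["amount"])

# Imputation 1: replace missing values with the column median.
df["amount_median"] = df["amount"].fillna(df["amount"].median())

# Imputation 2: generate values from similar records -- here, the
# mean amount of transactions in the same customer segment.
df["amount_by_segment"] = df["amount"].fillna(
    df.groupby("customer_segment")["amount"].transform("mean")
)

# Imputation 3: for strongly ordered (time-series) data, carry
# the last valid value forward.
df = df.sort_values("timestamp")
df["amount_ffill"] = df["amount"].ffill()
```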

Addressing Invalid Values

Extending beyond missing values, another key set of cleaning transformations deals with invalid values—invalid because they are inconsistent with other fields (e.g., a customer age compared with their date of birth), ambiguous (e.g., two-digit years or abbreviations like “CT”—is that Connecticut or Court?), or improperly encoded. In some cases, the correct or consistent value for the field can be calculated and used to overwrite the original value in the dataset. In other cases, it might make sense for you to simply mark values as invalid. You can then conduct two parallel analyses, one that includes the invalid values and one that excludes them, revealing the impact that the invalid values have on your results.
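One lightweight way to mark values rather than overwrite them, sketched here against a hypothetical customers table, is to add a boolean flag and then run the two analyses side by side:

```python
import pandas as pd

# Hypothetical customers dataset; the file and column names
# are assumptions for illustration.
df = pd.read_csv("customers.csv")

# Flag ages that are inconsistent with the date of birth,
# allowing a one-year tolerance for birthdays not yet reached.
derived_age = (
    pd.Timestamp.now() - pd.to_datetime(df["date_of_birth"])
).dt.days // 365
df["age_invalid"] = (df["age"] - derived_age).abs() > 1

# Parallel analyses: one including and one excluding flagged records.
mean_including = df["age"].mean()
mean_excluding = df.loc[~df["age_invalid"], "age"].mean()
print(mean_including, mean_excluding)
```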

A more complex variety of fixing invalid values involves data standardization. Suppose that you have a customer dataset with a current-state-of-residence field, and that every customer represented in that dataset is known to reside in the United States. A reasonable validity check on the current-state-of-residence field is that it should fall into one of the known US states. Suppose, however, that there are misspellings: “Californa,” “Westvirginia,” and “Dakota.” Standardizing these field values to a fixed library of valid values is a good way to improve dataset quality. There are a number of ways to perform this kind of standardization. The most common method uses edit distance to catch misspellings; that is, strings that are similar, like “Californa” and “California,” should be treated as the same entity and converted to the same spelling.
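As a minimal sketch of edit-distance standardization, the following uses Python's standard-library difflib, which ranks candidates by string similarity. The state list is truncated and the similarity cutoff is an assumption you would tune for your data:

```python
import difflib

# Truncated list of valid values; a real library would hold all states.
US_STATES = ["California", "West Virginia", "North Dakota",
             "South Dakota", "Connecticut"]

def standardize_state(value, valid=US_STATES, cutoff=0.8):
    """Map a possibly misspelled state name to its closest valid
    spelling, or return None if nothing is close enough."""
    matches = difflib.get_close_matches(value, valid, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(standardize_state("Californa"))     # -> "California"
print(standardize_state("Westvirginia"))  # -> "West Virginia"
print(standardize_state("Dakota"))        # -> None (ambiguous)
```

Note that “Dakota” falls below the cutoff for both “North Dakota” and “South Dakota,” which is exactly the kind of ambiguity that edit distance alone cannot resolve.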

More specific standardization techniques rely on domain knowledge. For example, is “Dakota” supposed to be “North Dakota” or “South Dakota”? If we have a ZIP code in another field of the record, perhaps we can use a mapping of ZIP codes to states to make this determination. A slightly less reliable mapping, now that cell phone numbers can be kept when people move between regions, could use the area code of a customer phone number field.
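Continuing the example, a hypothetical ZIP-code lookup for resolving the ambiguous “Dakota” might look like the following sketch; the prefix table is a tiny illustrative subset of a real mapping:

```python
# Hypothetical ZIP-prefix-to-state mapping (tiny illustrative
# subset; a real mapping would cover all prefixes).
ZIP_PREFIX_TO_STATE = {
    "58": "North Dakota",
    "57": "South Dakota",
    "06": "Connecticut",
}

def resolve_state(state_value, zip_code):
    """Disambiguate an incomplete state name using the ZIP code."""
    if state_value == "Dakota":
        return ZIP_PREFIX_TO_STATE.get(zip_code[:2], state_value)
    return state_value

print(resolve_state("Dakota", "58501"))  # -> "North Dakota" (Bismarck)
```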
