Categorizing data quality

It is generally accepted that data quality issues fall into one of the following areas:

  • Accuracy
  • Completeness
  • Update status
  • Relevance
  • Consistency (across sources)
  • Reliability
  • Appropriateness
  • Accessibility

The quality of your data is affected by the way it is entered, stored, and managed. Addressing data quality, most often referred to as data quality assurance (DQA), requires routine, regular review and evaluation of the data, along with ongoing processes termed profiling and scrubbing. This remains vital even when the data is stored in multiple disparate systems, which makes these processes more difficult.

Here, tidying the data is much more project centric: we are not concerned with creating a formal DQA process, only with making certain that the data is correct for the particular predictive project at hand.

In statistics, data that has not yet been reviewed by the data scientist is considered raw and cannot be reliably used in a predictive project. Tidying the data usually involves several steps, and taking the extra time to break the work out into those steps is strongly recommended, rather than haphazardly addressing multiple data issues at once.

The first step

The first step requires bringing the data to what may be called mechanical correctness. Here, you focus on things such as:

  • File format and organization: Field order, column headers, number of records, and so on
  • Record data typing (such as numeric values stored as strings)
  • Date and time processing (typically reformatting values into standard formats or consistent formats)
  • Incorrect content: Wrong category labels, unknown or unexpected character encodings, and so on
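
For instance, a quick mechanical check of a newly loaded file can surface many of these issues. The following is a minimal sketch, assuming a hypothetical CSV file named sales.csv:

    > sales <- read.csv("sales.csv", stringsAsFactors = FALSE)   # load the raw file as-is
    > colnames(sales)     # confirm field order and column headers
    > nrow(sales)         # confirm the number of records
    > str(sales)          # confirm how each column was typed (numeric, character, and so on)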

The next step

The second step is to address the statistical soundness of the data. Here we correct values that may be mechanically correct but will most likely (depending upon the subject matter) distort a statistical outcome.

These issues may include:

  • Positive/negative mismatch: Age variables may be reported as negative
  • Invalid (based on accepted logic) data: An under-aged person may be registered to possess a driver's license
  • Missing data: Key data values may just be missing from the data source

The final step

Finally, the last step (before actually attempting to use the data) may be a re-formatting step. In this step, the data scientist determines the form the data must take in order to process it most efficiently, based upon the intended use or objective.

For example, one might decide to:

  • Reorder or repeat columns; that is, some final processing may require that redundant or repeated data be generated within a file source so it can be processed correctly or more easily
  • Drop columns and/or records (based upon specific criteria)
  • Set decimal places
  • Pivot data
  • Truncate or rename values
  • And so on
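
Several of these operations can be handled directly in base R. The following is a minimal sketch, assuming a hypothetical data frame df with columns region, amount, and notes:

    > df <- df[ , c("region", "amount")]                  # drop the notes column and reorder the remaining columns
    > df <- df[df$amount > 0, ]                           # drop records based upon specific criteria
    > df$amount <- round(df$amount, 2)                    # set decimal places
    > names(df)[names(df) == "amount"] <- "sale_amount"   # rename a column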

There are a variety of somewhat routine methods for using R to resolve the aforementioned data errors.

For example:

  • Changing a data type: Also referred to as "data type conversion," one can use R's is.* functions to test an object's data type and the as.* functions to perform an explicit conversion. A simple example is shown here:
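    The following is a minimal sketch, assuming a hypothetical character vector x that holds numeric values stored as strings:
    > x <- c("10", "20", "30")    # numeric values stored as strings
    > is.numeric(x)               # test the current data type
    [1] FALSE
    > x <- as.numeric(x)          # explicit conversion to numeric
    > is.numeric(x)
    [1] TRUE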
  • Date and time: There are multiple ways to manage date information in R. In fact, we can extend the preceding example and mention the as.Date function. Date values are typically important to a statistical model, so it is worth taking the time to understand the format of a model's date fields and ensure that they are dealt with properly. Most often, dates and times appear in the raw data as strings, which can be converted and formatted as required. In the following code, the string fields containing a saledate and a returndate are converted to date type values and used with a common time function, difftime:
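    A minimal sketch of that conversion; the specific date values are assumed for illustration:
    > saledate <- "2017-01-15"                      # date values arrive as strings
    > returndate <- "2017-02-03"
    > saledate <- as.Date(saledate)                 # convert the strings to Date values
    > returndate <- as.Date(returndate)
    > difftime(returndate, saledate, units = "days")
    Time difference of 19 days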
  • Category labels: These are critical to statistical modeling as well as data visualization. An example of using labels with a sample of categorized data might be assigning a label to a participant in a study, perhaps by level of education: 1 = Doctoral, 2 = Masters, 3 = Bachelors, 4 = Associates, 5 = Nondegree, 6 = Some College, 7 = High School, or 8 = None:
    > participant<-c(1,2,3,4,5,6,7,8)
    > recode<-c(Doctoral=1, Masters=2, Bachelors=3, Associates=4, Nondegree=5, SomeCollege=6, HighSchool=7, None=8)
    > (participant<-factor(participant, levels=recode, labels=names(recode)))
    [1] Doctoral Masters Bachelors Associates Nondegree SomeCollege HighSchool None       
    Levels: Doctoral Masters Bachelors Associates Nondegree SomeCollege HighSchool None
  • Assigning labels to data not only helps with readability, but also allows a machine learning algorithm to learn from the sample and apply the same labels to other, unlabeled data.
  • Missing data parameters: Many times, missing data can be excluded from a calculation simply by setting an appropriate parameter value. For example, the R function var, which computes the variance of a variable, accepts an na.rm argument; setting na.rm to TRUE tells R to exclude any records or cases with missing values. The related functions cov and cor (covariance and correlation) provide a use argument (for example, use = "complete.obs") for the same purpose, as shown below:
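    A minimal sketch, assuming a hypothetical numeric vector with one missing value:
    > x <- c(2, 4, NA, 8)                 # one value is missing
    > y <- c(1, 2, 4, 3)
    > var(x)                              # NA, because the missing value is included
    [1] NA
    > var(x, na.rm = TRUE)                # exclude cases with missing values
    [1] 9.333333
    > cor(x, y, use = "complete.obs")     # cor and cov take a use argument instead
    [1] 0.9819805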
  • Various other data tidying nuisances can exist within your data: incorrectly signed numeric values (for example, a negative value for a participant's age), values that are invalid under the accepted scenario logic (for example, a participant's age versus level of education, since it isn't feasible that a 10-year-old would have earned a Master's degree), and values that are simply missing (is a participant's lack of response an indication of a not-applicable question, or an error?). Thankfully, R offers several approaches to each of these scenarios; a brief sketch follows.
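    A minimal sketch of a few such checks, assuming a hypothetical data frame participants with age, education, and response columns; the particular fixes shown (taking the absolute value of a mis-signed age, flagging suspect records, and counting missing responses) are illustrative choices, not the only options:
    > participants$age <- abs(participants$age)                        # correct mis-signed ages
    > suspect <- participants$age < 18 & participants$education == "Masters"
    > participants[suspect, ]                                          # review records that fail the scenario logic
    > sum(is.na(participants$response))                                # count missing responses before deciding how to treat them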