Tidying Up Your Data

We are at the point in the data processing pipeline where we need to examine the data we have retrieved and address any anomalies that would otherwise surface during analysis. These anomalies can exist for many reasons. Sometimes parts of the data are not recorded, or are lost. The data may use units that do not match your system's units. And often, certain data points are duplicated.

This process of dealing with anomalous data is often referred to as tidying your data, and you will see this term used many times in data analysis. It is a very important step in the pipeline, and it can consume much of your time before you even get to simple analyses.

Tidying data can be tedious, particularly when using programming tools that are not designed for the specific task of data cleanup. Fortunately, pandas has many tools for addressing these issues, and they help us work efficiently at the same time.

In this chapter, we will cover many of the tasks involved in tidying data. Specifically, you will learn:

  • The concept of tidy data
  • How to work with missing data
  • How to find NaN values in data
  • How to filter (drop) missing data
  • How pandas handles missing values in calculations
  • How to find, filter, and fix unknown values
  • How to interpolate missing values
  • How to identify and remove duplicate data
  • How to transform values using replace, map, and apply
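
As a quick preview of these topics, the following is a minimal sketch and not the chapter's own example. The DataFrame, its column names (city, temp_c), and the "unknown" placeholder are hypothetical, made up purely for illustration. The pandas calls themselves (isna, dropna, interpolate, duplicated, drop_duplicates, replace, map, and apply) are standard.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing temperature, one duplicated row,
# and the string "unknown" standing in for an unrecorded city.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Lima", "Cairo", "unknown"],
    "temp_c": [4.0, 20.0, 20.0, np.nan, 18.0],
})

# Find NaN values: a Boolean mask, and a count of missing entries.
missing_mask = df["temp_c"].isna()
n_missing = int(missing_mask.sum())

# Filter (drop) rows that contain any missing value.
dropped = df.dropna()

# Calculations such as mean() skip NaN by default.
mean_temp = df["temp_c"].mean()

# Fill the gap by linear interpolation between neighboring values.
interpolated = df["temp_c"].interpolate()

# Identify and remove duplicate rows (the first occurrence is kept).
dup_mask = df.duplicated()
deduped = df.drop_duplicates()

# Turn the placeholder into a proper missing value with replace.
cleaned_city = df["city"].replace("unknown", np.nan)

# Transform each value with map: Celsius to Fahrenheit.
fahrenheit = df["temp_c"].map(lambda c: c * 9 / 5 + 32)

# Transform row by row with apply: build a simple text label.
labels = df.apply(lambda row: f"{row['city']}: {row['temp_c']}", axis=1)

print(n_missing, mean_temp)
print(deduped)
```

Each of these operations returns a new object and leaves df unchanged, so the steps can be tried independently and in any order.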