Creating a data quality report

Data quality is a fundamental issue for business intelligence. The reliability of your analysis and, by extension, the decisions you make based on that analysis, depend on the quality of data you use.

A data quality report provides objective measures of the quality of your data, making it a critical first step in the business intelligence process.

Getting ready

To create our report, we are going to create a number of DataFrames from our dataset, and then merge them together at the end. The parts of our report will include the following:

  1. Available columns
  2. For each column:
    • Data type
    • Count of missing values
    • Count of present values
    • Number of unique values
    • Minimum value
    • Maximum value

How to do it…

  1. To create your data quality report, start by importing the libraries that you need:
    import pandas as pd
  2. Next, import the data from the source CSV file using the Create a Pandas DataFrame From a CSV File recipe:
    accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'
    accidents = pd.read_csv(accidents_data_file,
                            sep=',',
                            header=0,
                            index_col=False,
                            parse_dates=['Date'],
                            dayfirst=True,
                            on_bad_lines='error',
                            skip_blank_lines=True
                            )
  3. Create a second DataFrame containing the columns in the accidents DataFrame:
    columns = pd.DataFrame(list(accidents.columns.values))
    columns
  4. Next, create a DataFrame of the data type of each column:
    data_types = pd.DataFrame(accidents.dtypes,
                              columns=['Data Type'])
    data_types
  5. After that, create a DataFrame with the count of missing values in each column:
    missing_data_counts = pd.DataFrame(accidents.isnull().sum(),
                                       columns=['Missing Values'])
    missing_data_counts
  6. Next, create a DataFrame with the count of present values in each column:
    present_data_counts = pd.DataFrame(accidents.count(),
                                       columns=['Present Values'])
    present_data_counts
  7. Next, create a DataFrame with the count of unique values in each column:
    unique_value_counts = pd.DataFrame(columns=['Unique Values'])
    for v in list(accidents.columns.values):
        unique_value_counts.loc[v] = [accidents[v].nunique()]
    unique_value_counts
  8. After that, create a DataFrame with the minimum value in each column:
    minimum_values = pd.DataFrame(columns=['Minimum Value'])
    for v in list(accidents.columns.values):
        minimum_values.loc[v] = [accidents[v].min()]
    minimum_values
  9. The last DataFrame we'll create holds the maximum value in each column:
    maximum_values = pd.DataFrame(columns=['Maximum Value'])
    for v in list(accidents.columns.values):
        maximum_values.loc[v] = [accidents[v].max()]
    maximum_values
  10. Finally, merge all the DataFrames together by the index:
    data_quality_report = data_types.join(present_data_counts).join(missing_data_counts).join(unique_value_counts).join(minimum_values).join(maximum_values)
  11. Lastly, print out a nice report:
    print("\nData Quality Report")
    print("Total records: {}".format(len(accidents.index)))
    data_quality_report
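
If you'd like to share the report outside of your notebook, one option, sketched here with an assumed output filename, is to save it with to_csv():

# Optionally, save the report to a CSV file for sharing
data_quality_report.to_csv('data_quality_report.csv')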

How it works…

The first thing we do is import the Python libraries that we'll need to create the report. Once we use the Create a Pandas DataFrame From a CSV File recipe to create a DataFrame from our CSV file, we're ready to create each of the DataFrames that will comprise the report:

# Create a dataframe of the columns in the accidents dataframe
columns = pd.DataFrame(list(accidents.columns.values))
columns

We first create a DataFrame of the column names. As in the Create a Pandas DataFrame from a MongoDB query recipe, we work from the column values of the accidents DataFrame: we use list() to create a list of the column names, and use the result to create the columns DataFrame:

# Create a dataframe of the data type of each column
data_types = pd.DataFrame(accidents.dtypes,
                          columns=['Data Type'])
data_types

We next create a DataFrame of the data type of each column. To do this, we use the dtypes attribute to get the data type of each column in the DataFrame, and use columns=['Data Type'] to specify the name of the single column in our new DataFrame:

# Create a dataframe with the count of missing values in each column
missing_data_counts = pd.DataFrame(accidents.isnull().sum(),
                                   columns=['Missing Values'])
missing_data_counts

To create the DataFrame of missing data counts, we chain together two methods provided by a Pandas DataFrame, isnull() and sum(). This returns, for each column in the DataFrame, the count of cells with missing data. We then put that count into the 'Missing Values' column of the new DataFrame:

# Create a dataframe with the count of present values in each column
present_data_counts = pd.DataFrame(accidents.count(),
                                   columns=['Present Values'])
present_data_counts

Next, we create a DataFrame with the per-column count of cells that contain a value. To do this, we simply call count() on the DataFrame, which by default returns the number of non-null values in each column:

# Create a dataframe with the count of unique values in each column
unique_value_counts = pd.DataFrame(columns=['Unique Values'])
for v in list(accidents.columns.values):
    unique_value_counts.loc[v] = [accidents[v].nunique()]
unique_value_counts

The unique value count DataFrame is a bit more complicated to create. We first create an empty DataFrame with a single column: 'Unique Values'. We then create a Python list of the column names in the accidents DataFrame and loop through it. In our loop, we do the following:

  • Use .nunique() to get a count of the unique values in the given column
  • Use the DataFrame's .loc indexer to address the row by its label, which in this case is the column name, such as 'Accident Index'
  • Assign the unique value count to the row in our DataFrame, based on the label value:
# Create a dataframe with the minimum value in each column
minimum_values = pd.DataFrame(columns=['Minimum Value'])
for v in list(accidents.columns.values):
    minimum_values.loc[v] = [accidents[v].min()]
minimum_values

Similar to what we did when creating the unique_value_counts DataFrame, to create the minimum_values DataFrame, we loop through a list of column values in the accidents DataFrame, get the minimum value for each column using the min() function, and insert it into our minimum_values DataFrame beside the appropriate column name:

# Create a dataframe with the maximum value in each column
maximum_values = pd.DataFrame(columns=['Maximum Value'])
for v in list(accidents.columns.values):
    maximum_values.loc[v] = [accidents[v].max()]
maximum_values

We create the maximum_values DataFrame just as we did with minimum_values, only for this one, we use the max() function to get the maximum value in each column:

# Merge all the dataframes together by the index
data_quality_report = data_types.join(present_data_counts).join(missing_data_counts).join(unique_value_counts).join(minimum_values).join(maximum_values)

With all of our DataFrames created, we merge them all together, starting with the data_types DataFrame. Pandas allows us to chain method calls, so rather than writing multiple lines of code, we can chain a join() call for each of our DataFrames in a single statement.

Something to note here is that since each of our DataFrames has exactly the same index values, which in this example are the column names of the accidents DataFrame, we don't need to provide any additional arguments to join(); we simply call it on each result, passing the next DataFrame as the only argument.

This joining by index is very similar to the way a join statement works in a relational database. If your DataFrames share the same index, you can simply join them together as we've done here. If, however, you have a primary/foreign key relationship between two DataFrames, you can use those keys to perform an SQL-like join.
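
For illustration, here is a minimal sketch of such a key-based join using the Pandas merge() function; the two DataFrames and their columns are made up for this example:

import pandas as pd

# Two hypothetical tables related by a vehicle_type key
vehicles = pd.DataFrame({'vehicle_type': [1, 2, 3],
                         'vehicle_count': [100, 250, 75]})
vehicle_types = pd.DataFrame({'vehicle_type': [1, 2, 3],
                              'description': ['Bicycle', 'Car', 'Bus']})

# An SQL-like inner join on the shared key column
joined = pd.merge(vehicles, vehicle_types, on='vehicle_type', how='inner')
joined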

# Print out a nice report
print("\nData Quality Report")
print("Total records: {}".format(len(accidents.index)))
data_quality_report

At last, we print out the report and discover the quality of our data. For good measure, we also print out the total record count. This number provides the context for all the counts in our report.

Another thing to note is that we've included the minimum and maximum values of our object type columns. They are included only because they are part of the report we are creating. When presenting this information to others, you can tell them to disregard the minimum and maximum values of those columns; however, we keep all the columns in the report for the sake of the other metrics.
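
If you would rather blank those values out before sharing the report, a minimal sketch (assuming the accidents and data_quality_report DataFrames from this recipe) could look like this:

# Identify the object (string) columns in the accidents DataFrame
object_columns = accidents.select_dtypes(include=['object']).columns

# Blank out their minimum and maximum values in the report
data_quality_report.loc[object_columns, ['Minimum Value', 'Maximum Value']] = ''
data_quality_report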

Tip

Extra Credit

In this recipe, we printed out the record count before we showed the report data. You could add a column to each count DataFrame showing that count as a percentage of the total records. For example, the Longitude column has 4887131 missing values, which is 78.5 percent of all the records. Depending on the analysis you are going to perform, that many missing values could be a problem.
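
A minimal sketch of that extra column, assuming the accidents and missing_data_counts DataFrames from this recipe, might look like this:

# Express each missing-value count as a percentage of all records
total_records = len(accidents.index)
missing_data_counts['Percent Missing'] = (
    missing_data_counts['Missing Values'] / total_records * 100
)
missing_data_counts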
