Data quality is a fundamental issue for business intelligence. The reliability of your analysis and, by extension, the decisions you make based on that analysis, depend on the quality of data you use.
A data quality report provides objective measures of the quality of your data, making it a critical first step in the business intelligence process.
To create our report, we are going to create a number of DataFrames from our dataset, and then merge them together at the end. The parts of our report will include the following:

1. Import the libraries we need:

```python
import pandas as pd
```

2. Create a DataFrame from the CSV file:

```python
accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'
accidents = pd.read_csv(accidents_data_file,
                        sep=',',
                        header=0,
                        index_col=False,
                        parse_dates=['Date'],
                        dayfirst=True,
                        tupleize_cols=False,
                        error_bad_lines=True,
                        warn_bad_lines=True,
                        skip_blank_lines=True)
```

3. Create a DataFrame containing the columns in the accidents DataFrame:

```python
columns = pd.DataFrame(list(accidents.columns.values))
columns
```

4. Create a DataFrame of the data type of each column:

```python
data_types = pd.DataFrame(accidents.dtypes, columns=['Data Type'])
data_types
```

5. Create a DataFrame with the count of missing values in each column:

```python
missing_data_counts = pd.DataFrame(accidents.isnull().sum(), columns=['Missing Values'])
missing_data_counts
```

6. Create a DataFrame with the count of present values in each column:

```python
present_data_counts = pd.DataFrame(accidents.count(), columns=['Present Values'])
present_data_counts
```

7. Create a DataFrame with the count of unique values in each column:

```python
unique_value_counts = pd.DataFrame(columns=['Unique Values'])
for v in list(accidents.columns.values):
    unique_value_counts.loc[v] = [accidents[v].nunique()]
unique_value_counts
```

8. Create a DataFrame with the minimum value in each column:

```python
minimum_values = pd.DataFrame(columns=['Minimum Value'])
for v in list(accidents.columns.values):
    minimum_values.loc[v] = [accidents[v].min()]
minimum_values
```

9. The last DataFrame that we'll create is a DataFrame with the maximum value in each column:

```python
maximum_values = pd.DataFrame(columns=['Maximum Value'])
for v in list(accidents.columns.values):
    maximum_values.loc[v] = [accidents[v].max()]
maximum_values
```

10. Merge all the DataFrames together by the index:

```python
data_quality_report = data_types.join(present_data_counts).join(missing_data_counts).join(unique_value_counts).join(minimum_values).join(maximum_values)
```

11. Print out the report:

```python
print(" Data Quality Report")
print("Total records: {}".format(len(accidents.index)))
data_quality_report
```
The first thing we do is import the Python libraries that we'll need to create the report. Once we use the Create a Pandas DataFrame From a CSV File recipe to create a DataFrame from our CSV file, we're ready to create each of the DataFrames that will comprise the report:
```python
# Create a dataframe of the columns in the accidents dataframe
columns = pd.DataFrame(list(accidents.columns.values))
columns
```
We first create a DataFrame of columns. As in the Create a Pandas DataFrame from a MongoDB query recipe, we need to iterate over the list of column values in the accidents DataFrame. We use list() to create a list of the column names, and use the result to create the columns DataFrame.
```python
# Create a dataframe of the data type of each column
data_types = pd.DataFrame(accidents.dtypes, columns=['Data Type'])
data_types
```
We next create a DataFrame of the data types for each column. To do this, we use the dtypes attribute to get the list of data types in the DataFrame, and use columns=['Data Type'] to specify the name of the column for our new DataFrame.
```python
# Create a dataframe with the count of missing values in each column
missing_data_counts = pd.DataFrame(accidents.isnull().sum(), columns=['Missing Values'])
missing_data_counts
```
To create the DataFrame of missing data counts, we chain together two functions provided by a Pandas DataFrame: isnull() and sum(). The count of the cells with missing data is returned for each column in the DataFrame. We then put that count into the 'Missing Values' column of the new DataFrame.
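To see how this chaining behaves, here is a minimal sketch using a small toy DataFrame in place of the accidents data (the column names and values here are invented for illustration):

```python
import pandas as pd

# A toy DataFrame with some missing values
df = pd.DataFrame({
    'Speed_limit': [30, 40, None, 30],
    'Weather': ['Fine', None, None, 'Rain'],
})

# isnull() marks each cell True/False; sum() counts the Trues per column
missing = pd.DataFrame(df.isnull().sum(), columns=['Missing Values'])
print(missing)
```

The result is a one-column DataFrame whose index is the column names of the original, which is exactly the shape the other report pieces share.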
```python
# Create a dataframe with the count of present values in each column
present_data_counts = pd.DataFrame(accidents.count(), columns=['Present Values'])
present_data_counts
```
Next, we create a DataFrame with the per-column count of cells that contain a value. To do this, we simply call count() on the DataFrame, which returns a count of the non-null values in each column by default.
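As a sanity check, the present and missing counts for any column should sum to the total number of rows. A quick sketch with toy data (the column contents are invented):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': [None, None, 'x']})

present = df.count()          # non-null values per column
missing = df.isnull().sum()   # null values per column

# Every column's present + missing counts add up to the row count
print((present + missing == len(df.index)).all())
```

This identity holds for any DataFrame and is a cheap way to validate the report before sharing it.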
```python
# Create a dataframe with the count of unique values in each column
unique_value_counts = pd.DataFrame(columns=['Unique Values'])
for v in list(accidents.columns.values):
    unique_value_counts.loc[v] = [accidents[v].nunique()]
unique_value_counts
```
The unique value count DataFrame is a bit more complicated to create. We first create an empty DataFrame with a single column: 'Unique Values'. We then create a Python list of the column names in the accidents DataFrame and loop through it. In our loop, we call .nunique() to get a count of the unique values in the given column, and use the .loc indexer of the new DataFrame to insert that count into the row labeled with the column name, such as 'Accident Index'.
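As an aside, newer pandas versions (0.20 and later) provide DataFrame.nunique(), which computes the same per-column counts without an explicit loop; a sketch, assuming such a version is available (the toy columns are invented):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'y', 'z']})

# One call replaces the loop: nunique() returns a Series indexed by column name
unique_value_counts = pd.DataFrame(df.nunique(), columns=['Unique Values'])
print(unique_value_counts)
```

The loop in the recipe works on older pandas releases as well, which is why it is written that way.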
```python
# Create a dataframe with the minimum value in each column
minimum_values = pd.DataFrame(columns=['Minimum Value'])
for v in list(accidents.columns.values):
    minimum_values.loc[v] = [accidents[v].min()]
minimum_values
```
Similar to what we did when creating the unique_value_counts DataFrame, to create the minimum_values DataFrame, we loop through a list of the column names in the accidents DataFrame, get the minimum value for each column using the min() function, and insert it into our minimum_values DataFrame beside the appropriate column name.
```python
# Create a dataframe with the maximum value in each column
maximum_values = pd.DataFrame(columns=['Maximum Value'])
for v in list(accidents.columns.values):
    maximum_values.loc[v] = [accidents[v].max()]
maximum_values
```
We create the maximum_values DataFrame just as we did with minimum_values, only for this one, we use the max() function to get the maximum value in each column.
```python
# Merge all the dataframes together by the index
data_quality_report = data_types.join(present_data_counts).join(missing_data_counts).join(unique_value_counts).join(minimum_values).join(maximum_values)
```
With all of our DataFrames created, we merge them together, starting with the data_types DataFrame. Pandas allows us to chain statements together, so rather than having to write multiple lines of code, we can use a join() call for each of our DataFrames.
Something to note here is that since each of our DataFrames has exactly the same index values (in this example, the column names of the accidents DataFrame), we don't need to provide any arguments to the join() calls; we simply call them, passing the DataFrame to join as the only argument.
This joining by index is very similar to the way a JOIN statement works in a relational database. If your DataFrames share the same index, you can simply join them together as we've done here. If, however, you have a primary/foreign key type of relationship, you can use those keys to perform an SQL-like join.
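The key-based case mentioned above can be sketched with pandas' merge() function, which performs SQL-style joins on columns rather than on the index (the tables and column names here are invented for illustration):

```python
import pandas as pd

# A 'primary key' lookup table and a 'foreign key' fact table
severities = pd.DataFrame({'Severity_Code': [1, 2, 3],
                           'Severity': ['Fatal', 'Serious', 'Slight']})
records = pd.DataFrame({'Accident_Index': ['A1', 'A2'],
                        'Severity_Code': [3, 1]})

# SQL-like inner join on the shared key column
joined = records.merge(severities, on='Severity_Code', how='inner')
print(joined)
```

Here merge() matches rows on the Severity_Code column, much as an INNER JOIN would in SQL.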
```python
# Print out a nice report
print(" Data Quality Report")
print("Total records: {}".format(len(accidents.index)))
data_quality_report
```
At last, we print out the report and discover the quality of our data. For good measure, we also print out the total record count. This number provides the context for all the counts in our report.
Another thing to note is that we've included the minimum and maximum values of our object-type columns. They are included only because they are part of the report we are creating; when presenting this information to others, you can tell them to disregard the minimum and maximum values of those columns. We keep all the columns in the report, however, so we can see the other metrics for them.
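To see why those values are of limited use, note that min() and max() on an object (string) column simply compare values lexicographically; a toy sketch with invented data:

```python
import pandas as pd

# For object columns, min()/max() are alphabetical, not numeric
s = pd.Series(['Fine', 'Rain', 'Snow'], name='Weather')
print(s.min(), s.max())
```

The "minimum" and "maximum" here are just the first and last strings alphabetically, which rarely means anything for analysis.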
Extra Credit
In this recipe, we printed out the record count before we showed the report data. You could add a column to each DataFrame expressing each count as a percentage of the total record count. For example, the Longitude column has 4887131 missing values, which is 78.5 percent of all the records. Depending on the analysis you are going to perform, that many missing values could be a problem.
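A minimal sketch of that idea, using a small toy DataFrame in place of the accidents data (the column names and values are invented): it adds a 'Missing Percent' column alongside the missing counts.

```python
import pandas as pd

df = pd.DataFrame({'Longitude': [0.1, None, None, None],
                   'Date': ['1979-01-01', '1979-01-02',
                            '1979-01-03', '1979-01-04']})

missing_data_counts = pd.DataFrame(df.isnull().sum(),
                                   columns=['Missing Values'])

# Express each count as a percentage of the total record count
missing_data_counts['Missing Percent'] = (
    missing_data_counts['Missing Values'] / len(df.index) * 100
)
print(missing_data_counts)
```

The same one-liner could be applied to the present and unique value counts to round out the report.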