One of the first steps that business intelligence professionals perform on a new dataset is creating summary statistics. These statistics can be generated for an entire dataset or a part of it. In this recipe, you'll learn how to create summary statistics for the entire dataset.
Import pandas and create a DataFrame from the data file:
import pandas as pd

accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'
# tupleize_cols was removed in pandas 1.0, so it is no longer passed.
accidents = pd.read_csv(accidents_data_file,
                        sep=',',
                        header=0,
                        index_col=False,
                        parse_dates=['Date'],
                        dayfirst=True,
                        on_bad_lines='error',  # pandas 1.3+; replaces error_bad_lines/warn_bad_lines
                        skip_blank_lines=True)
Use the describe() function to generate summary stats for the entire dataset:
accidents.describe()
Transpose the output of describe() to make the results more readable:
accidents.describe().transpose()
We first import pandas and create a new DataFrame from the data file. Next, we use the describe() function provided by pandas to show, for each numeric column, the count, mean, standard deviation (std), minimum value, maximum value, and the 25th, 50th, and 75th percentiles (the quartiles):
accidents.describe()
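To see the shape of what describe() returns without loading the full accidents file, here is a minimal sketch on a small hand-made DataFrame. The column names and values are hypothetical, invented purely for illustration, and are not taken from the Stats19 data.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration.
toy = pd.DataFrame({
    'vehicles': [1, 2, 2, 3],
    'casualties': [1, 1, 2, 4],
    'road_type': ['A', 'B', 'A', 'C'],  # non-numeric, skipped by default
})

summary = toy.describe()

# One row per statistic, one column per numeric column.
print(list(summary.index))
print(list(summary.columns))
print(summary.loc['mean', 'vehicles'])
```

The result is itself a DataFrame, so individual statistics can be looked up with .loc, as in the last line.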
To read the results of describe() a bit more easily, we use the transpose() function to convert the columns into rows and the rows into columns:
accidents.describe().transpose()
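The effect of the transpose can be checked on a small example. The sketch below again uses a hypothetical toy DataFrame (names and values invented for illustration) and compares the shape of the summary before and after transposing.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame({
    'vehicles': [1, 2, 2, 3],
    'casualties': [1, 1, 2, 4],
})

wide = toy.describe()              # statistics as rows, data columns as columns
tall = toy.describe().transpose()  # one row per data column, statistics as columns

print(wide.shape)
print(tall.shape)
print(tall.loc['casualties', 'max'])
```

With many columns, the transposed layout tends to be easier to scan because each original column becomes a single row. The .T attribute is a shorthand for transpose().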