Merging two datasets in Pandas

To show a consolidated view of the data contained in two datasets, you need to merge them. Pandas has built-in functionality for performing SQL-like joins of two DataFrames.

Getting ready

Create two DataFrames, one each from the accident and casualty datasets:

import pandas as pd

accidents_data_file = 'Accidents7904.csv'
casualty_data_file = 'Casualty7904.csv'

# base_file_path is assumed to be defined already and to point to the
# directory that contains the two CSV files
af = base_file_path + accidents_data_file
# Create a DataFrame from the accidents data
accidents = pd.read_csv(af,
                        sep=',',
                        header=0,
                        index_col=0,          # use the first column as the index
                        parse_dates=False,
                        # on_bad_lines (pandas 1.3+) replaces the removed
                        # error_bad_lines/warn_bad_lines options
                        on_bad_lines='skip',
                        skip_blank_lines=True,
                        nrows=1000            # read only the first 1,000 rows
                        )
# Create a DataFrame from the casualty data
cf = base_file_path + casualty_data_file
casualties = pd.read_csv(cf,
                         sep=',',
                         header=0,
                         index_col=0,
                         parse_dates=False,
                         on_bad_lines='skip',
                         skip_blank_lines=True,
                         nrows=1000
                         )

How to do it…

merged_data = pd.merge(accidents,
                       casualties,
                       how='left',
                       left_index=True,
                       right_index=True)

How it works…

We first create two DataFrames, one for the accidents data and another for the casualty data. Each of these datasets contains millions of rows; so, for this recipe, we import only the first thousand records from each file. Because we pass index_col=0, the index of each DataFrame is the first column of the data, which in this case is the accident index.
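To make the effect of index_col and nrows concrete, here is a minimal sketch that reads a small in-memory CSV instead of the real files. The column names and values are invented for illustration and are not taken from the actual accident dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the first few lines of a CSV file
csv_text = (
    "Accident_Index,Severity,Vehicles\n"
    "A001,3,2\n"
    "A002,2,1\n"
    "A003,3,3\n"
)

# index_col=0 turns the first column into the index;
# nrows=2 stops reading after the first two data rows
sample = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0, nrows=2)

print(sample.index.name)     # name of the column that became the index
print(list(sample.columns))  # remaining data columns
print(len(sample))           # number of rows actually read
```

The first column no longer appears among the data columns; it is available only through the index.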

With the DataFrames created, we perform an SQL-like left join using the merge() function. We pass it the two DataFrames to be joined, the type of join (how='left'), and left_index=True and right_index=True, which tell Pandas to use the index of both the left (accidents) and the right (casualties) DataFrames as the join key. Pandas then matches rows from the two DataFrames by their index values. A merge needs a key that is present in both DataFrames; because we loaded both files with the accident index as the DataFrame index, the index serves as that key.
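The behavior of a left join on the index can be seen on a toy example. The sketch below uses two small hand-made DataFrames (hypothetical names and values, not the real data) in which one accident has two casualties, two accidents have none, and one casualty refers to an accident that is missing from the left DataFrame.

```python
import pandas as pd

# Left DataFrame: one row per accident
accidents_demo = pd.DataFrame(
    {"Severity": [3, 2, 3]},
    index=pd.Index(["A001", "A002", "A003"], name="Accident_Index"),
)

# Right DataFrame: one row per casualty, so the index can repeat
casualties_demo = pd.DataFrame(
    {"Casualty_Class": [1, 2, 1]},
    index=pd.Index(["A001", "A001", "A004"], name="Accident_Index"),
)

merged_demo = pd.merge(accidents_demo,
                       casualties_demo,
                       how="left",
                       left_index=True,
                       right_index=True)

print(merged_demo)
```

Every accident from the left DataFrame is kept. A001 appears twice because it matches two casualty rows, A002 and A003 get missing values in Casualty_Class because they have no match, and A004 is dropped because it exists only on the right.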

The result is a single DataFrame containing all the columns from both the DataFrames. merge() can be used with any two DataFrames as long as they share a join key, that is, a column or an index whose values appear in both DataFrames.
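When the key is stored as an ordinary column rather than the index, the same join can be expressed with the on parameter, or with left_on and right_on if the column is named differently in each DataFrame. The following is a sketch with made-up data; the column name Acc_Ref is purely hypothetical.

```python
import pandas as pd

accidents_cols = pd.DataFrame({
    "Accident_Index": ["A001", "A002"],
    "Severity": [3, 2],
})
casualties_cols = pd.DataFrame({
    "Accident_Index": ["A001", "A001", "A002"],
    "Casualty_Class": [1, 2, 3],
})

# Key column has the same name in both DataFrames
by_column = pd.merge(accidents_cols, casualties_cols,
                     how="left", on="Accident_Index")

# Key column is named differently on the right side
casualties_renamed = casualties_cols.rename(
    columns={"Accident_Index": "Acc_Ref"})
by_two_names = pd.merge(accidents_cols, casualties_renamed,
                        how="left",
                        left_on="Accident_Index", right_on="Acc_Ref")

print(by_column)
print(by_two_names)
```

With on, the key appears once in the result; with left_on and right_on, both key columns are kept.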
