Importing a CSV file into a Pandas DataFrame

Pandas is an open-source, high-performance library that provides easy-to-use data structures and data analysis tools for Python. Pandas was created to aid in the analysis of time series data, and has become a standard in the Python community. Not only does it provide data structures, such as a Series and a DataFrame, that help with all aspects of data science, it also has built-in analysis methods which we'll use later in the book.

Before we can start cleaning and standardizing data using Pandas, we need to get the data into a Pandas DataFrame, the primary data structure of Pandas. You can think of a DataFrame as an Excel spreadsheet: it has rows and columns. Once data is in a DataFrame, we can use the full power of Pandas to manipulate and query it.
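To make the rows-and-columns picture concrete, here is a minimal sketch that builds a small DataFrame by hand and queries it. The column names and values are hypothetical and are not taken from the accidents dataset.

```python
import pandas as pd

# Hypothetical sample values, for illustration only.
df = pd.DataFrame(
    {
        "city": ["Leeds", "York", "Hull"],
        "accidents": [12, 7, 9],
    }
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the column names

# A simple query: keep only the rows where accidents exceeds 8.
busy = df[df["accidents"] > 8]
print(busy)
```

Each dictionary key becomes a column and each list position becomes a row, which is why the spreadsheet comparison works well as a first mental model.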

Getting ready

Pandas provides a highly configurable function—read_csv()—that we'll use to import our data. On a modern laptop with 4+ GB of RAM, we can easily and quickly import the entire accidents dataset, more than 7 million rows.
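As one illustration of that configurability, the sketch below uses the nrows and chunksize parameters of read_csv() to read only part of a file, or to read it in pieces. This is an optional technique for machines with less memory, not something the recipe itself requires, and the in-memory CSV text is a hypothetical stand-in for a large file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file.
csv_text = "id,severity\n1,3\n2,2\n3,3\n4,1\n5,2\n"

# nrows limits how many data rows are read, which is handy for a quick preview.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)

# chunksize makes read_csv() return an iterator of smaller DataFrames
# instead of a single large one.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)

print(len(preview), total_rows)
```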

How to do it…

The following code shows how to import a CSV file into a Pandas DataFrame:

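import pandas as pd
import numpy as np
# Relative path to the accidents dataset
data_file = '../Data/Stats19-Data1979-2004/Accidents7904.csv'
# Read the CSV file, using the first row as the column names
# and the first column as the row labels
raw_data = pd.read_csv(data_file,
                       header=0,
                       sep=',',
                       index_col=0,
                       encoding=None,
                       tupleize_cols=False)  # removed in pandas 1.0; omit it on newer versions
# Display the first five rows
print(raw_data.head())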

How it works…

In order to use Pandas, we need to import it along with the numpy library:

import pandas as pd
import numpy as np

Next we set the path to our data file. In this case, I've used a relative path. I suggest using the full path to the file in production applications:

data_file = '../Data/Stats19-Data1979-2004/Accidents7904.csv'
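One way to obtain a full path without typing it by hand is to resolve the relative path with the standard library's pathlib module. This is a sketch of one option, not the approach used in the recipe; the relative_path name is introduced here for illustration.

```python
from pathlib import Path

# The same relative path used in the recipe.
relative_path = Path('../Data/Stats19-Data1979-2004/Accidents7904.csv')

# resolve() turns it into an absolute path based on the current
# working directory; the file does not need to exist for this to work.
data_file = relative_path.resolve()

print(data_file)
print(data_file.is_absolute())
```

Note that the result depends on the directory the script is run from, so a production application would typically take the base directory from configuration instead.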

After that, we use the read_csv() function to import the data. We've passed a number of arguments to the function:

  • header: The row number (zero-based) to use as the column names
  • sep: The delimiter that separates the values in each row
  • index_col: The column to use as the row labels of the DataFrame
  • encoding: The character encoding to use when reading the file (for example, 'utf-8')
  • tupleize_cols: Whether to leave a list of tuples on the columns as is, rather than converting it to a MultiIndex
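The effect of these arguments is easier to see on a tiny file. The sketch below reads hypothetical semicolon-separated data from memory, with a title line above the header row. It leaves out tupleize_cols, since that argument is not accepted by pandas 1.0 and later.

```python
import io
import pandas as pd

# Hypothetical data: a title line, then a header row, then two records.
csv_bytes = "accident report\nref;severity;vehicles\nA1;3;2\nA2;2;1\n".encode("utf-8")

df = pd.read_csv(
    io.BytesIO(csv_bytes),
    header=1,          # the second line (row 1, zero-based) holds the column names
    sep=";",           # values are separated by semicolons, not commas
    index_col=0,       # use the first column ("ref") as the row labels
    encoding="utf-8",  # how to decode the bytes being read
)

print(df)
print(df.loc["A1", "severity"])
```

Because index_col=0 turns the ref column into the index, it no longer appears among the regular columns, and rows can be looked up by label with loc.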

Finally, we print out the top five rows of the DataFrame using the head() method:

print(raw_data.head())
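head() returns five rows by default, and it also accepts the number of rows to return. A short sketch on a hypothetical ten-row DataFrame:

```python
import pandas as pd

# Hypothetical ten-row DataFrame, for illustration only.
df = pd.DataFrame({"value": range(10)})

first_five = df.head()    # default: the first five rows
first_three = df.head(3)  # ask for a specific number of rows

print(first_five)
print(first_three)
```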

There's more…

Tip

read_csv() isn't the only game in town

If you search the Internet for ways to import data into a Pandas DataFrame, you'll come across the from_csv() method. The from_csv() method is still available in Pandas 0.16.2; however, there are plans to deprecate it. To keep your code from breaking, use read_csv() instead.
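When replacing an existing from_csv() call, keep in mind that from_csv() used the first column as the index and tried to parse it as dates by default. A read_csv() call with index_col=0 and parse_dates=True gives similar behavior. The sketch below demonstrates this on hypothetical in-memory data.

```python
import io
import pandas as pd

# Hypothetical data with dates in the first column.
csv_text = "date,count\n2004-01-01,5\n2004-01-02,8\n"

# index_col=0 and parse_dates=True mirror the defaults of from_csv().
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(df.index)
print(df)
```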
