Importing a CSV file into a Pandas DataFrame

Pandas is an open-source, high-performance library that provides easy-to-use data structures and data analysis tools for Python. Pandas was created to aid in the analysis of time series data, and has become a standard in the Python community. Not only does it provide data structures, such as a Series and a DataFrame, that help with all aspects of data science, it also has built-in analysis methods which we'll use later in the book.

Before we can start cleaning and standardizing data using Pandas, we need to get the data into a Pandas DataFrame, the primary data structure of Pandas. You can think of a DataFrame like an Excel document—it has rows and columns. Once data is in a DataFrame, we can use the full power of Pandas to manipulate and query it.
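To make the spreadsheet analogy concrete, here is a minimal sketch that builds a tiny DataFrame by hand and queries it. The column names and values are invented for illustration and are not taken from the accidents dataset.

```python
import pandas as pd

# Hypothetical data: made-up values, not from the accidents dataset.
df = pd.DataFrame(
    {
        "severity": [3, 2, 3],
        "vehicles": [1, 2, 4],
    },
    index=["A1", "A2", "A3"],  # row labels, like the row headers in a spreadsheet
)

# Select one column (a Series) and one row by its label.
severity = df["severity"]
row_a2 = df.loc["A2"]

# Query: keep only the rows where more than one vehicle was involved.
multi_vehicle = df[df["vehicles"] > 1]

print(df.shape)  # (number of rows, number of columns)
print(multi_vehicle)
```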

Getting ready

Pandas provides a highly configurable function—read_csv()—that we'll use to import our data. On a modern laptop with 4+ GB of RAM, we can easily and quickly import the entire accidents dataset, more than 7 million rows.
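If memory is tighter than that, read_csv() can also read a file in pieces. The sketch below uses a small in-memory CSV as a stand-in for a large file (its columns and values are invented) and shows the nrows and chunksize parameters.

```python
import io
import pandas as pd

# Stand-in for a large CSV file; contents are invented for illustration.
csv_text = "id,severity\n" + "\n".join(f"{i},{i % 3 + 1}" for i in range(10))

# Preview: parse only the first three data rows.
preview = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Stream: iterate over the file four rows at a time and count the rows.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)

print(len(preview), total_rows)
```

Reading in chunks trades speed for a smaller memory footprint, so it is worth considering only when the whole file does not fit comfortably in RAM.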

How to do it…

The following code shows how to import a CSV file into a Pandas DataFrame:

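import pandas as pd
import numpy as np
# Path to the accidents data, relative to the working directory
data_file = '../Data/Stats19-Data1979-2004/Accidents7904.csv'
# Import the CSV file into a DataFrame with read_csv()
raw_data = pd.read_csv(data_file,
                       header=0,
                       sep=',',
                       index_col=0,
                       encoding=None,
                       # tupleize_cols was removed in pandas 1.0; omit it there
                       tupleize_cols=False)
# Show the first five rows
print(raw_data.head())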

How it works…

To use Pandas, we import it under the conventional alias pd. We also import the numpy library, although this recipe's code doesn't call it directly:

import pandas as pd
import numpy as np

Next, we set the path to our data file. In this case, I've used a relative path, which is resolved against the current working directory. I suggest using the full path to the file in production applications:

data_file = '../Data/Stats19-Data1979-2004/Accidents7904.csv'
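One way to follow that advice while keeping the relative path in the source is to resolve it at run time. This is a sketch using the standard library's pathlib module; the variable name full_path is introduced here for illustration.

```python
from pathlib import Path

data_file = '../Data/Stats19-Data1979-2004/Accidents7904.csv'

# Resolve the relative path against the current working directory.
full_path = Path(data_file).resolve()

print(full_path)
# Check that the file is there before trying to import it.
print(full_path.exists())
```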

After that, we call the read_csv() function to import the data, passing it a number of arguments:

header: The row number (counting from 0) to use as the column names
sep: The delimiter that separates the fields in each row; a comma in our file
index_col: The column to use as the row labels of the DataFrame
encoding: The text encoding to use when reading the file, for example 'utf-8'; None leaves Pandas on its default
tupleize_cols: Whether to leave a list of tuples on the columns as is, rather than converting it to a MultiIndex (this argument was removed in pandas 1.0)
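To see how header, sep, and index_col change the result, here is a small sketch that reads an in-memory CSV instead of the real file. The sample text, including its column names, is invented for illustration.

```python
import io
import pandas as pd

# Invented sample: a semicolon-separated file with a title line above the header.
csv_text = (
    "Sample export\n"
    "Accident_Index;Severity;Vehicles\n"
    "A1;3;1\n"
    "A2;2;2\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    header=1,      # column names are on the second line (row numbers start at 0)
    sep=';',       # fields are separated by semicolons
    index_col=0,   # use the first column as the row labels
)

print(df)
print(df.index.name)
```

Because the first column becomes the index, it no longer appears in df.columns; rows are looked up by its values, for example df.loc["A2"].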

Finally, we print out the top five rows of the DataFrame using the head() method:

print(raw_data.head())

There's more…

Tip

read_csv() isn't the only game in town

If you search the Internet for ways to import data into a Pandas DataFrame, you'll come across the from_csv() method. The from_csv() method is still available in Pandas 0.16.2; however, there are plans to deprecate it (it was later deprecated in version 0.21.0 and removed in 1.0). To keep your code from breaking, use read_csv() instead.
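When porting older code that calls from_csv(), keep in mind that its defaults differed from those of read_csv(): it used the first column as the index and tried to parse that index as dates. The sketch below shows a roughly equivalent read_csv() call on invented in-memory data.

```python
import io
import pandas as pd

# Invented sample data with dates in the first column.
csv_text = "date,count\n2004-01-01,5\n2004-01-02,7\n"

# Roughly what DataFrame.from_csv(path) did by default:
# first column as the index, with that index parsed as dates.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(df.index)
```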
