Pandas is an open-source, high-performance library that provides easy-to-use data structures and data analysis tools for Python. Pandas was created to aid in the analysis of time series data, and has become a standard in the Python community. Not only does it provide data structures, such as a Series and a DataFrame, that help with all aspects of data science, it also has built-in analysis methods which we'll use later in the book.
Before we can start cleaning and standardizing data using Pandas, we need to get the data into a Pandas DataFrame, the primary data structure of Pandas. You can think of a DataFrame like an Excel document—it has rows and columns. Once data is in a DataFrame, we can use the full power of Pandas to manipulate and query it.
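To make the spreadsheet analogy concrete, here is a minimal sketch of a small DataFrame built by hand. The column names, index labels, and values are invented for illustration and are not taken from the accidents dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
frame = pd.DataFrame(
    {"severity": [3, 2, 3], "vehicles": [2, 1, 4]},
    index=["A001", "A002", "A003"],
)

print(frame.shape)          # (number of rows, number of columns)
print(list(frame.columns))  # the column labels
print(frame.loc["A002"])    # one row, selected by its index label
```

Each key of the dictionary becomes a column, and the index labels play the role of row headers in a spreadsheet.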
Pandas provides a highly configurable function, read_csv(), that we'll use to import our data. On a modern laptop with 4+ GB of RAM, we can easily and quickly import the entire accidents dataset, more than 7 million rows.
The following code shows how to import a CSV file into a Pandas DataFrame:
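import pandas as pd
import numpy as np

# Relative path to the accidents CSV file
data_file = '../Data/Stats19-Data1979-2004/Accidents7904.csv'

# Read the CSV into a DataFrame, using the first row as column names
# and the first column as the index
raw_data = pd.read_csv(data_file,
                       header=0,
                       sep=',',
                       index_col=0,
                       encoding=None)

# Show the first five rows
print(raw_data.head())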
In order to use Pandas, we need to import it along with the numpy library:
import pandas as pd
import numpy as np
Next we set the path to our data file. In this case, I've used a relative path. I suggest using the full path to the file in production applications:
data_file = '../Data/Stats19-Data1979-2004/Accidents7904.csv'
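One way to follow the full-path advice without hard-coding a machine-specific location is to resolve the relative path at runtime. This is a sketch using the standard library's pathlib; the variable name full_path is an assumption introduced here, and the result depends on the directory the script is run from.

```python
from pathlib import Path

data_file = '../Data/Stats19-Data1979-2004/Accidents7904.csv'

# resolve() turns the relative path into an absolute one,
# based on the current working directory.
full_path = Path(data_file).resolve()
print(full_path)

# Checking up front gives a clearer message than a failed import later.
if not full_path.is_file():
    print('Data file not found at', full_path)
```

Printing the resolved path makes it easy to spot when a script is being run from an unexpected directory.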
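After that, we use the read_csv() function to import the data. We've passed a number of arguments to the function:
header=0 tells Pandas that the first row of the file holds the column names.
sep=',' sets the comma as the field delimiter.
index_col=0 uses the first column of the file as the DataFrame's index.
encoding=None leaves the file encoding at the Pandas default.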
Finally, we print out the top five rows of the DataFrame using the head() method:
print(raw_data.head())
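To try the same import and head() calls without the full accidents file, the sketch below reads a few lines of CSV text from memory. The column names and values are hypothetical stand-ins, not the real Stats19 fields; only the arguments mirror the ones used in this walkthrough.

```python
import io

import pandas as pd

# Hypothetical stand-in for the accidents file: invented columns and values.
csv_text = (
    "accident_id,severity,vehicles\n"
    "A001,3,2\n"
    "A002,2,1\n"
    "A003,3,4\n"
    "A004,1,2\n"
    "A005,3,1\n"
    "A006,2,3\n"
)

# header=0 takes column names from the first line;
# index_col=0 makes accident_id the index instead of a regular column.
sample = pd.read_csv(io.StringIO(csv_text), header=0, sep=',', index_col=0)

print(sample.head())   # first five rows by default
print(sample.head(2))  # pass a number to change how many rows are shown
```

Because read_csv() accepts file-like objects as well as paths, a small in-memory sample like this is a convenient way to experiment with arguments before running them against millions of rows.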
read_csv() isn't the only game in town
If you search the Internet for ways to import data into a Pandas DataFrame, you'll come across the from_csv() method. The from_csv() method is still available in Pandas 0.16.2; however, there are plans to deprecate it. To keep your code from breaking, use read_csv() instead.