Until now, we have seen how to import and export data using mostly the tools provided in the Python standard library. Now, we'll see how to perform some of the operations shown above in just a few lines using the Pandas library. Pandas is an open source, BSD-licensed library that simplifies data import and manipulation by providing data structures and parsing functions.
We will demonstrate how to import, manipulate and export data using Pandas.
To be able to use the code in this section, we need to install Pandas. This can again be done using pip, as shown here:
pip install pandas
Here, we will again import the data from ch02-data.csv, add a new column to the original data, and export the result to CSV, as shown in the following code snippet:
import pandas as pd

data = pd.read_csv('ch02-data.csv')
data['amount_x_2'] = data['amount'] * 2
data.to_csv('ch02-data_more.csv')
First, we import Pandas in our environment, and then we use the function read_csv on the file that we want to read. This function automatically parses the CSV format and organizes the data in an indexed structure called a DataFrame. Then, we take the column amount, multiply each of its elements by two, and store the result in a new column called amount_x_2. Finally, we save the result into a new file named ch02-data_more.csv using the method to_csv. A DataFrame is a Pandas object that represents a table, and we can access its columns as shown in the following section.
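The steps just described can be tried without the book's data files. The following is a minimal sketch, not the original recipe: the sample contents are a hypothetical stand-in for ch02-data.csv (the amount values are taken from the output printed in this section, the day values are invented), and the files are written to a temporary directory. It also passes index=False to to_csv, which is an addition to the snippet above; without it, to_csv writes the row index as an extra unnamed first column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for ch02-data.csv: the 'amount' values come from the
# output shown in this section; the 'day' values are invented for illustration.
sample = "day,amount\n1,323\n2,233\n3,433\n4,555\n5,123\n6,0\n7,221\n"

# Work in a temporary directory so that no existing file is overwritten.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'ch02-data.csv')
dst = os.path.join(workdir, 'ch02-data_more.csv')

with open(src, 'w') as f:
    f.write(sample)

# Import, add the derived column, and export.
data = pd.read_csv(src)
data['amount_x_2'] = data['amount'] * 2

# index=False (an assumption, not in the original snippet) keeps the
# row index out of the exported file.
data.to_csv(dst, index=False)

# Read the exported file back to check what was written.
result = pd.read_csv(dst)
print(result)
```

Reading the exported file back is a quick way to confirm that the new column was written alongside the original ones.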
DataFrames are very handy structures; they're designed to be fast and easy to access. Each column that they contain becomes an attribute of the object that represents the data frame. For example, we can print the values in the column amount of the object data defined earlier as shown here:
>>> print data.amount
0    323
1    233
2    433
3    555
4    123
5      0
6    221
Name: amount, dtype: int64
We can also print the list of all the columns in a DataFrame, as shown in the following code:
>>> print data.columns
Index([u'day', u'amount'], dtype='object')
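Attribute access is a convenience on top of label-based access. As a small sketch, assuming a hypothetical three-row DataFrame built in memory rather than the book's file, the following shows that data.amount and data['amount'] return the same column. The bracket form is the more general of the two: it also works for column names that are not valid Python identifiers or that clash with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical in-memory table with the same column names as the book's data.
data = pd.DataFrame(
    {'day': [1, 2, 3], 'amount': [323, 233, 433]},
    columns=['day', 'amount'],  # fix the column order explicitly
)

# Two ways to reach the same column.
by_attribute = data.amount
by_label = data['amount']

# Series.equals compares the values element by element.
same = by_attribute.equals(by_label)
print(same)

# The column labels can be turned into a plain Python list.
column_names = list(data.columns)
print(column_names)
```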
Also, the function read_csv that we used to import the data has many parameters that we can use to deal with messy files and to parse particular data formats. For example, if the values in our file are delimited by spaces instead of commas, we can use the parameter delimiter to parse the data correctly. Here's an example where we import data from a file whose values are separated by a variable number of spaces, skip the first row, and specify our own custom header:
# r'\s+' is a regular expression that matches one or more whitespace characters
pd.read_csv('ch02-data.tab', skiprows=1, delimiter=r'\s+', names=['day', 'amount'])
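To see this kind of parsing in action without the data file, here is a minimal sketch that reads from an in-memory string. The text in raw is a hypothetical stand-in for the contents of ch02-data.tab, and the delimiter used is the regular expression \s+ (one or more whitespace characters), which is one way to express "a variable number of spaces".

```python
import io

import pandas as pd

# Hypothetical file contents: a header row followed by values separated
# by an uneven number of spaces.
raw = u"DAY   AMOUNT\n1  323\n2     233\n3 433\n"

parsed = pd.read_csv(
    io.StringIO(raw),          # read from memory instead of a file on disk
    skiprows=1,                # drop the header row present in the text
    delimiter=r'\s+',          # regex: one or more whitespace characters
    names=['day', 'amount'],   # supply our own column names
)
print(parsed)
```

Because skiprows=1 discards the header that is present in the text, the names passed through the names parameter become the column labels.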