Let's talk about pandas, one of the most exciting Python libraries, especially for people who love R and want to work with data in a more vectorized manner. We will devote this part of the chapter to pandas, covering basic data manipulation and handling with pandas DataFrames.
Let's start with one of the most important tasks in any data analysis: parsing the data from a CSV or other file.
I am using https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names
Feel free to use any other CSV file.
To begin, please download the data to your local storage from the preceding links, and load it into a pandas DataFrame, as shown here:
>>>import pandas as pd
>>># Please provide the absolute path of the input file
>>>data = pd.read_csv("PATH\iris.data.txt", header=0)
>>>data.head()
| | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
|---|---|---|---|---|---|
| 0 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 1 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 2 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
This will read the CSV file and store it in a DataFrame. Now, there are many options available when reading a CSV file. One problem here is that the first line of data was read as the header; since this file has no header row, we need to set the header option to None, and pass a list of names to use as column names. If the CSV already has the header in its first line, we don't need to worry, as pandas, by default, assumes the first line to be the header. The header=0 in the preceding code is the row number that will be treated as the header.
So let's use the same data, and add the header into the frame:
>>>data = pd.read_csv("PATH\iris.data.txt", names=["sepal length", "sepal width", "petal length", "petal width", "Cat"], header=None)
>>>data.head()
| | sepal length | sepal width | petal length | petal width | Cat |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
This has assigned our own column names to the frame; in case your file has headers in the first row, you can drop the names and header options, and pandas will detect the first row of the file as the header. Another common option is sep/delimiter, which specifies the delimiter used to separate the columns. There are at least 20 different options available, which can be used to optimize the way we read and cleanse our data, for example removing NaNs, skipping blank lines, and indexing on a specific column. pandas also provides readers for different file types:
- `read_csv`: reads a CSV file
- `read_excel`: reads an XLS file
- `read_hdf`: reads an HDF5 file
- `read_sql`: reads from a SQL database
- `read_json`: reads a JSON file

These can be substitutes for all the different parsing methods we discussed in Chapter 2, *Text Wrangling and Cleansing*. The same number of options is available for writing files too.
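To make the parsing options concrete, here is a minimal sketch using an in-memory CSV (the sample values and the `na_values` strings are illustrative, not from the UCI file):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a real file; the options shown
# (sep, names, header, na_values, skip_blank_lines) are among the most
# commonly tuned when parsing messy data.
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "\n"                            # blank line, dropped by skip_blank_lines
    "4.9,?,1.4,0.2,Iris-setosa\n"   # '?' treated as missing via na_values
)
df = pd.read_csv(
    raw,
    sep=",",
    names=["sepal length", "sepal width", "petal length", "petal width", "Cat"],
    header=None,
    na_values=["?"],
    skip_blank_lines=True,
)
print(df.shape)                          # (2, 5)
print(df["sepal width"].isna().sum())    # 1
```

The blank line is silently dropped and the `?` becomes a proper NaN, so downstream numeric operations behave correctly.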
Now let's see the power of pandas DataFrames. If you are an R programmer, you will love the equivalents of R's summary() and head() functions.
>>>data.describe()
The describe() function will give you a brief statistical summary of each column, along with counts of unique values.
>>>sepal_len_cnt = data['sepal length'].value_counts()
>>>sepal_len_cnt
5.0    10
6.3     9
6.7     8
5.7     8
5.1     8
dtype: int64
>>>data['Iris-setosa'].value_counts()
Iris-versicolor    50
Iris-virginica     50
Iris-setosa        48
dtype: int64
Again for R lovers: we are now dealing with vectors, so we can look at each value of a column using a comparison like this:
>>>data['Iris-setosa'] == 'Iris-setosa'
0       True
1       True
...
147    False
148    False
Name: Iris-setosa, Length: 149, dtype: bool
Now we can filter the DataFrame. Here, sntsosa will contain only the entries related to `Iris-setosa`:
>>>sntsosa = data[data['Cat'] == 'Iris-setosa']
>>>sntsosa[:5]
This is the equivalent of a typical SQL WHERE clause. pandas also gives us SQL-style Group By functionality, along with all kinds of aggregate functions.
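As a minimal sketch of the Group By functionality mentioned above (on a toy frame, with illustrative values rather than the full iris data):

```python
import pandas as pd

# A toy frame to illustrate SQL-style GROUP BY with an aggregate;
# the values here are illustrative, not taken from the iris file.
df = pd.DataFrame({
    "Cat": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
    "petal length": [1.4, 1.3, 4.7],
})

# Equivalent to: SELECT Cat, AVG(`petal length`) FROM df GROUP BY Cat
means = df.groupby("Cat")["petal length"].mean()
print(means)
```

Swapping `mean()` for `sum()`, `min()`, `max()`, or `agg(...)` gives the other aggregates.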
You can browse through the following link to look at Dow Jones data:
https://archive.ics.uci.edu/ml/machine-learning-databases/00312/
pandas also has a neat way of indexing by date, so that the frame can then be used for all sorts of time series analysis. The best part is that, once we have indexed the data by date, some of the most painful date operations are just a command away. Let's take a look at time series data, such as stock price data for a few stocks, and how the values of the opening and closing prices change weekly.
>>>import pandas as pd
>>>stockdata = pd.read_csv("dow_jones_index.data", parse_dates=['date'], index_col=['date'], nrows=100)
>>>stockdata.head()
| date | quarter | stock | open | high | low | close | volume | percent_change_price |
|---|---|---|---|---|---|---|---|---|
| 01/07/2011 | 1 | AA | $15.82 | $16.72 | $15.78 | $16.42 | 239655616 | 3.79267 |
| 01/14/2011 | 1 | AA | $16.71 | $16.71 | $15.64 | $15.97 | 242963398 | -4.42849 |
| 01/21/2011 | 1 | AA | $16.19 | $16.38 | $15.60 | $15.79 | 138428495 | -2.47066 |
>>>max(stockdata['volume'])
1453438639
>>>max(stockdata['percent_change_price'])
7.6217399999999991
>>>stockdata.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-07, ..., 2011-01-28]
Length: 100, Freq: None, Timezone: None
>>>stockdata.index.day
array([ 7, 14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4,
       11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4])
The preceding command gives the day of the month for each date.
>>>stockdata.index.month
The preceding command gives the month for each date.
>>>stockdata.index.year
The preceding command gives the year for each date.
You can aggregate the data using a resample with whatever aggregation you want. It could be sum, mean, median, min, or max.
>>>stockdata.resample('M').sum()
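As a self-contained sketch of date indexing and resampling (using synthetic weekly dates standing in for the Dow Jones file; `'MS'` here means bins anchored at the start of each month):

```python
import pandas as pd

# Synthetic weekly closing prices indexed by date,
# standing in for the real stock data.
idx = pd.to_datetime(["2011-01-07", "2011-01-14", "2011-01-21", "2011-01-28"])
prices = pd.DataFrame({"close": [16.42, 15.97, 15.79, 16.13]}, index=idx)

# Monthly aggregation: mean here, but sum, median, min, or max
# work the same way.
monthly_mean = prices.resample("MS")["close"].mean()
print(monthly_mean)
```

All four sample dates fall in January 2011, so the result collapses to a single monthly bin.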
Say we want to filter out columns or add a column. We can achieve this by passing a list of column names, with axis=1 telling pandas to operate on columns. We can drop columns from a DataFrame like this:
>>>stockdata.drop(["percent_change_volume_over_last_wk"],axis=1)
Let's filter out some of the unwanted columns, and work with a limited set of columns. We can create a new DataFrame like this:
>>>stockdata_new = pd.DataFrame(stockdata, columns=["stock", "open", "high", "low", "close", "volume"])
>>>stockdata_new.head()
We can also run R-like operations on the columns. Say I want to add a new column with a constant value. I can do something like this:
>>>stockdata["previous_weeks_volume"] = 0
This will set all the values in the new column to 0. We can also do this conditionally, and create derived variables in place.
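A sketch of creating a derived variable conditionally (on a toy frame; `np.where` is one common idiom for this, and the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame: flag weeks where the close beat the open.
df = pd.DataFrame({"open": [15.82, 16.71], "close": [16.42, 15.97]})

# A derived column from arithmetic on two existing columns ...
df["gain"] = df["close"] - df["open"]

# ... and a conditional column built from it.
df["direction"] = np.where(df["gain"] > 0, "up", "down")
print(df["direction"].tolist())  # ['up', 'down']
```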
A typical day in the life of a data scientist starts with data cleaning: removing noise, cleaning unwanted files, making sure that date formats are correct, ignoring noisy records, and dealing with missing values. Typically, the biggest chunk of time is spent on data cleansing rather than on any other activity.
In a real-world scenario, the data is messy in most cases, and we have to deal with missing values, null values, NaNs, and other formatting issues. So one of the major features of any data library is dealing with all these problems efficiently. pandas provides some amazing features for this.
>>>stockdata.head()
>>>stockdata.dropna().head(2)
Using the preceding command, we get rid of all the rows containing NaNs in our data.
| date | quarter | stock | open | high | low | close | volume | percent_change_price |
|---|---|---|---|---|---|---|---|---|
| 01/14/2011 | 1 | AA | $16.71 | $16.71 | $15.64 | $15.97 | 242963398 | -4.42849 |
| 01/21/2011 | 1 | AA | $16.19 | $16.38 | $15.60 | $15.79 | 138428495 | -2.47066 |
| 01/28/2011 | 1 | AA | $15.87 | $16.63 | $15.82 | $16.13 | 151379173 | 1.63831 |
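Dropping rows is not the only option; missing values can also be filled in place. A minimal sketch with fillna (toy values, not from the stock file):

```python
import numpy as np
import pandas as pd

# A toy column with a missing value; fillna substitutes a default,
# here the column mean, instead of discarding the whole row.
s = pd.Series([16.42, np.nan, 15.79])
filled = s.fillna(s.mean())
print(filled.isna().sum())  # 0
```

Filling with the mean (or median, or a sentinel like 0) keeps the row count intact, which matters when rows carry other useful columns.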
You may also have noticed the `$` symbol in front of the values, which makes numeric operations hard and gives us noisy results (for example, $43.86 should not be among the top values here):
>>>stockdata_new.open.describe()
count       100
unique       99
top      $43.86
freq          2
Name: open, dtype: object
Let's strip the `$` symbol and convert these columns to numeric; then we can perform operations on two columns and derive a new variable:
>>>stockdata_new.open = pd.to_numeric(stockdata_new.open.str.replace('$', '', regex=False))
>>>stockdata_new.close = pd.to_numeric(stockdata_new.close.str.replace('$', '', regex=False))
>>>(stockdata_new.close - stockdata_new.open).head()
>>>stockdata_new.open.describe()
count    100.000000
mean      51.286800
std       32.154889
min       13.710000
25%       17.705000
50%       46.040000
75%       72.527500
max      106.900000
Name: open, dtype: float64
We can also perform some arithmetic operations, and create new variables out of it.
>>>stockdata_new['newopen'] = stockdata_new.open.apply(lambda x: 0.8 * x)
>>>stockdata_new.newopen.head(5)
We can filter the data on the value of a column in this way too. For example, let's filter out a dataset for one of the companies among all those that we have the stock values for.
>>>stockAA = stockdata_new.query('stock=="AA"')
>>>stockAA.head()
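The query() call above is equivalent to boolean-mask indexing, which we used earlier on the iris data. A sketch showing both on a toy frame (illustrative values):

```python
import pandas as pd

# Toy frame with two tickers; both selections below return the same rows.
df = pd.DataFrame({
    "stock": ["AA", "AA", "IBM"],
    "close": [16.42, 15.97, 161.0],
})
via_query = df.query('stock == "AA"')
via_mask = df[df["stock"] == "AA"]
print(via_query.equals(via_mask))  # True
```

query() reads more like SQL, while the mask form composes more easily with complex boolean logic; the results are identical.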
To summarize, we have seen some useful functions related to data reading, cleaning, manipulation, and aggregation in this section on pandas. In the next section, we will try to use some of these DataFrames to generate visualizations from this data.