pandas

Let's talk about pandas, one of the most exciting Python libraries, especially for people who love R and want to work with data in a vectorized manner. We will devote this part of the chapter to pandas, discussing basic data manipulation and handling with pandas DataFrames.

Reading data

Let's start with one of the most important tasks in any data analysis: parsing the data from a CSV or other file.

To begin, please download the data to your local storage from the preceding links, and load it into a pandas DataFrame, as shown here:

>>>import pandas as pd
>>># Please provide the absolute path of the input file
>>>data = pd.read_csv("PATH\iris.data.txt", header=0)
>>>data.head()
 

   4.9  3.0  1.4  0.2  Iris-setosa
0  4.7  3.2  1.3  0.2  Iris-setosa
1  4.6  3.1  1.5  0.2  Iris-setosa
2  5.0  3.6  1.4  0.2  Iris-setosa

This will read the CSV file and store it in a DataFrame. Now, there are many options you have while reading a CSV file. One problem here is that the first line of the data was read as the header; since this file has no header row, we need to set the header option to None and pass a list of names to use as column names. If the CSV already has a proper header in its first line, we don't need to worry, because pandas, by default, assumes the first line to be the header. The header=0 in the preceding code is the row number that will be treated as the header.

So let's use the same data, and add the header into the frame:

>>>data = pd.read_csv("PATH\iris.data.txt", names=["sepal length", "sepal width", "petal length", "petal width", "Cat"], header=None)
>>>data.head()
 

   sepal length  sepal width  petal length  petal width          Cat
0           4.9          3.0           1.4          0.2  Iris-setosa
1           4.7          3.2           1.3          0.2  Iris-setosa
2           4.6          3.1           1.5          0.2  Iris-setosa

This has assigned the given column names to the frame; if your file does have headers in its first row, you can drop the names and header options, and pandas will detect the first row of the file as the header. The other common option is sep/delimiter, which specifies the delimiter used to separate the columns. There are at least 20 different options available, which can be used to optimize the way we read and cleanse our data, for example removing NAs, skipping blank lines, and indexing based on a specific column. Please have a look at the readers for the different types of files (a sketch of some common read_csv options follows the list):

  • read_csv: reads a CSV file.
  • read_excel: reads an XLS/XLSX file.
  • read_hdf: reads an HDF5 file.
  • read_sql: reads from a SQL database.
  • read_json: reads a JSON file.

These can substitute for all the different parsing methods we discussed in Chapter 2, Text Wrangling and Cleansing. A similar set of options is available for writing files too.
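As a hedged sketch of the options mentioned above (the NA markers and index column here are illustrative assumptions, not known properties of the iris file):

>>># Common read_csv options: explicit delimiter, custom NA markers,
>>># blank-line skipping, and indexing by the first column
>>>data = pd.read_csv("PATH\iris.data.txt", sep=",", na_values=["?", "NA"],
...                   skip_blank_lines=True, index_col=0)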

Now let's see the power of pandas DataFrames. If you are an R programmer, you will love to see the equivalents of R's summary and head here.

>>>data.describe()

The describe() function will give you a brief statistical summary of each numeric column.
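By default, describe() skips non-numeric columns; as a sketch, recent pandas versions can include the categorical column as well:

>>># Summarize every column, including the non-numeric 'Cat' column
>>>data.describe(include='all')

To count the unique values of a single column, use value_counts():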

>>>sepal_len_cnt=data['sepal length'].value_counts()
>>>sepal_len_cnt

5.0    10
6.3     9
6.7     8
5.7     8
5.1     8
dtype: int64
>>>data['Cat'].value_counts()
Iris-versicolor    50
Iris-virginica     50
Iris-setosa        48
dtype: int64

Again, for R lovers: we are dealing with vectors here, so we can test each value of a column with a vectorized comparison like this:

>>>data['Cat'] == 'Iris-setosa'
0      True
1      True
...
147    False
148    False
Name: Cat, Length: 149, dtype: bool

Now we can use this boolean Series to filter the DataFrame. Here, sntsosa will contain only the rows related to Iris-setosa:

>>>sntsosa=data[data['Cat'] == 'Iris-setosa']
>>>sntsosa[:5]

Filtering like this is the pandas equivalent of a SQL WHERE clause. For our typical SQL GROUP BY, pandas provides groupby() with all kinds of aggregate functions, as sketched below.
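A minimal groupby sketch, assuming the 'Cat' and 'sepal length' columns from the frame above:

>>># Average sepal length per category, analogous to
>>># SELECT Cat, AVG("sepal length") FROM data GROUP BY Cat
>>>data.groupby('Cat')['sepal length'].mean()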

Note

You can browse through the following link to look at Dow Jones data:

https://archive.ics.uci.edu/ml/machine-learning-databases/00312/

Series data

pandas also has a neat way of indexing by date and then using the frame for all sorts of time-series analysis. The best part is that once we have indexed the data by date, some of the most painful date operations will be just one command away. Let's take a look at series data, such as stock price data for a few stocks, and how the values of the opening and closing stock price change weekly.

>>>import pandas as pd
>>>stockdata = pd.read_csv("dow_jones_index.data", parse_dates=['date'], index_col=['date'], nrows=100)
>>>stockdata.head()

            quarter stock    open    high     low   close     volume  percent_change_price
date
01/07/2011        1    AA  $15.82  $16.72  $15.78  $16.42  239655616               3.79267
01/14/2011        1    AA  $16.71  $16.71  $15.64  $15.97  242963398              -4.42849
01/21/2011        1    AA  $16.19  $16.38  $15.60  $15.79  138428495              -2.47066

>>>max(stockdata['volume'])
   1453438639
>>>max(stockdata['percent_change_price'])
   7.6217399999999991
>>>stockdata.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-07, ..., 2011-01-28]
Length: 100, Freq: None, Timezone: None
>>>stockdata.index.day
array([ 7, 14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4,
       11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4])

The preceding command gives the day of the month for each date.

>>>stockdata.index.month

The preceding command gives the month of each date.

>>>stockdata.index.year

The preceding command gives the year of each date.
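Since the frame is indexed by date, we can also slice it with partial date strings; a minimal sketch, assuming the DatetimeIndex created by the read above:

>>># Select all the rows from January 2011 via partial string indexing
>>>stockdata.loc['2011-01'].head()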

You can aggregate the data using resample() with whatever aggregation you want; it could be sum, mean, median, min, or max.

>>># In recent pandas, the aggregation is chosen by method name;
>>># older versions spelled this as resample('M', how=np.sum)
>>>stockdata['volume'].resample('M').sum()
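As a sketch, the same pattern covers the other aggregations mentioned above:

>>># Weekly mean and monthly max of trading volume
>>>stockdata['volume'].resample('W').mean()
>>>stockdata['volume'].resample('M').max()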

Column transformation

Say we want to filter out columns or add a column. We can drop columns by providing a list of column names with axis=1, like this (note that drop() returns a new DataFrame; assign the result if you want to keep it):

>>>stockdata.drop(["percent_change_volume_over_last_wk"],axis=1)

Let's filter out some of the unwanted columns, and work with a limited set of columns. We can create a new DataFrame like this:

>>>stockdata_new = pd.DataFrame(stockdata, columns=["stock","open","high","low","close","volume"])
>>>stockdata_new.head()

We can also run R-like operations on the columns, such as renaming them or deriving new ones. Say I want to add a column with a default value; I can do something like this:

>>>stockdata["previous_weeks_volume"] = 0

This creates the column (or overwrites it, if it already exists) and sets every value to 0. We can also assign conditionally, create derived variables in place, and rename columns, as sketched below.
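The renaming and conditional assignment mentioned above might look like the following minimal sketch; the pct_change name and the went_up flag are illustrative assumptions:

>>>import numpy as np
>>># rename() returns a new DataFrame with the given columns renamed
>>>renamed = stockdata.rename(columns={"percent_change_price": "pct_change"})
>>># Conditionally derive a variable: flag the weeks where the price rose
>>>renamed["went_up"] = np.where(renamed["pct_change"] > 0, 1, 0)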

Noisy data

A typical day in the life of a data scientist starts with data cleaning: removing noise, cleaning unwanted files, making sure that date formats are correct, ignoring noisy records, and dealing with missing values. Typically, the biggest chunk of time is spent on data cleansing rather than on any other activity.

In a real-world scenario, the data is messy in most cases, and we have to deal with missing values, null values, NAs, and other formatting issues. One of the major features of any data library is to address all these problems efficiently, and pandas provides some amazing features for doing so.

>>>stockdata.head()
>>>stockdata.dropna().head(2)

The preceding command gets rid of all the rows that contain NAs in our data (dropna() returns a new DataFrame; the original is left untouched).

            quarter stock    open    high     low   close     volume  percent_change_price
date
01/14/2011        1    AA  $16.71  $16.71  $15.64  $15.97  242963398              -4.42849
01/21/2011        1    AA  $16.19  $16.38  $15.60  $15.79  138428495              -2.47066
01/28/2011        1    AA  $15.87  $16.63  $15.82  $16.13  151379173               1.63831
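Instead of dropping the incomplete rows, we could fill the missing values; a minimal sketch (the fill value 0 is an arbitrary choice):

>>># Fill NAs with a constant instead of dropping the rows
>>>stockdata.fillna(0).head(2)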

You may also have noticed that we have a $ symbol in front of the price values, which makes numeric operations hard because the columns are stored as strings. Let's get rid of it, as it will otherwise give us noisy results; for example, in the describe() output below, we only get counts and the most frequent value ($43.86) instead of numeric statistics:

>>>stockdata_new.open.describe()
count        100
unique        99
top       $43.86
freq           2
Name: open, dtype: object

Let's strip the $ symbol and convert the columns to numeric types; we can then perform operations on two columns and derive a new variable from them:

>>># Strip the leading $ and convert the price columns to floats
>>>stockdata_new.open = stockdata_new.open.str.replace('$', '', regex=False).astype(float)
>>>stockdata_new.close = stockdata_new.close.str.replace('$', '', regex=False).astype(float)
>>>stockdata_new.close - stockdata_new.open
>>>stockdata_new.open.describe()
count    100.000000
mean      51.286800
std       32.154889
min       13.710000
25%       17.705000
50%       46.040000
75%       72.527500
max      106.900000
Name: open, dtype: float64

We can also perform arithmetic operations on the columns, and create new variables out of them:

>>>stockdata_new['newopen'] = stockdata_new.open.apply(lambda x: 0.8 * x)
>>>stockdata_new.newopen.head(5)

We can filter the data on the value of a column in this way too. For example, let's pull out the rows for one of the companies among all those we have stock values for:

>>>stockAA = stockdata_new.query('stock=="AA"')
>>>stockAA.head()
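query() here is just a convenient spelling of ordinary boolean indexing; the following sketch should produce the same rows:

>>># Equivalent boolean-mask filter
>>>stockAA = stockdata_new[stockdata_new['stock'] == 'AA']
>>>stockAA.head()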

To summarize, we have seen some useful pandas functions for data reading, cleaning, manipulation, and aggregation in this section. In the next section, we will try to use some of these DataFrames to generate visualizations from this data.
