In this chapter, we will focus on the indexing and selection of data from pandas objects. This is important since effective use of pandas requires a good knowledge of the indexing and selection of data. The topics that we will address in this chapter include the following:
We have already discussed basic indexing on Series and DataFrames in the previous chapter, but here we will include some examples for the sake of completeness. Here, we list a time series of crude oil spot prices for the 4 quarters of 2013, taken from IMF data: http://www.imf.org/external/np/res/commod/pdf/monthly/011014.pdf.
In [642]:SpotCrudePrices_2013_Data={ 'U.K. Brent' : {'2013-Q1':112.9, '2013-Q2':103.0, '2013-Q3':110.1, '2013-Q4':109.4}, 'Dubai':{'2013-Q1':108.1, '2013-Q2':100.8, '2013-Q3':106.1,'2013-Q4':106.7}, 'West Texas Intermediate':{'2013-Q1':94.4, '2013-Q2':94.2, '2013-Q3':105.8,'2013-Q4':97.4}} SpotCrudePrices_2013=pd.DataFrame.from_dict(SpotCrudePrices_2013_Data) SpotCrudePrices_2013 Out[642]: Dubai U.K. Brent West Texas Intermediate 2013-Q1 108.1 112.9 94.4 2013-Q2 100.8 103.0 94.2 2013-Q3 106.1 110.1 105.8 2013-Q4 106.7 109.4 97.4
We can select the prices for the available time periods of Dubai crude oil by using the []
operator:
In [644]: dubaiPrices=SpotCrudePrices_2013['Dubai']; dubaiPrices Out[644]: 2013-Q1 108.1 2013-Q2 100.8 2013-Q3 106.1 2013-Q4 106.7 Name: Dubai, dtype: float64
We can pass a list of columns to the []
operator in order to select the columns in a particular order:
In [647]: SpotCrudePrices_2013[['West Texas Intermediate','U.K. Brent']] Out[647]: West Texas Intermediate U.K. Brent 2013-Q1 94.4 112.9 2013-Q2 94.2 103.0 2013-Q3 105.8 110.1 2013-Q4 97.4 109.4
If we specify a column that is not listed in the DataFrame, we will get a KeyError
exception:
In [649]: SpotCrudePrices_2013['Brent Blend'] -------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-649-cd2d76b24875> in <module>() ... KeyError: u'no item named Brent Blend'
We can avoid this error by using the get
operator and specifying a default value in the case when the column is not present, as follows:
In [650]: SpotCrudePrices_2013.get('Brent Blend','N/A') Out[650]: 'N/A'
Hence, we get an error in the following case:
In [755]:SpotCrudePrices_2013['2013-Q1'] -------------------------------------------------- KeyError Traceback (most recent call last) ... KeyError: u'no item named 2013-Q1'
This was a design decision made by the creators in order to avoid ambiguity. In the case of a Series, where there is no ambiguity, selecting rows by using the []
operator works:
In [756]: dubaiPrices['2013-Q1'] Out[756]: 108.1
We shall see how we can perform row selection by using one of the newer indexing operators later in this chapter.
One can retrieve values from a Series, DataFrame, or Panel directly as an attribute as follows:
In [650]: SpotCrudePrices_2013.Dubai Out[650]: 2013-Q1 108.1 2013-Q2 100.8 2013-Q3 106.1 2013-Q4 106.7 Name: Dubai, dtype: float64
However, this only works if the index element is a valid Python identifier as follows:
In [653]: SpotCrudePrices_2013."West Texas Intermediate" File "<ipython-input-653-2a782563c15a>", line 1 SpotCrudePrices_2013."West Texas Intermediate" ^ SyntaxError: invalid syntax
Otherwise, we get SyntaxError
as in the preceding case because of the space in the column name. A valid Python identifier must follow the following lexical convention:
identifier::= (letter|"_") (letter | digit | "_")*
Thus, a valid Python identifier cannot contain a space. See the Python Lexical Analysis documents for more details at http://docs.python.org/2.7/reference/lexical_analysis.html#identifiers.
We can resolve this by renaming the column index names so that they are all valid identifiers:
In [654]: SpotCrudePrices_2013 Out[654]: Dubai U.K. Brent West Texas Intermediate 2013-Q1 108.1 112.9 94.4 2013-Q2 100.8 103.0 94.2 2013-Q3 106.1 110.1 105.8 2013-Q4 106.7 109.4 97.4 In [655]:SpotCrudePrices_2013.columns=['Dubai','UK_Brent', 'West_Texas_Intermediate'] SpotCrudePrices_2013 Out[655]: Dubai UK_Brent West_Texas_Intermediate 2013-Q1 108.1 112.9 94.4 2013-Q2 100.8 103.0 94.2 2013-Q3 106.1 110.1 105.8 2013-Q4 106.7 109.4 97.4
We can then select the prices for West Texas Intermediate as desired:
In [656]:SpotCrudePrices_2013.West_Texas_Intermediate Out[656]:2013-Q1 94.4 2013-Q2 94.2 2013-Q3 105.8 2013-Q4 97.4 Name: West_Texas_Intermediate, dtype: float64
We can also select prices by specifying a column index number to select column 1 (U.K. Brent) as follows:
In [18]: SpotCrudePrices_2013[[1]] Out[18]: U.K. Brent 2013-Q1 112.9 2013-Q2 103.0 2013-Q3 110.1 2013-Q4 109.4
As we saw in the section on NumPy ndarrays in Chapter 3, The pandas Data structures, we can slice a range by using the []
operator. The syntax of the slicing operator exactly matches that of NumPy:
ar[startIndex: endIndex: stepValue]
where the default values if not specified are as follows:
For a DataFrame, []
slices across rows as follows:
Obtain the first 2 rows:
In [675]: SpotCrudePrices_2013[:2] Out[675]: Dubai UK_Brent West_Texas_Intermediate 2013-Q1 108.1 112.9 94.4 2013-Q2 100.8 103.0 94.2
Obtain all rows starting from index 2:
In [662]: SpotCrudePrices_2013[2:] Out[662]: Dubai UK_Brent West_Texas_Intermediate 2013-Q3 106.1 110.1 105.8 2013-Q4 106.7 109.4 97.4
Obtain rows at intervals of two, starting from row 0:
In [664]: SpotCrudePrices_2013[::2] Out[664]: Dubai UK_Brent West_Texas_Intermediate 2013-Q1 108.1 112.9 94.4 2013-Q3 106.1 110.1 105.8
Reverse the order of rows in DataFrame:
In [677]: SpotCrudePrices_2013[::-1] Out[677]: Dubai UK_Brent West_Texas_Intermediate 2013-Q4 106.7 109.4 97.4 2013-Q3 106.1 110.1 105.8 2013-Q2 100.8 103.0 94.2 2013-Q1 108.1 112.9 94.4
For a Series, the behavior is just as intuitive:
In [666]: dubaiPrices=SpotCrudePrices_2013['Dubai']
Obtain the last 3 rows or all rows but the first:
In [681]: dubaiPrices[1:] Out[681]: 2013-Q2 100.8 2013-Q3 106.1 2013-Q4 106.7 Name: Dubai, dtype: float64
Obtain all rows but the last:
In [682]: dubaiPrices[:-1] Out[682]: 2013-Q1 108.1 2013-Q2 100.8 2013-Q3 106.1 Name: Dubai, dtype: float64
Reverse the rows:
In [683]: dubaiPrices[::-1] Out[683]: 2013-Q4 106.7 2013-Q3 106.1 2013-Q2 100.8 2013-Q1 108.1 Name: Dubai, dtype: float64