We use Boolean indexing to filter or select parts of the data. The operators are as follows:
Operators |
Symbol |
---|---|
OR |
| |
AND |
& |
NOT |
~ |
These operators must be grouped using parentheses when used together. Using the earlier DataFrame from the previous section, here, we display the trading dates for which the NASDAQ closed above 4300:
In [311]: sharesIndexDataDF.ix[(sharesIndexDataDF['PriceType']=='close') & (sharesIndexDataDF['Nasdaq']>4300) ] Out[311]: PriceType Nasdaq S&P 500 Russell 2000 TradingDate 2014/02/27 close 4318.93 1854.29 1187.94 2014/02/28 close 4308.12 1859.45 1183.03 2 rows × 4 columns
You can also create Boolean conditions in which you use arrays to filter out parts of the data:
In [316]: highSelection=sharesIndexDataDF['PriceType']=='high' NasdaqHigh=sharesIndexDataDF['Nasdaq']<4300 sharesIndexDataDF.ix[highSelection & NasdaqHigh] Out[316]: TradingDate PriceType Nasdaq S&P 500 Russell 2000 2014/02/21 high 4284.85 1846.13 1168.43
Thus, the preceding code snippet displays the only date in the dataset for which the Nasdaq Composite index stayed below the 4300 level for the entire trading session.
These methods enable the user to achieve more with Boolean indexing than the standard operators used in the preceding sections. The isin
method takes a list of values and returns a Boolean array with True
at the positions within the Series or DataFrame that match the values in the list. This enables the user to check for the presence of one or more elements within a Series. Here is an illustration using Series:
In [317]:stockSeries=pd.Series(['NFLX','AMZN','GOOG','FB','TWTR']) stockSeries.isin(['AMZN','FB']) Out[317]:0 False 1 True 2 False 3 True 4 False dtype: bool
Here, we use the Boolean array to select a sub-Series containing the values that we're interested in:
In [318]: stockSeries[stockSeries.isin(['AMZN','FB'])] Out[318]: 1 AMZN 3 FB dtype: object
For our DataFrame example, we switch to a more interesting dataset for those of us who are of a biological anthropology bent, that of classifying Australian mammals (a pet interest of mine):
In [324]: australianMammals= {'kangaroo': {'Subclass':'marsupial', 'Species Origin':'native'}, 'flying fox' : {'Subclass':'placental', 'Species Origin':'native'}, 'black rat': {'Subclass':'placental', 'Species Origin':'invasive'}, 'platypus' : {'Subclass':'monotreme', 'Species Origin':'native'}, 'wallaby' : {'Subclass':'marsupial', 'Species Origin':'native'}, 'palm squirrel' : {'Subclass':'placental', 'Origin':'invasive'}, 'anteater': {'Subclass':'monotreme', 'Origin':'native'}, 'koala': {'Subclass':'marsupial', 'Origin':'native'} }
Some more information on mammals: Marsupials are pouched mammals, monotremes are egg-laying, and placentals give birth to live young. The source of this information is the following: http://en.wikipedia.org/wiki/List_of_mammals_of_Australia.
The source of the preceding image is Bennett's wallaby at http://bit.ly/NG4R7N.
In [328]: ozzieMammalsDF=pd.DataFrame(australianMammals) In [346]: aussieMammalsDF=ozzieMammalsDF.T; aussieMammalsDF Out[346]: Subclass Origin anteater monotreme native black rat placental invasive flying fox placental native kangaroo marsupial native koala marsupial native palm squirrel placental invasive platypus monotreme native wallaby marsupial native 8 rows × 2 columns
Let us try to select mammals that are native to Australia:
In [348]: aussieMammalsDF.isin({'Subclass':['marsupial'],'Origin':['native']}) Out[348]: Subclass Origin anteater False True black rat False False flying fox False True kangaroo True True koala True True palm squirrel False False platypus False True wallaby True True 8 rows × 2 columns
The set of values passed to isin
can be an array or a dictionary. That works somewhat, but we can achieve better results by creating a mask as a combination of the isin
and all()
methods:
In [349]: nativeMarsupials={'Mammal Subclass':['marsupial'], 'Species Origin':['native']} nativeMarsupialMask=aussieMammalsDF.isin(nativeMarsupials).all(True) aussieMammalsDF[nativeMarsupialMask] Out[349]: Subclass Origin kangaroo marsupial native koala marsupial native wallaby marsupial native 3 rows × 2 columns
Thus, we see that kangaroo, koala, and wallaby are the native marsupials in our dataset. The any()
method returns whether any element is True
in a Boolean DataFrame. The all()
method filters return whether all elements are True
in a Boolean DataFrame.
The source for this is http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.any.html.
The where()
method is used to ensure that the result of Boolean filtering is the same shape as the original data. First, we set the random number generator seed to 100 so that the user can generate the same values as shown next:
In [379]: np.random.seed(100) normvals=pd.Series([np.random.normal() for i in np.arange(10)]) normvals Out[379]: 0 -1.749765 1 0.342680 2 1.153036 3 -0.252436 4 0.981321 5 0.514219 6 0.221180 7 -1.070043 8 -0.189496 9 0.255001 dtype: float64 In [381]: normvals[normvals>0] Out[381]: 1 0.342680 2 1.153036 4 0.981321 5 0.514219 6 0.221180 9 0.255001 dtype: float64 In [382]: normvals.where(normvals>0) Out[382]: 0 NaN 1 0.342680 2 1.153036 3 NaN 4 0.981321 5 0.514219 6 0.221180 7 NaN 8 NaN 9 0.255001 dtype: float64
This method appears to be useful only in the case of a Series, as we get this behavior for free in the case of a DataFrame:
In [393]: np.random.seed(100) normDF=pd.DataFrame([[round(np.random.normal(),3) for i in np.arange(5)] for j in range(3)], columns=['0','30','60','90','120']) normDF Out[393]: 0 30 60 90 120 0 -1.750 0.343 1.153 -0.252 0.981 1 0.514 0.221 -1.070 -0.189 0.255 2 -0.458 0.435 -0.584 0.817 0.673 3 rows × 5 columns In [394]: normDF[normDF>0] Out[394]: 0 30 60 90 120 0 NaN 0.343 1.153 NaN 0.981 1 0.514 0.221 NaN NaN 0.255 2 NaN 0.435 NaN 0.817 0.673 3 rows × 5 columns In [395]: normDF.where(normDF>0) Out[395]: 0 30 60 90 120 0 NaN 0.343 1.153 NaN 0.981 1 0.514 0.221 NaN NaN 0.255 2 NaN 0.435 NaN 0.817 0.673 3 rows × 5 columns
The inverse operation of the where
method is mask:
In [396]: normDF.mask(normDF>0) Out[396]: 0 30 60 90 120 0 -1.750 NaN NaN -0.252 NaN 1 NaN NaN -1.070 -0.189 NaN 2 -0.458 NaN -0.584 NaN NaN 3 rows × 5 columns
To complete this chapter, we will discuss operations on indexes. We sometimes need to operate on indexes when we wish to re-align our data or select it in different ways. There are various operations:
The set_index
- allows for the creation of an index on an existing DataFrame and returns an indexed DataFrame.
As we have seen before:
In [939]: stockIndexDataDF=pd.read_csv('./stock_index_data.csv') In [940]: stockIndexDataDF Out[940]: TradingDate Nasdaq S&P 500 Russell 2000 0 2014/01/30 4123.13 1794.19 1139.36 1 2014/01/31 4103.88 1782.59 1130.88 2 2014/02/03 3996.96 1741.89 1094.58 3 2014/02/04 4031.52 1755.20 1102.84 4 2014/02/05 4011.55 1751.64 1093.59 5 2014/02/06 4057.12 1773.43 1103.93
Now, we can set the index as follows:
In [941]: stockIndexDF=stockIndexDataDF.set_index('TradingDate') In [942]: stockIndexDF Out[942]: Nasdaq S&P 500 Russell 2000 TradingDate 2014/01/30 4123.13 1794.19 1139.36 2014/01/31 4103.88 1782.59 1130.88 2014/02/03 3996.96 1741.89 1094.58 2014/02/04 4031.52 1755.20 1102.84 2014/02/05 4011.55 1751.64 1093.59 2014/02/06 4057.12 1773.43 1103.93
The reset_index
reverses set_index
:
In [409]: stockIndexDF.reset_index() Out[409]: TradingDate Nasdaq S&P 500 Russell 2000 0 2014/01/30 4123.13 1794.19 1139.36 1 2014/01/31 4103.88 1782.59 1130.88 2 2014/02/03 3996.96 1741.89 1094.58 3 2014/02/04 4031.52 1755.20 1102.84 4 2014/02/05 4011.55 1751.64 1093.59 5 2014/02/06 4057.12 1773.43 1103.93 6 rows × 4 columns