- Read in the movie dataset, set the index as the title, and then create a boolean Series matching all movies with a content rating of G and an IMDB score less than 4:
>>> import numpy as np
>>> import pandas as pd
>>> movie = pd.read_csv('data/movie.csv', index_col='movie_title')
>>> c1 = movie['content_rating'] == 'G'
>>> c2 = movie['imdb_score'] < 4
>>> criteria = c1 & c2
- Let's first pass these criteria to the .loc indexer to filter the rows:
>>> movie_loc = movie.loc[criteria]
>>> movie_loc.head()
- Let's check whether this DataFrame is exactly equal to the one generated directly from the indexing operator:
>>> movie_loc.equals(movie[criteria])
True
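The filtering steps so far can be tried without the movie file. The following is a minimal sketch on a small made-up DataFrame (the titles, ratings, and scores are invented for illustration and are not taken from the dataset). It builds the same kind of combined mask and checks that .loc and the indexing operator return identical rows:

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset; all values are invented.
toy = pd.DataFrame(
    {
        "content_rating": ["G", "PG", "G", "R"],
        "imdb_score": [3.5, 2.0, 7.1, 3.9],
    },
    index=pd.Index(["A", "B", "C", "D"], name="movie_title"),
)

# Each comparison yields a boolean Series aligned on the movie_title index.
c1 = toy["content_rating"] == "G"
c2 = toy["imdb_score"] < 4

# & combines the two masks element-wise; a row survives only if both are True.
criteria = c1 & c2

toy_loc = toy.loc[criteria]
print(toy_loc)

# The plain indexing operator filters rows the same way.
print(toy_loc.equals(toy[criteria]))
```

Only the row labeled A satisfies both conditions here, so both selections contain that single row.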
- Now let's attempt the same boolean indexing with the .iloc indexer:
>>> movie_iloc = movie.iloc[criteria]
ValueError: iLocation based boolean indexing cannot use an indexable as a mask
- It turns out that .iloc does not accept a boolean Series directly, because the Series carries an index. It does, however, accept an ndarray of booleans. To extract the array, use the values attribute:
>>> movie_iloc = movie.iloc[criteria.values]
>>> movie_iloc.equals(movie_loc)
True
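The difference between the two mask types can be reproduced on a few rows. This sketch assumes a toy DataFrame with invented values and string labels standing in for movie titles. The exact exception type raised by .iloc may depend on the pandas version and on the type of the index, so the example catches both ValueError and NotImplementedError:

```python
import pandas as pd

# Hypothetical data; the string index mimics movie titles.
toy = pd.DataFrame(
    {"imdb_score": [3.5, 2.0, 7.1]},
    index=["A", "B", "C"],
)
mask = toy["imdb_score"] < 4  # boolean Series that carries the string index

try:
    toy.iloc[mask]
    series_rejected = False
except (ValueError, NotImplementedError) as err:
    # .iloc refuses a boolean mask that has its own index.
    series_rejected = True
    print(type(err).__name__, err)

# Stripping the index leaves a plain NumPy array, which .iloc accepts.
by_values = toy.iloc[mask.values]
by_to_numpy = toy.iloc[mask.to_numpy()]  # same array via the to_numpy method

print(by_values.equals(toy.loc[mask]))
```

Both .values and .to_numpy() hand .iloc a bare array, so the selection is purely positional and matches the .loc result as long as the mask is in the same row order as the DataFrame.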
- Although not very common, it is possible to do boolean indexing to select particular columns. Here, we select all the columns that have a data type of 64-bit integers:
>>> criteria_col = movie.dtypes == np.int64
>>> criteria_col.head()
color                      False
director_name              False
num_critic_for_reviews     False
duration                   False
director_facebook_likes    False
dtype: bool
>>> movie.loc[:, criteria_col].head()
- As criteria_col is a Series, which always has an index, you must use the underlying ndarray to make it work with .iloc. The following produces the same result as the .loc selection in the previous step:
>>> movie.iloc[:, criteria_col.values].head()
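Column selection by data type can be checked on a small frame as well. In this sketch the column names and values are hypothetical, and the integer columns are built with an explicit int64 dtype so that the comparison does not depend on platform defaults:

```python
import numpy as np
import pandas as pd

# Hypothetical columns; two of them are explicitly 64-bit integers.
toy = pd.DataFrame(
    {
        "name": ["X", "Y"],
        "length": np.array([90.0, 120.0], dtype=np.float64),
        "votes": np.array([1000, 2500], dtype=np.int64),
        "likes": np.array([10, 0], dtype=np.int64),
    }
)

# dtypes is a Series indexed by column name, so the comparison
# produces one boolean per column.
criteria_col = toy.dtypes == np.int64

int_by_loc = toy.loc[:, criteria_col]           # label-aligned mask
int_by_iloc = toy.iloc[:, criteria_col.values]  # positional mask

print(list(int_by_loc.columns))
print(int_by_loc.equals(int_by_iloc))
```

The mask is placed after the comma because it describes columns, while the bare colon keeps every row.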
- A boolean Series may be used to select rows while columns are selected at the same time, with labels for .loc or integer positions for .iloc. Remember to put a comma between the row and column selections. Let's keep the row criteria and select content_rating, imdb_score, title_year, and gross:
>>> cols = ['content_rating', 'imdb_score', 'title_year', 'gross']
>>> movie.loc[criteria, cols].sort_values('imdb_score')
- The same operation may be replicated with .iloc, but you first need the integer location of each column:
>>> col_index = [movie.columns.get_loc(col) for col in cols]
>>> col_index
[20, 24, 22, 8]
>>> movie.iloc[criteria.values, col_index]
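As a rough check that the label-based and position-based selections agree, here is a sketch on invented data. The column order of this toy frame is arbitrary, so the positions returned by get_loc differ from those of the real dataset:

```python
import pandas as pd

# Hypothetical data; column order chosen arbitrarily for the example.
toy = pd.DataFrame(
    {
        "gross": [1.0e6, 5.0e5, 2.0e7, 3.0e6],
        "content_rating": ["G", "PG", "G", "G"],
        "title_year": [1999, 2005, 2010, 2001],
        "imdb_score": [3.5, 2.0, 7.1, 2.8],
    },
    index=["A", "B", "C", "D"],
)

# Parentheses are required because & binds more tightly than == and <.
criteria = (toy["content_rating"] == "G") & (toy["imdb_score"] < 4)
cols = ["content_rating", "imdb_score", "title_year", "gross"]

# Label-based: boolean Series for rows, column names for columns.
by_label = toy.loc[criteria, cols]

# Position-based: boolean ndarray for rows, integer positions for columns.
col_index = [toy.columns.get_loc(col) for col in cols]
by_position = toy.iloc[criteria.values, col_index]

print(col_index)
print(by_label.equals(by_position))
```

Translating names to positions with get_loc keeps the column order of cols, so the two results share the same layout as well as the same values.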