How to do it...

Read in the movie dataset, set the index to the movie title, and inspect the first few rows:

>>> import pandas as pd
>>> movie = pd.read_csv('data/movie.csv', index_col='movie_title')
>>> movie.head()

Determine whether each movie is longer than two hours by applying the greater-than comparison operator (>) to the duration Series, which is measured in minutes:

>>> movie_2_hours = movie['duration'] > 120
>>> movie_2_hours.head(10)
movie_title
Avatar                                         True
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
John Carter                                    True
Spider-Man 3                                   True
Tangled                                       False
Avengers: Age of Ultron                        True
Harry Potter and the Half-Blood Prince         True
Name: duration, dtype: bool
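As a minimal sketch of what the comparison does, the snippet below uses a small made-up Series (the titles and durations are hypothetical, not taken from the movie dataset). It also shows that the `>` operator and the `gt` method, which appears later in this recipe, should give the same boolean Series.

```python
import pandas as pd

# Hypothetical durations in minutes, indexed by made-up titles
duration = pd.Series([178, 95, 136, 100],
                     index=['A', 'B', 'C', 'D'],
                     name='duration')

# The comparison is applied element by element and returns a boolean Series
with_operator = duration > 120

# The gt method is the method form of the same comparison
with_method = duration.gt(120)

print(with_operator)
print(with_operator.equals(with_method))
```

The method form is convenient when chaining several calls, since it avoids wrapping the comparison in parentheses.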

We can now use this Series to determine the number of movies that are longer than two hours:

>>> movie_2_hours.sum()
1039

To find the proportion of movies in the dataset longer than two hours, use the mean method (multiply the result by 100 for a percentage):

>>> movie_2_hours.mean()
0.2114
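The sum and mean methods work on a boolean Series because pandas treats True as 1 and False as 0. The following sketch illustrates this on a hypothetical five-element Series; the variable names are chosen only for this example.

```python
import pandas as pd

# Hypothetical boolean flags: three True values out of five
flags = pd.Series([True, False, True, True, False])

# True counts as 1 and False as 0, so the sum is the number of True values
n_true = flags.sum()

# Inverting the Series with ~ counts the False values instead
n_false = (~flags).sum()

# The mean is the number of True values divided by the length of the Series
share_true = flags.mean()

print(n_true, n_false, share_true)
```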

Unfortunately, the proportion returned by the mean method above is misleading. The duration column has a few missing values; if you look back at the DataFrame output from the first step, you will see that the last row shown is missing a value for duration. Any comparison against a missing value (NaN) evaluates to False, so the boolean condition counts these movies as not longer than two hours instead of excluding them. To correct this, drop the missing values first, then evaluate the condition and take the mean:

>>> movie['duration'].dropna().gt(120).mean()
0.2112
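To see why missing values distort the result, consider this small sketch with one NaN among four hypothetical durations. The values are invented for illustration only; the point is that the missing entry is counted as False unless it is dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes; the third value is missing
duration = pd.Series([150, 90, np.nan, 130], name='duration')

# NaN > 120 evaluates to False, so the missing row stays in the denominator
naive = duration > 120
naive_share = naive.mean()          # 2 of 4 rows

# Dropping the missing value first removes it from the denominator
clean_share = duration.dropna().gt(120).mean()   # 2 of 3 rows

print(naive.tolist())
print(naive_share, clean_share)
```

Because the rows with missing values leave the denominator while the count of True values stays the same, the corrected proportion in this sketch is larger than the naive one.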

Use the describe method to output a few summary statistics on the boolean Series:

>>> movie_2_hours.describe()
count      4916
unique        2
top       False
freq       3877
Name: duration, dtype: object
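The describe output for a boolean Series reports the number of non-missing values, the number of distinct values, the most frequent value, and how often it occurs. The sketch below reproduces this on a hypothetical six-element Series and, as one possible alternative, uses value_counts with normalize=True to get the share of each value directly.

```python
import pandas as pd

# Hypothetical boolean flags: four False values and two True values
flags = pd.Series([True, False, False, False, True, False], name='duration')

# For boolean data, describe returns count, unique, top, and freq
summary = flags.describe()

# value_counts with normalize=True gives the relative frequency of each value
shares = flags.value_counts(normalize=True)

print(summary)
print(shares)
```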
