Read in the movie dataset, set the index to the movie title, and inspect the first few rows:
>>> movie = pd.read_csv('data/movie.csv', index_col='movie_title')
>>> movie.head()
Determine whether each movie is longer than two hours by applying the greater-than comparison operator to the duration Series:
>>> movie_2_hours = movie['duration'] > 120
>>> movie_2_hours.head(10)
movie_title
Avatar True
Pirates of the Caribbean: At World's End True
Spectre True
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
John Carter True
Spider-Man 3 True
Tangled False
Avengers: Age of Ultron True
Harry Potter and the Half-Blood Prince True
Name: duration, dtype: bool
We can now use this Series to determine the number of movies that are longer than two hours:
>>> movie_2_hours.sum()
1039
To find the proportion of movies in the dataset longer than two hours, use the mean method. The result is a fraction between 0 and 1, so 0.2114 corresponds to roughly 21%:
>>> movie_2_hours.mean()
0.2114
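The sum and mean calls above work because pandas treats True as 1 and False as 0 in arithmetic. The following is a minimal sketch of that behavior on a small, made-up Series; the values and the names `durations`, `is_long`, `n_long`, and `share_long` are hypothetical and are not taken from the movie dataset.

```python
import pandas as pd

# Hypothetical durations in minutes (illustrative values only)
durations = pd.Series([150, 95, 130, 88, 121], name='duration')

# Element-wise comparison yields a boolean Series
is_long = durations > 120

# True counts as 1 and False as 0, so sum() counts the True values
n_long = is_long.sum()

# mean() is the sum divided by the number of entries:
# the fraction of entries that are True
share_long = is_long.mean()

print(n_long)      # 3
print(share_long)  # 0.6
```

Because the mean divides by the total number of entries, every row contributes to the denominator, whatever its value.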
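Unfortunately, the output from step 4 is misleading. The duration column has a few missing values. If you look back at the DataFrame output from step 1, you will see that the last row is missing a value for duration. The boolean condition in step 2 returns False for this row, because any comparison with a missing value (NaN) evaluates to False. Movies with an unknown duration are therefore counted as not longer than two hours, which pulls the mean down. We need to drop the missing values first, then evaluate the condition and take the mean: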