How to do it...

  1. Read in the movie dataset, set the index to the movie title, and inspect the first few rows:
>>> movie = pd.read_csv('data/movie.csv', index_col='movie_title')
>>> movie.head()
  2. Determine whether the duration of each movie is longer than two hours by using the greater-than comparison operator with the duration Series:
>>> movie_2_hours = movie['duration'] > 120
>>> movie_2_hours.head(10)
movie_title
Avatar                                         True
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
John Carter                                    True
Spider-Man 3                                   True
Tangled                                       False
Avengers: Age of Ultron                        True
Harry Potter and the Half-Blood Prince         True
Name: duration, dtype: bool
  3. We can now use this Series to determine the number of movies that are longer than two hours:
>>> movie_2_hours.sum()
1039
  4. To find the proportion of movies in the dataset longer than two hours (multiply by 100 for the percentage), use the mean method:
>>> movie_2_hours.mean()
0.2114
  5. Unfortunately, the output from step 4 is misleading. The duration column has a few missing values; if you look back at the DataFrame output from step 1, you will see that the last row is missing a value for duration. Because any comparison with a missing value (NaN) evaluates to False, the boolean condition in step 2 returns False for these rows, and they still count toward the denominator of the mean. We need to drop the missing values first, then evaluate the condition and take the mean:
>>> movie['duration'].dropna().gt(120).mean()
0.2112
  6. Use the describe method to output a few summary statistics on the boolean Series:
>>> movie_2_hours.describe()
count      4916
unique        2
top       False
freq       3877
Name: duration, dtype: object
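The effect of missing values on the sum and mean of a boolean Series can be reproduced on a small scale. The following is a minimal sketch that does not use the movie dataset: the five-row `duration` Series and its labels are made-up values chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for movie['duration'];
# the row labelled 'D' has a missing duration.
duration = pd.Series(
    [178.0, 169.0, 95.0, np.nan, 132.0],
    index=['A', 'B', 'C', 'D', 'E'],
    name='duration',
)

# Any comparison with NaN evaluates to False, so row 'D' is
# treated as "not longer than two hours".
over_2_hours = duration > 120

# True counts as 1 and False as 0 when summing or averaging.
count_long = over_2_hours.sum()

# The denominator here still includes the row with the missing value.
naive_share = over_2_hours.mean()

# Dropping missing values first removes that row from the denominator.
clean_share = duration.dropna().gt(120).mean()

print(count_long, naive_share, clean_share)
```

In this toy example, three of the five rows exceed 120, so the naive share is 3/5, while the share computed after dropping the missing row is 3/4. The count of long movies is the same either way; only the denominator changes.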