- Read the movie dataset, set the movie title as the index, and select all the values in the actor_1_facebook_likes column that are not missing:
>>> movie = pd.read_csv('data/movie.csv', index_col='movie_title')
>>> fb_likes = movie['actor_1_facebook_likes'].dropna()
>>> fb_likes.head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 40000.0
Spectre 11000.0
The Dark Knight Rises 27000.0
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
- Let's use the describe method to get a sense of the distribution:
>>> fb_likes.describe(percentiles=[.1, .25, .5, .75, .9]).astype(int)
count 4909
mean 6494
std 15106
min 0
10% 240
25% 607
50% 982
75% 11000
90% 18000
max 640000
Name: actor_1_facebook_likes, dtype: int64
- Additionally, we may plot a histogram of this Series to visually inspect the distribution:
>>> fb_likes.hist()
- This is a poor visualization, and it is very difficult to get a sense of the distribution from it. The summary statistics from step 2, on the other hand, suggest that the data is highly skewed to the right, with many observations more than an order of magnitude greater than the median. Let's create criteria to test whether the number of likes is less than 20,000:
>>> criteria_high = fb_likes < 20000
>>> criteria_high.mean().round(2)
0.91
- About 91% of the movies have an actor 1 with fewer than 20,000 likes. We will now use the where method, which accepts a boolean condition. The default behavior is to return a Series the same size as the original but which has all the False locations replaced with a missing value:
>>> fb_likes.where(criteria_high).head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End NaN
Spectre 11000.0
The Dark Knight Rises NaN
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
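As a minimal sketch of the two ideas above, the following uses a small made-up Series (the values and the labels `a` to `e` are illustrative assumptions, not taken from the movie dataset). It shows that the mean of a boolean Series is the proportion of True values, and that `where` leaves a missing value exactly where the condition is False:

```python
import pandas as pd

# Toy values, made up for illustration (not from the movie dataset)
likes = pd.Series([1000.0, 40000.0, 11000.0, 27000.0, 131.0],
                  index=['a', 'b', 'c', 'd', 'e'])

mask = likes < 20000          # boolean Series aligned with likes
share_true = mask.mean()      # True counts as 1, False as 0

masked = likes.where(mask)    # keeps values where mask is True, NaN elsewhere

# The NaN positions are exactly the positions where the mask is False
same_positions = masked.isna().equals(~mask)
```

Here three of the five toy values are below 20,000, so `share_true` is 0.6, and `masked` has the same length and index as `likes`.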
- The second parameter to the where method, other, allows you to control the replacement value. Let's replace the values that fail the condition with 20,000 instead of a missing value:
>>> fb_likes.where(criteria_high, other=20000).head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 20000.0
Spectre 11000.0
The Dark Knight Rises 20000.0
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
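The effect of the other parameter can be sketched on a tiny made-up Series. The data and the `fallback` Series below are hypothetical names introduced only for this illustration. Besides a scalar, `other` can also be a Series, in which case the replacement values are taken from the matching index labels:

```python
import pandas as pd

# Toy values, made up for illustration (not from the movie dataset)
likes = pd.Series([1000.0, 40000.0, 131.0], index=['a', 'b', 'c'])
cond = likes < 20000

# Scalar replacement: every value failing the condition becomes 20000
capped = likes.where(cond, other=20000)

# Series replacement (hypothetical fallback values): failing positions
# take the value carried by the same index label in `fallback`
fallback = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
from_series = likes.where(cond, other=fallback)
```

Only label `b` fails the condition here, so it is the only position that changes in either result.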
- Similarly, we can create criteria to put a floor on the minimum number of likes. Here, we chain another where method and replace the values that do not meet the condition with 300:
>>> criteria_low = fb_likes > 300
>>> fb_likes_cap = (fb_likes.where(criteria_high, other=20000)
...                         .where(criteria_low, 300))
>>> fb_likes_cap.head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 20000.0
Spectre 11000.0
The Dark Knight Rises 20000.0
Star Wars: Episode VII - The Force Awakens 300.0
Name: actor_1_facebook_likes, dtype: float64
- The original Series and the modified Series have the same length:
>>> len(fb_likes), len(fb_likes_cap)
(4909, 4909)
- Let's make a histogram with the modified Series. With the data in a much tighter range, it should produce a better plot:
>>> fb_likes_cap.hist()
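For comparison, the chained where calls in this recipe can be checked against the Series `clip` method, which bounds values between a lower and an upper limit. The following is a sketch on made-up values, assuming no missing values are present (as is the case after `dropna`); it is not part of the recipe's own steps:

```python
import pandas as pd

# Toy values, made up for illustration (not from the movie dataset)
likes = pd.Series([1000.0, 40000.0, 11000.0, 27000.0, 131.0])

# Same pattern as the recipe: cap at 20000, then floor at 300
capped = (likes.where(likes < 20000, other=20000)
               .where(likes > 300, other=300))

# clip applies both bounds in a single call
clipped = likes.clip(lower=300, upper=20000)

results_match = capped.equals(clipped)
```

The two approaches differ when missing values are present: a comparison against NaN is False, so where would overwrite the missing value with the replacement, while clip leaves it as NaN.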