How to do it...

Read the movie dataset, set the movie title as the index, and select all the values in the actor_1_facebook_likes column that are not missing:

>>> movie = pd.read_csv('data/movie.csv', index_col='movie_title')
>>> fb_likes = movie['actor_1_facebook_likes'].dropna()
>>> fb_likes.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      40000.0
Spectre                                       11000.0
The Dark Knight Rises                         27000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

Let's use the describe method to get a sense of the distribution:

>>> fb_likes.describe(percentiles=[.1, .25, .5, .75, .9]).astype(int)
count      4909
mean       6494
std       15106
min           0
10%         240
25%         607
50%         982
75%       11000
90%       18000
max      640000
Name: actor_1_facebook_likes, dtype: int64
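
One rough way to read a summary like this is to compare the mean with the median (the 50% row): a mean far above the median hints at a long right tail. The following is a minimal sketch on made-up numbers, not the movie data; the Series `likes` and the name `mean_over_median` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical like counts with one extreme value (not from movie.csv)
likes = pd.Series([0, 100, 200, 300, 400, 500, 1000, 5000, 20000, 640000])

# Same percentiles as in the recipe
summary = likes.describe(percentiles=[.1, .25, .5, .75, .9])

# A ratio well above 1 suggests the mean is being pulled up by large values
mean_over_median = summary['mean'] / summary['50%']

print(summary.astype(int))
print(mean_over_median)
```

The ratio is only a quick heuristic; it does not replace looking at the percentiles themselves.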

Additionally, we may plot a histogram of this Series to visually inspect the distribution:

>>> fb_likes.hist()

This histogram is a poor visualization; it is very difficult to get a sense of the distribution from it. The summary statistics from step 2, on the other hand, suggest that the data is highly skewed to the right, with many observations more than an order of magnitude greater than the median. Let's create criteria to test whether the number of likes is less than 20,000:

>>> criteria_high = fb_likes < 20000
>>> criteria_high.mean().round(2)
0.91
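
Taking the mean of a boolean Series works as a proportion because True counts as 1 and False as 0. A small sketch on made-up values (the Series `likes` and the variable names are hypothetical) shows that the mean matches a manual count of True values divided by the length:

```python
import pandas as pd

# Hypothetical like counts (not from movie.csv)
likes = pd.Series([100, 500, 1000, 25000, 40000])

below_cap = likes < 20000          # boolean Series: three True, two False
share_below = below_cap.mean()     # fraction of True values

# The same fraction computed by hand
manual_share = below_cap.sum() / len(below_cap)

print(share_below, manual_share)
```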

About 91% of the movies have an actor 1 with fewer than 20,000 likes. We will now use the where method, which accepts a boolean condition. By default, it returns a Series the same size as the original, with every location where the condition is False replaced by a missing value:

>>> fb_likes.where(criteria_high).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End          NaN
Spectre                                       11000.0
The Dark Knight Rises                             NaN
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

The second parameter to the where method, other, allows you to control the replacement value. Let's replace the values that fail the condition with 20,000 instead of a missing value:

>>> fb_likes.where(criteria_high, other=20000).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64
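
The two behaviors of where can be reproduced without the CSV file. The sketch below reuses the five values shown in the output above; the single-letter index labels and the variable names are placeholders chosen for illustration.

```python
import pandas as pd

# The five values from the head() output, with placeholder index labels
likes = pd.Series([1000.0, 40000.0, 11000.0, 27000.0, 131.0],
                  index=['a', 'b', 'c', 'd', 'e'])

keep = likes < 20000

# Default: positions where the condition is False become NaN
masked = likes.where(keep)

# With other: positions where the condition is False become 20000
capped = likes.where(keep, other=20000)

print(masked)
print(capped)
```

In both cases the result keeps the original index and length; only the values at the False positions change.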

Similarly, we can create criteria to put a floor on the minimum number of likes. Here, we chain another where method and replace the values that do not meet the condition with 300:

>>> criteria_low = fb_likes > 300
>>> fb_likes_cap = (fb_likes.where(criteria_high, other=20000)
...                         .where(criteria_low, 300))
>>> fb_likes_cap.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      300.0
Name: actor_1_facebook_likes, dtype: float64

The length of the original Series and modified Series is the same:

>>> len(fb_likes), len(fb_likes_cap)
(4909, 4909)
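
The chained where calls can be checked on a small Series. The sketch below reuses the five values from the earlier output and, as a cross-check that is not part of this recipe's steps, compares the result with the pandas Series.clip method, which limits values to a lower and upper bound. The variable names are hypothetical.

```python
import pandas as pd

# The five values from the head() output, default integer index
likes = pd.Series([1000.0, 40000.0, 11000.0, 27000.0, 131.0])

# Ceiling at 20000, then floor at 300, both conditions built from the original values
capped = (likes.where(likes < 20000, other=20000)
               .where(likes > 300, other=300))

# Cross-check with clip using the same bounds
clipped = likes.clip(lower=300, upper=20000)
same = capped.equals(clipped)

print(capped)
print(same, len(likes), len(capped))
```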

Let's make a histogram with the modified Series. With the data in a much tighter range, it should produce a better plot:

>>> fb_likes_cap.hist()
