How it works...

After importing the data and identifying the three entities, we must create a unique identifier for each observation so that we can link the movies, actors, and directors together once they have been separated into different tables. In step 2, we simply set the id column as the row number, beginning from zero. In step 3, we use the wide_to_long function to simultaneously melt the actor and director columns. It uses the integer suffix of the columns to align the data vertically and places this integer suffix in the index. The parameter j controls the name of that index level. The values in the columns not in the stubnames list repeat to align with the columns that were melted.
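Steps 2 and 3 can be sketched on a tiny, made-up frame. The column names and values below are assumptions chosen for illustration and are not the recipe's actual dataset.

```python
import pandas as pd

# Hypothetical miniature of the movie data (names and numbers are made up).
movie = pd.DataFrame({
    "title": ["Avatar", "Spectre"],
    "director_1": ["James Cameron", "Sam Mendes"],
    "director_fb_likes_1": [0.0, 0.0],
    "actor_1": ["CCH Pounder", "Christoph Waltz"],
    "actor_2": ["Joel David Moore", "Rory Kinnear"],
    "actor_fb_likes_1": [1000.0, 11000.0],
    "actor_fb_likes_2": [936.0, 393.0],
})

# Step 2: the unique identifier is the row number, starting from zero.
movie.insert(0, "id", range(len(movie)))

# Step 3: melt every group of suffixed columns at once.
# The integer suffix ends up in the index level named by j.
stubnames = ["director", "director_fb_likes", "actor", "actor_fb_likes"]
movie_long = pd.wide_to_long(
    movie, stubnames=stubnames, i="id", j="num", sep="_"
).reset_index()

print(movie_long)
```

Because there is no director_2 column in this sketch, the rows with num equal to 2 hold missing values in the director columns, while title repeats for every row of the same id.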

In step 4, we create our three new tables, keeping the id column in each. We also keep the num column to identify the exact director/actor column from which it was derived. Step 5 condenses each table by removing duplicates and missing values.
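A minimal sketch of steps 4 and 5 follows, assuming a long-format frame shaped like the output of wide_to_long. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format frame (made-up values).
movie_long = pd.DataFrame({
    "id": [0, 0, 1, 1],
    "num": [1, 2, 1, 2],
    "title": ["Avatar", "Avatar", "Spectre", "Spectre"],
    "director": ["James Cameron", None, "Sam Mendes", None],
    "actor": ["CCH Pounder", "Joel David Moore",
              "Christoph Waltz", "Rory Kinnear"],
})

# Step 4: one table per entity, each keeping the id column.
# num records which director/actor column a row came from.
movie_table = movie_long[["id", "title"]]
director_table = movie_long[["id", "num", "director"]]
actor_table = movie_long[["id", "num", "actor"]]

# Step 5: condense by dropping repeated rows and missing values.
movie_table = movie_table.drop_duplicates().reset_index(drop=True)
director_table = director_table.dropna().reset_index(drop=True)
actor_table = actor_table.dropna().reset_index(drop=True)
```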

After step 5, the three observational units are in their own tables, but they still contain the same amount of data as the original (and a bit more), as seen in step 6. To return the correct number of bytes from the memory_usage method for object data type columns, you must set the deep parameter to True.
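The effect of the deep parameter can be seen in a small sketch. The data is made up, and the column is forced to the object data type so that the comparison holds regardless of the pandas version's default string type.

```python
import pandas as pd

# Hypothetical column of names stored with the object data type.
names = pd.DataFrame({
    "actor": pd.Series(["CCH Pounder", "Christoph Waltz"] * 1000,
                       dtype="object")
})

# Without deep=True, only the pointers to the strings are counted.
shallow = names.memory_usage().sum()

# With deep=True, the strings themselves are measured as well.
deep = names.memory_usage(deep=True).sum()

print(shallow, deep)
```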

Each actor/director needs only one entry in the respective table. We can't simply make a table of just actor name and Facebook likes, as there would be no way to link the actors back to the original movie. The relationship between movies and actors is called a many-to-many relationship. Each movie is associated with multiple actors, and each actor can appear in multiple movies. To resolve this relationship, an intermediate or associative table is created, which contains the unique identifiers (primary keys) of both the movie and the actor.

To create associative tables, we must uniquely identify each actor/director. One trick is to create a categorical data type out of each actor/director name with pd.Categorical. Categorical data types have an internal map from each value to an integer. This integer is found in the codes attribute, which is used as the unique ID. To set up the creation of the associative table, we add this unique ID to the actor/director tables.
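The categorical trick can be sketched as follows, using a made-up actor table. The column name actor_id is an assumption for illustration.

```python
import pandas as pd

# Hypothetical actor table in which one actor appears in two movies.
actor_table = pd.DataFrame({
    "id": [0, 0, 1, 1],
    "actor": ["CCH Pounder", "Johnny Depp",
              "Johnny Depp", "Christoph Waltz"],
})

# Each distinct name maps to one integer code.
actor_cat = pd.Categorical(actor_table["actor"])

# Add the code to the table as the unique actor ID.
actor_table.insert(1, "actor_id", actor_cat.codes)

print(actor_table)
```

Rows that share a name receive the same actor_id, so the integer can stand in for the name in other tables.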

Step 8 and step 9 create the associative tables by selecting both of the unique identifiers. Now, we can reduce the actor and director tables to just the unique names and Facebook likes. This new arrangement of tables uses 20% less memory than the original. Formal relational databases have entity-relationship diagrams to visualize the tables. In step 10, we use the simple ERDPlus tool to make the visualization, which greatly eases the understanding of the relationships between the tables.
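As a rough sketch of the associative table and the reduced actor table, assume an actor table that already carries a unique actor_id. All names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical actor table (made-up values), one row per movie/actor pair.
actor_table = pd.DataFrame({
    "id": [0, 0, 1, 1],
    "num": [1, 2, 1, 2],
    "actor": ["CCH Pounder", "Johnny Depp",
              "Johnny Depp", "Christoph Waltz"],
    "actor_fb_likes": [1000.0, 40000.0, 40000.0, 11000.0],
})
actor_table["actor_id"] = pd.Categorical(actor_table["actor"]).codes

# Associative table: only the two identifiers (plus num).
actor_associative = actor_table[["id", "actor_id", "num"]]

# Actor table reduced to one row per unique actor.
actor_table = (
    actor_table[["actor_id", "actor", "actor_fb_likes"]]
    .drop_duplicates(subset="actor_id")
    .reset_index(drop=True)
)

# Joining the two tables recovers every movie/actor pair.
rebuilt = actor_associative.merge(actor_table, on="actor_id")
```

The merge at the end is only a sanity check that no link between a movie and an actor was lost by the split.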
