Data exploration

The movie and rating datasets were downloaded from the MovieLens website (https://movielens.org). According to the data description on the MovieLens website, all the ratings are stored in the ratings.csv file. Each row of this file, after the header, represents one rating of one movie by one user.

The CSV dataset has the following columns: userId, movieId, rating, and timestamp. These are shown in Figure 2. The rows are ordered first by userId and, within each user, by movieId. Ratings are made on a five-star scale in half-star increments (from 0.5 stars up to 5.0 stars). The timestamps represent the number of seconds since midnight Coordinated Universal Time (UTC) on January 1, 1970. We have 105,339 ratings from 668 users on 10,325 movies:

Figure 2: A snapshot of the rating dataset
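To make the column layout concrete, the following is a minimal Python sketch of how such a file could be parsed, assuming only the four columns described above. The sample rows and the helper name load_ratings are made up for illustration and are not taken from the real dataset; in practice, the open ratings.csv file would be passed in instead of the in-memory sample.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical sample rows in the ratings.csv layout (not real MovieLens data).
SAMPLE = """userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
2,16,3.5,859046895
"""

def load_ratings(handle):
    """Parse rating rows into dicts with typed fields and a UTC datetime."""
    rows = []
    for rec in csv.DictReader(handle):
        rating = float(rec["rating"])
        # Ratings run from 0.5 to 5.0 in half-star steps.
        if not (0.5 <= rating <= 5.0 and (rating * 2).is_integer()):
            raise ValueError(f"unexpected rating: {rating}")
        rows.append({
            "userId": int(rec["userId"]),
            "movieId": int(rec["movieId"]),
            "rating": rating,
            # The timestamp column holds seconds since 1970-01-01 00:00 UTC.
            "ratedAt": datetime.fromtimestamp(int(rec["timestamp"]), tz=timezone.utc),
        })
    return rows

ratings = load_ratings(io.StringIO(SAMPLE))
n_users = len({r["userId"] for r in ratings})
n_movies = len({r["movieId"] for r in ratings})
print(len(ratings), "ratings from", n_users, "users on", n_movies, "movies")
```

Counting distinct userId and movieId values in this way is how totals such as the ones quoted above could be checked against the full file.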

Movie information, on the other hand, is contained in the movies.csv file. Each row, apart from the header, represents one movie and contains these columns: movieId, title, and genres (see Figure 3). Movie titles are either entered manually or imported from the website of The Movie Database at https://www.themoviedb.org/. The release year is included in parentheses after the title.

Since some movie titles are entered manually, errors or inconsistencies may exist in these titles. Readers are therefore advised to check the IMDb database (https://www.imdb.com/) to verify the titles and their corresponding release years:

Figure 3: Title and genres for top 20 movies
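Because the release year is embedded in the title string, it often helps to separate the two before checking titles against another source. The sketch below assumes the year appears as four digits in parentheses at the very end of the title; the function name split_title is hypothetical, and titles that do not follow this pattern are returned unchanged with no year.

```python
import re

# Assumption: the release year is the last "(dddd)" group in the title.
YEAR_AT_END = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")

def split_title(raw):
    """Return (title, year); year is None when no trailing year is found."""
    match = YEAR_AT_END.match(raw)
    if match is None:
        return raw.strip(), None
    return match.group("title"), int(match.group("year"))

for raw in ["Toy Story (1995)", "A Title Without Year"]:
    print(split_title(raw))
```

Rows for which the year comes back as None are natural candidates for the manual consistency check suggested above.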

Genres are given as a pipe-separated list and are selected from the following genre categories:

  • Action, Adventure, Animation, Children's, Comedy, and Crime
  • Documentary, Drama, Fantasy, Film-Noir, Horror, and Musical
  • Mystery, Romance, Sci-Fi, Thriller, Western, and War
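Since several genres share one field, the field has to be split before genres can be counted or filtered. The following minimal sketch assumes the pipe character (|) as the separator, which is how the MovieLens data description defines the genres column; the function name split_genres and the sample values are hypothetical.

```python
from collections import Counter

def split_genres(field, sep="|"):
    """Split one genres field into a list of genre names."""
    return [g.strip() for g in field.split(sep) if g.strip()]

# Hypothetical genres fields, for illustration only.
sample = ["Adventure|Animation|Comedy", "Comedy|Romance", "Drama"]

# Count how many movies carry each genre.
counts = Counter(g for field in sample for g in split_genres(field))
print(counts.most_common())
```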