Linear Regression – Revisited

In Lesson 1, From Data to Decisions – Getting Started with TensorFlow, we saw an example of linear regression. We observed how TensorFlow works on a randomly generated dataset, that is, fake data. We saw that regression is a type of supervised machine learning for predicting continuous-valued output. However, running a linear regression on fake data is like buying a new car and never driving it. This awesome machinery begs to manifest itself in the real world!

Fortunately, many datasets are available online on which to test your new-found knowledge of regression.

Therefore, in this section, we will see how defining a set of models reduces the search space of possible functions. Moreover, TensorFlow takes advantage of the differentiable property of these functions by running its efficient gradient descent optimizers to learn the parameters. To avoid overfitting our data, we regularize the cost function by penalizing large-valued parameters.

The linear regression example shown in Lesson 1, From Data to Decisions – Getting Started with TensorFlow, used tensors that contained just a single scalar value, but you can, of course, perform computations on arrays of any shape. In TensorFlow, operations such as addition and multiplication take two inputs and produce an output. In contrast, constants and variables take no input. We will also see an example of how TensorFlow can manipulate 2D arrays to perform linear regression-like operations.
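
To make this concrete, here is a minimal sketch (not taken from Lesson 1; the names X, W, and b are illustrative) of TensorFlow operations combining 2D arrays in a linear regression-like computation:

    import tensorflow as tf

    # A constant 2D input of shape [2, 3]: two samples with three features each
    X = tf.constant([[1., 2., 3.],
                     [4., 5., 6.]])
    # A weight variable of shape [3, 1] and a scalar bias variable
    W = tf.Variable(tf.ones([3, 1]))
    b = tf.Variable(0.5)

    # The linear regression-like operation: y = XW + b
    y = tf.add(tf.matmul(X, W), b)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y))  # [[ 6.5], [15.5]]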

Problem Statement

Online movie ratings and recommendations have become a serious business around the world. For example, Hollywood generates about $10 billion at the U.S. box office each year. Websites like Rotten Tomatoes aggregate movie reviews into one overall rating and also report on poor opening weekends. Although a single movie critic or a single negative review can't make or break a film, thousands of reviews and critics can.

Rotten Tomatoes, Metacritic, and IMDb each have their own way of aggregating film reviews and their own distinct rating system. On the other hand, Fandango, an NBCUniversal subsidiary, uses a five-star rating system in which most movies get at least three stars, according to a FiveThirtyEight analysis.

An exploratory analysis of the dataset used by Fandango shows that, out of 510 films, 437 got at least one review and, hilariously, 98% of those had a 3-star rating or higher and 75% had a 4-star rating or higher. This implies that, by Fandango's standards, it is almost impossible for a movie to be rated as a flop. Fandango's ratings are therefore biased and skewed:


Figure 3: Fandango's lopsided ratings curve

Since the ratings from Fandango are unreliable, we will instead predict our own ratings based on the IMDb rating. More specifically, this is a multivariate regression problem, since our predictive model will use multiple features (predictors) to make the rating prediction.

Fortunately, the data is small enough to fit in memory, so plain batch learning should do just fine. Considering these factors, we will see that linear regression meets our requirements. However, for more robust regression, you could still use deep neural network-based regression techniques, such as a deep belief network regressor.

Using Linear Regression for Movie Rating Prediction

Now, the first task is to download Fandango's rating dataset from GitHub at https://github.com/fivethirtyeight/data/tree/master/fandango. It contains every film that has a Rotten Tomatoes rating, an RT user rating, a Metacritic score, a Metacritic user score, an IMDb score, and at least 30 fan reviews on Fandango.
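
Alternatively, pandas can read the CSV directly over HTTP, so you can skip the manual download; note that the exact raw-content URL below is an assumption based on the repository layout linked above:

    import pandas as pd

    # Assumed raw-content URL for the CSV inside the linked repository
    url = ('https://raw.githubusercontent.com/fivethirtyeight/'
           'data/master/fandango/fandango_score_comparison.csv')
    df = pd.read_csv(url)
    print(df.shape)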

The dataset has 22 columns, which are described in the following table:

Table 1: Description of the columns in fandango_score_comparison.csv

FILM: The name of the film.
RottenTomatoes: The Rotten Tomatoes Tomatometer score for the film.
RottenTomatoes_User: The Rotten Tomatoes user score for the film.
Metacritic: The Metacritic critic score for the film.
Metacritic_User: The Metacritic user score for the film.
IMDB: The IMDb user score for the film.
Fandango_Stars: The number of stars the film had on its Fandango movie page.
Fandango_Ratingvalue: The Fandango rating value for the film, as pulled from the HTML of each page. This is the actual average score the movie obtained.
RT_norm: The Tomatometer score for the film, normalized to a 0 to 5 point system.
RT_user_norm: The Rotten Tomatoes user score for the film, normalized to a 0 to 5 point system.
Metacritic_norm: The Metacritic critic score for the film, normalized to a 0 to 5 point system.
Metacritic_user_nom: The Metacritic user score for the film, normalized to a 0 to 5 point system (note the typo in the column name, which we will fix shortly).
IMDB_norm: The IMDb user score for the film, normalized to a 0 to 5 point system.
RT_norm_round: The Rotten Tomatoes Tomatometer score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
RT_user_norm_round: The Rotten Tomatoes user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
Metacritic_norm_round: The Metacritic critic score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
Metacritic_user_norm_round: The Metacritic user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
IMDB_norm_round: The IMDb user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
Metacritic_user_vote_count: The number of user votes the film had on Metacritic.
IMDB_user_vote_count: The number of user votes the film had on IMDb.
Fandango_votes: The number of user votes the film had on Fandango.
Fandango_Difference: The difference between the displayed Fandango_Stars and the actual Fandango_Ratingvalue.

We have already seen that a typical linear regression problem in TensorFlow has the following workflow, which updates the parameters to minimize the given cost function:


Figure 4: The learning algorithm using linear regression in TensorFlow

Now, let's follow the preceding figure and reproduce this workflow for linear regression:

  1. Import the required libraries:
    import math                      # needed later for computing the RMSE
    import numpy as np
    import pandas as pd
    from scipy import stats
    import sklearn
    from sklearn.model_selection import train_test_split
    import tensorflow as tf
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
  2. Read the dataset and create a pandas DataFrame:
    df = pd.read_csv('fandango_score_comparison.csv')
    print(df.head())

    The output is as follows:


    Figure 5: A snapshot of the dataset showing the typo in the Metacritic_user_nom column

    If you look at the preceding DataFrame carefully, there is a typo in a column name that could cause confusion later: Metacritic_user_nom should actually be Metacritic_user_norm. Let's rename it to avoid any further confusion:

    df.rename(columns={'Metacritic_user_nom':'Metacritic_user_norm'}, inplace=True)

    Moreover, according to the statistical analysis at https://fivethirtyeight.com/features/fandango-movies-ratings/, not all the variables contribute equally; the following columns have more importance in ranking the movies:

    'Fandango_Stars',
    'RT_user_norm',
    'RT_norm',
    'IMDB_norm',
    'Metacritic_user_norm',
    'Metacritic_norm'

    Now we can check the correlation coefficients between these variables before building the LR model. First, let's create a ranking list for that:

    rankings_lst = ['Fandango_Stars',
                    'RT_user_norm',
                    'RT_norm',
                    'IMDB_norm',
                    'Metacritic_user_norm',
                    'Metacritic_norm']

    The following helper function plots a correlation matrix as a heatmap (the Pearson correlation coefficients themselves are computed by pandas' corr() method when we call it):

    def my_heatmap(df):
        # Plot the given (correlation) matrix as an annotated heatmap
        fig, axes = plt.subplots()
        sns.heatmap(df, annot=True)
        plt.show()
        plt.close()

    Let's call the preceding method to plot the matrix as follows:

    my_heatmap(df[rankings_lst].corr(method='pearson'))

    Note

    Pearson correlation coefficient: A measure of the strength of the linear relationship between two variables. If the relationship between the variables is not linear, the correlation coefficient does not adequately represent the strength of the relationship between them. It is often represented as ρ when measured on a population and r when measured on a sample. The range is -1 to 1, where an r of -1 indicates a perfect negative linear relationship, an r of 0 indicates no linear relationship, and an r of 1 indicates a perfect positive linear relationship between the variables.
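
    As a tiny illustration of this definition (the arrays below are made up for the example), stats.pearsonr from the scipy import in step 1 returns r together with a p-value:

    a = [1., 2., 3., 4., 5.]
    b = [2., 4., 6., 8., 10.]  # perfectly linear in a
    c = [5., 4., 3., 2., 1.]   # perfectly anti-linear in a
    print(stats.pearsonr(a, b)[0])  # 1.0
    print(stats.pearsonr(a, c)[0])  # -1.0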

    The following correlation matrix shows the correlation between the considered features, using the Pearson correlation coefficients:


    Figure 6: The correlation matrix on the ranking list movies

    So, the correlation between Fandango and Metacritic is positive here. Now, let's do another study by considering only the movies for which RT has given at least a 4-star rating:

    RT_lst = df['RT_norm'] >= 4.
    my_heatmap(df[RT_lst][rankings_lst].corr(method='pearson'))

    The output is the correlation matrix for the ranking-list movies whose RT rating is at least 4, again showing the correlation between the considered features using the Pearson correlation coefficients:


    Figure 7: The correlation matrix on the ranked movies with RT ratings of at least 4

    This time, we have obtained an anticorrelation (that is, a negative correlation) between Fandango and Metacritic, with a correlation coefficient of -0.23. This suggests that, relative to Metacritic, Fandango's ratings are significantly biased toward high values.

    Therefore, we could train our model without considering Fandango's rating, but let's first build the LR model including it. Later on, we will decide which option produces the better result.

  3. Preparing the training and test sets.

    Let's create a feature matrix, X, by selecting the relevant DataFrame columns:

    feature_cols = ['Fandango_Stars', 'RT_user_norm', 'RT_norm', 'Metacritic_user_norm', 'Metacritic_norm']
    X = df.loc[:, feature_cols]

    Here, I have used only the selected columns as features, and now we need to create a response vector, y:

    y = df['IMDB_norm']

    We are assuming that IMDb is the most reliable source of ratings and treat it as the baseline. Our ultimate target is to predict the rating of each movie and compare the predicted ratings with the response column, IMDB_norm.

    Now that we have the features and the response columns, it's time to split data into training and testing sets:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=43)

    Changing random_state changes how the pseudo-random numbers for the random sampling are generated, so a different value will give you a different split and, therefore, different final results.

    Note

    Random state: As the name suggests, this can be used for initializing the internal random number generator, which decides the splitting of data into train and test indices. This also means that every time you run it without specifying random_state, you will get a different result; this is expected behavior. So, we have the following three options (see the short sketch after this list):

    • If random_state is None (or np.random), a randomly-initialized RandomState object is returned
    • If random_state is an integer, it is used to seed a new RandomState object
    • If random_state is a RandomState object, it is passed through
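
    Here is a minimal sketch of that behavior on a toy array (the array and seed are illustrative):

    data = np.arange(10)
    # The same integer seed always yields the same split
    a_train, a_test = train_test_split(data, test_size=0.5, random_state=43)
    b_train, b_test = train_test_split(data, test_size=0.5, random_state=43)
    print(np.array_equal(a_train, b_train))  # True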

    Now, we need the dimension of the feature set, which will be passed through the tensors:

    dim = len(feature_cols)

    We need to include an extra dimension for the independent (intercept) coefficient:

    dim += 1
    

    So, we need to create an extra column of ones for the independent coefficient in both the training and test feature sets:

    X_train = X_train.assign(independent=pd.Series([1] * len(y_train), index=X_train.index))
    X_test = X_test.assign(independent=pd.Series([1] * len(y_test), index=X_test.index))

    So far, we have worked with pandas DataFrames, but feeding them into tensors is troublesome, so let's convert them into NumPy arrays instead:

    P_train = X_train.values   # as_matrix() is deprecated in recent pandas; .values is equivalent
    P_test = X_test.values

    q_train = np.array(y_train.values).reshape(-1, 1)
    q_test = np.array(y_test.values).reshape(-1, 1)
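
    Before feeding these arrays to TensorFlow, a quick sanity check of the shapes (an optional addition) confirms that they match what the placeholders in the next step will expect:

    # Feature matrices should be [n_samples, dim]; targets should be [n_samples, 1]
    print(P_train.shape, q_train.shape)
    print(P_test.shape, q_test.shape)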
  4. Creating placeholders for TensorFlow.

    Now that we have the training and test sets, we have to create placeholders for TensorFlow to feed the training samples across the tensors, before initializing the variables:

    P = tf.placeholder(tf.float32, [None, dim])  # feature matrix
    q = tf.placeholder(tf.float32, [None, 1])    # target ratings
    T = tf.Variable(tf.ones([dim, 1]))           # regression coefficients, initialized to ones

    Let's also add a bias term to the linear model (strictly speaking, the extra column of ones already acts as an intercept, so this bias is redundant but harmless), as follows:

    bias = tf.Variable(tf.constant(1.0, shape=[1]))  # scalar bias (the original used the undefined n_dim; a scalar broadcasts correctly)
    q_ = tf.add(tf.matmul(P, T), bias)
  5. Creating an optimizer.

    Let's create an optimizer for the objective function:

    cost = tf.reduce_mean(tf.square(q_ - q))  # mean squared error
    learning_rate = 0.0001
    training_op = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
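
    Since ordinary least squares also has a closed-form solution, you can optionally cross-check what gradient descent should converge to using NumPy (a sanity-check sketch, not part of the original workflow; it reuses P_train and q_train from step 3):

    # Closed-form least-squares solution minimizing ||P_train w - q_train||^2
    w_opt, res, rank, sv = np.linalg.lstsq(P_train, q_train, rcond=None)
    print(w_opt)  # one coefficient per feature column, including the independent one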
  6. Initializing global variables:
    init_op = tf.global_variables_initializer()
    cost_history = np.empty(shape=[1],dtype=float)
  7. Training the LR model.

    Here, we iterate the training 50,000 times and track several quantities, such as the mean squared error, which signifies how good the training is; we also keep the cost history for later visualization, and so on:

    training_epochs = 50000
    with tf.Session() as sess:
        sess.run(init_op)
        cost_history = np.empty(shape=[1], dtype=float)
        t_history = np.empty(shape=[dim, 1], dtype=float)
        for epoch in range(training_epochs):
            sess.run(training_op, feed_dict={P: P_train, q: q_train})
            # Record the training cost and the coefficients at each epoch
            cost_history = np.append(cost_history, sess.run(cost, feed_dict={P: P_train, q: q_train}))
            t_history = np.append(t_history, sess.run(T), axis=1)
        # Predict on the test set; q_pred has shape (n_test,)
        q_pred = sess.run(q_, feed_dict={P: P_test})[:, 0]
        # Compare against q_test[:, 0] so that the shapes match (avoids accidental broadcasting)
        mse = tf.reduce_mean(tf.square(q_pred - q_test[:, 0]))
        mse_temp = mse.eval()

    Finally, we evaluate mse to get the scalar value out of the training evaluation on the test set. Now, let's print the MSE and compute the RMSE, as follows:

    print(mse_temp)
    RMSE = math.sqrt(mse_temp)
    print(RMSE)
    >>> 
    0.425983107542
    0.6526738140461913
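
    As an optional cross-check (not in the original; it reuses q_test and q_pred from the previous step), scikit-learn's metrics module computes the same quantities:

    from sklearn.metrics import mean_squared_error

    mse_sk = mean_squared_error(q_test[:, 0], q_pred)
    print(mse_sk, math.sqrt(mse_sk))  # should closely match the MSE and RMSE above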

    You can also change the feature columns, as follows:

    feature_cols = ['RT_user_norm', 'RT_norm', 'Metacritic_user_norm', 'Metacritic_norm']

    Now that we are not considering Fandango's stars, I obtained the following MSE and RMSE values, respectively:

    0.426362842426
    0.6529646563375979
  8. Observing the training cost throughout iterations:
    fig, axes = plt.subplots()
    plt.plot(range(len(cost_history)), cost_history)
    axes.set_xlim(xmin=0.95)
    axes.set_ylim(ymin=1.e-2)
    axes.set_xscale("log", nonposx='clip')
    axes.set_yscale("log", nonposy='clip')
    axes.set_ylabel('Training cost')
    axes.set_xlabel('Iterations')
    axes.set_title('Learning rate = ' + str(learning_rate))
    plt.show()
    plt.close()

    The output is as follows:


    Figure 8: The training cost becomes saturated after 10,000 iterations

    The preceding graph shows that the training cost becomes saturated after about 10,000 iterations. This means that, even if you iterate the model more than 10,000 times, the cost is not going to decrease significantly.
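
    Given that observation, one practical variant (an illustrative sketch, not the book's code; the tolerance value is made up) is to stop training as soon as the per-epoch improvement in the cost becomes negligible:

    tolerance = 1e-6  # made-up convergence threshold
    prev_cost = float('inf')
    with tf.Session() as sess:
        sess.run(init_op)
        for epoch in range(training_epochs):
            sess.run(training_op, feed_dict={P: P_train, q: q_train})
            curr_cost = sess.run(cost, feed_dict={P: P_train, q: q_train})
            if prev_cost - curr_cost < tolerance:  # negligible improvement
                print('Converged at epoch', epoch)
                break
            prev_cost = curr_cost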

  9. Evaluating the model:
    predictedDF = X_test.copy(deep=True)
    predictedDF.insert(loc=0, column='IMDB_norm_predicted', value=pd.Series(data=q_pred, index=predictedDF.index))
    # Flatten q_test to one dimension so that it can be inserted as a single column
    predictedDF.insert(loc=0, column='IMDB_norm_actual', value=q_test[:, 0])

    print('Predicted vs actual rating using LR with TensorFlow')
    print(predictedDF[['IMDB_norm_actual', 'IMDB_norm_predicted']].head())
    print(predictedDF[['IMDB_norm_actual', 'IMDB_norm_predicted']].tail())
    >>>

    The following shows the predicted versus actual rating using LR:

              IMDB_norm_actual  IMDB_norm_predicted
    45              3.30              3.232061
    50              3.35              3.381659
    98              3.05              2.869175
    119             3.60              3.796200
    133             2.15              2.521702
    140             4.30              4.033006
    143             3.70              3.816177
    42              4.10              3.996275
    90              3.05              3.226954
    40              3.45              3.509809

    We can see that the prediction is a continuous value. Now it's time to see how well the LR model generalizes and fits to the regression line:

    The following code shows how well the LR fit matches the actual data points:

    plt.scatter(q_test, q_pred, color='blue', alpha=0.5)
    plt.plot([q_test.min(), q_test.max()], [q_test.min(), q_test.max()], '--', lw=1)  # the ideal y = x line
    plt.title('Predicted vs Actual')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.show()

    The output is as follows:


    Figure 9: Prediction made by the LR model

    The graph alone does not tell us whether the predictions made by the LR model are good or bad. However, we can still improve the performance of such models using layered architectures such as deep neural networks.
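
    As a minimal sketch of what such a deeper architecture could look like in the same TensorFlow 1.x style (the hidden-layer size is illustrative; it reuses P, q, dim, and learning_rate from the earlier steps and is not the book's implementation):

    # One hidden layer with ReLU activation, followed by a linear output layer
    n_hidden = 16  # illustrative size
    W1 = tf.Variable(tf.random_normal([dim, n_hidden], stddev=0.1))
    b1 = tf.Variable(tf.zeros([n_hidden]))
    hidden = tf.nn.relu(tf.matmul(P, W1) + b1)

    W2 = tf.Variable(tf.random_normal([n_hidden, 1], stddev=0.1))
    b2 = tf.Variable(tf.zeros([1]))
    q_deep = tf.matmul(hidden, W2) + b2

    # Train exactly as before, but on the deeper model's cost
    # (remember to re-run tf.global_variables_initializer() after defining new variables)
    deep_cost = tf.reduce_mean(tf.square(q_deep - q))
    deep_training_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(deep_cost)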

    The next example is about applying other supervised learning algorithms, such as logistic regression, support vector machines, and random forests, for predictive analytics.
