Linear Regression – Revisited

In Lesson 1, From Data to Decisions – Getting Started with TensorFlow, we saw an example of linear regression. We observed how TensorFlow works on a randomly generated dataset, that is, fake data. We saw that regression is a type of supervised machine learning for predicting continuous-valued output. However, running a linear regression on fake data is like buying a new car and never driving it. This awesome machinery begs to manifest itself in the real world!

Fortunately, many datasets are available online on which to test your new-found knowledge of regression.

Therefore, in this section, we will see how defining a set of models reduces the search space of possible functions. Moreover, TensorFlow takes advantage of the differentiable property of these functions by running its efficient gradient descent optimizers to learn the parameters. To avoid overfitting our data, we regularize the cost function by penalizing large-valued parameters.

The linear regression example shown in Lesson 1, From Data to Decisions – Getting Started with TensorFlow, used tensors that contained just a single scalar value, but you can, of course, perform computations on arrays of any shape. In TensorFlow, operations such as addition and multiplication take two inputs and produce an output. In contrast, constants and variables take no input. We will also see an example of how TensorFlow can manipulate 2D arrays to perform linear regression-like operations.
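
To make this concrete, here is a minimal sketch (not taken from Lesson 1; the names X, W, and b are illustrative) of TensorFlow operations combining 2D arrays in a linear regression-like computation:

    import tensorflow as tf

    # A constant 2D input of shape [2, 3]: two samples with three features each
    X = tf.constant([[1., 2., 3.],
                     [4., 5., 6.]])
    # A weight variable of shape [3, 1] and a scalar bias variable
    W = tf.Variable(tf.ones([3, 1]))
    b = tf.Variable(0.5)

    # The linear regression-like operation: y = XW + b
    y = tf.add(tf.matmul(X, W), b)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y))  # [[ 6.5], [15.5]]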

Problem Statement

Online movie ratings and recommendations have become a serious business around the world. For example, Hollywood generates about $10 billion at the U.S. box office each year. Websites like Rotten Tomatoes aggregate movie reviews into one overall rating and also report on poor opening weekends. Although a single movie critic or a single negative review can't make or break a film, thousands of reviews and critics can.

Rotten Tomatoes, Metacritic, and IMDb each have their own way of aggregating film reviews and their own distinct rating system. On the other hand, Fandango, an NBCUniversal subsidiary, uses a five-star rating system in which most movies get at least three stars, according to a FiveThirtyEight analysis.

An exploratory analysis of the dataset used by Fandango shows that, out of 510 films, 437 got at least one review and, hilariously, 98% of those had a 3-star rating or higher and 75% had a 4-star rating or higher. This implies that, by Fandango's standards, it is almost impossible for a movie to be rated as a flop. Fandango's ratings are therefore biased and skewed:


Figure 3: Fandango's lopsided ratings curve

Since the ratings from Fandango are unreliable, we will instead predict our own ratings based on the IMDb rating. More specifically, this is a multivariate regression problem, since our predictive model will use multiple features (predictors) to make the rating prediction.

Fortunately, the data is small enough to fit in memory, so plain batch learning should do just fine. Considering these factors, we will see that linear regression meets our requirements. However, for more robust regression, you could still use deep neural network-based regression techniques, such as a deep belief network regressor.

Using Linear Regression for Movie Rating Prediction

Now, the first task is to download Fandango's rating dataset from GitHub at https://github.com/fivethirtyeight/data/tree/master/fandango. It contains every film that has a Rotten Tomatoes rating, an RT user rating, a Metacritic score, a Metacritic user score, an IMDb score, and at least 30 fan reviews on Fandango.
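
Alternatively, pandas can read the CSV directly over HTTP, so you can skip the manual download; note that the exact raw-content URL below is an assumption based on the repository layout linked above:

    import pandas as pd

    # Assumed raw-content URL for the CSV inside the linked repository
    url = ('https://raw.githubusercontent.com/fivethirtyeight/'
           'data/master/fandango/fandango_score_comparison.csv')
    df = pd.read_csv(url)
    print(df.shape)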

The dataset has 22 columns, which are described in the following table:

Table 1: Description of the columns in fandango_score_comparison.csv

FILM: The name of the film.
RottenTomatoes: The Rotten Tomatoes Tomatometer score for the film.
RottenTomatoes_User: The Rotten Tomatoes user score for the film.
Metacritic: The Metacritic critic score for the film.
Metacritic_User: The Metacritic user score for the film.
IMDB: The IMDb user score for the film.
Fandango_Stars: The number of stars the film had on its Fandango movie page.
Fandango_Ratingvalue: The Fandango rating value for the film, as pulled from the HTML of each page. This is the actual average score the movie obtained.
RT_norm: The Tomatometer score for the film, normalized to a 0 to 5 point system.
RT_user_norm: The Rotten Tomatoes user score for the film, normalized to a 0 to 5 point system.
Metacritic_norm: The Metacritic critic score for the film, normalized to a 0 to 5 point system.
Metacritic_user_nom: The Metacritic user score for the film, normalized to a 0 to 5 point system (note the typo in the column name, which we will fix shortly).
IMDB_norm: The IMDb user score for the film, normalized to a 0 to 5 point system.
RT_norm_round: The Rotten Tomatoes Tomatometer score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
RT_user_norm_round: The Rotten Tomatoes user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
Metacritic_norm_round: The Metacritic critic score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
Metacritic_user_norm_round: The Metacritic user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
IMDB_norm_round: The IMDb user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star.
Metacritic_user_vote_count: The number of user votes the film had on Metacritic.
IMDB_user_vote_count: The number of user votes the film had on IMDb.
Fandango_votes: The number of user votes the film had on Fandango.
Fandango_Difference: The difference between the displayed Fandango_Stars and the actual Fandango_Ratingvalue.

We have already seen that a typical linear regression problem in TensorFlow has the following workflow, which updates the parameters to minimize the given cost function:


Figure 4: The learning algorithm using linear regression in TensorFlow

Now, let's follow the preceding figure and reproduce this workflow for linear regression:

  1. Import the required libraries:
    import math                      # needed later for computing the RMSE
    import numpy as np
    import pandas as pd
    from scipy import stats
    import sklearn
    from sklearn.model_selection import train_test_split
    import tensorflow as tf
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
  2. Read the dataset and create a pandas DataFrame:
    df = pd.read_csv('fandango_score_comparison.csv')
    print(df.head())

    The output is as follows:


    Figure 5: A snapshot of the dataset showing the typo in the Metacritic_user_nom column

    If you look at the preceding DataFrame carefully, there is a typo in a column name that could cause confusion later: Metacritic_user_nom should actually be Metacritic_user_norm. Let's rename it to avoid any further confusion:

    df.rename(columns={'Metacritic_user_nom':'Metacritic_user_norm'}, inplace=True)

    Moreover, according to the statistical analysis at https://fivethirtyeight.com/features/fandango-movies-ratings/, not all the variables contribute equally; the following columns have more importance in ranking the movies:

    'Fandango_Stars',
    'RT_user_norm',
    'RT_norm',
    'IMDB_norm',
    'Metacritic_user_norm',
    'Metacritic_norm'

    Now we can check the correlation coefficients between these variables before building the LR model. First, let's create a ranking list for that:

    rankings_lst = ['Fandango_Stars',
                    'RT_user_norm',
                    'RT_norm',
                    'IMDB_norm',
                    'Metacritic_user_norm',
                    'Metacritic_norm']

    The following helper function plots a correlation matrix as a heatmap (the Pearson correlation coefficients themselves are computed by pandas' corr() method when we call it):

    def my_heatmap(df):
        # Plot the given (correlation) matrix as an annotated heatmap
        fig, axes = plt.subplots()
        sns.heatmap(df, annot=True)
        plt.show()
        plt.close()

    Let's call the preceding method to plot the matrix as follows:

    my_heatmap(df[rankings_lst].corr(method='pearson'))

    Note

    Pearson correlation coefficient: A measure of the strength of the linear relationship between two variables. If the relationship between the variables is not linear, the correlation coefficient does not adequately represent the strength of the relationship between them. It is often represented as ρ when measured on a population and r when measured on a sample. The range is -1 to 1, where an r of -1 indicates a perfect negative linear relationship, an r of 0 indicates no linear relationship, and an r of 1 indicates a perfect positive linear relationship between the variables.
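
    As a tiny illustration of this definition (the arrays below are made up for the example), stats.pearsonr from the scipy import in step 1 returns r together with a p-value:

    a = [1., 2., 3., 4., 5.]
    b = [2., 4., 6., 8., 10.]  # perfectly linear in a
    c = [5., 4., 3., 2., 1.]   # perfectly anti-linear in a
    print(stats.pearsonr(a, b)[0])  # 1.0
    print(stats.pearsonr(a, c)[0])  # -1.0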

    The following correlation matrix shows the correlation between the considered features, using the Pearson correlation coefficients:


    Figure 6: The correlation matrix on the ranking list movies

    So, the correlation between Fandango and Metacritic is positive here. Now, let's do another study by considering only the movies for which RT has given at least a 4-star rating:

    RT_lst = df['RT_norm'] >= 4.
    my_heatmap(df[RT_lst][rankings_lst].corr(method='pearson'))

    The output is the correlation matrix for the ranking-list movies whose RT rating is at least 4, again showing the correlation between the considered features using the Pearson correlation coefficients:


    Figure 7: The correlation matrix on the ranked movies with RT ratings of at least 4

    This time, we have obtained an anticorrelation (that is, a negative correlation) between Fandango and Metacritic, with a correlation coefficient of -0.23. This suggests that, relative to Metacritic, Fandango's ratings are significantly biased toward high values.

    Therefore, we could train our model without considering Fandango's rating, but let's first build the LR model including it. Later on, we will decide which option produces the better result.

  3. Preparing the training and test sets.

    Let's create a feature matrix, X, by selecting the relevant DataFrame columns:

    feature_cols = ['Fandango_Stars', 'RT_user_norm', 'RT_norm', 'Metacritic_user_norm', 'Metacritic_norm']
    X = df.loc[:, feature_cols]

    Here, I have used only the selected columns as features, and now we need to create a response vector, y:

    y = df['IMDB_norm']

    We are assuming that IMDb is the most reliable source of ratings and treat it as the baseline. Our ultimate target is to predict the rating of each movie and compare the predicted ratings with the response column, IMDB_norm.

    Now that we have the features and the response columns, it's time to split data into training and testing sets:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=43)

    Changing random_state changes how the pseudo-random numbers for the random sampling are generated, so a different value will give you a different split and, therefore, different final results.

    Note

    Random state: As the name suggests, this can be used for initializing the internal random number generator, which decides the splitting of data into train and test indices. This also means that every time you run it without specifying random_state, you will get a different result; this is expected behavior. So, we have the following three options (see the short sketch after this list):

    • If random_state is None (or np.random), a randomly-initialized RandomState object is returned
    • If random_state is an integer, it is used to seed a new RandomState object
    • If random_state is a RandomState object, it is passed through
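
    Here is a minimal sketch of that behavior on a toy array (the array and seed are illustrative):

    data = np.arange(10)
    # The same integer seed always yields the same split
    a_train, a_test = train_test_split(data, test_size=0.5, random_state=43)
    b_train, b_test = train_test_split(data, test_size=0.5, random_state=43)
    print(np.array_equal(a_train, b_train))  # True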

    Now, we need the dimension of the feature set, which will be passed through the tensors:

    dim = len(feature_cols)

    We need to include an extra dimension for the independent (intercept) coefficient:

    dim += 1
    

    So, we need to create an extra column of ones for the independent coefficient in both the training and test feature sets:

    X_train = X_train.assign(independent=pd.Series([1] * len(y_train), index=X_train.index))
    X_test = X_test.assign(independent=pd.Series([1] * len(y_test), index=X_test.index))

    So far, we have worked with pandas DataFrames, but feeding them into tensors is troublesome, so let's convert them into NumPy arrays instead:

    P_train = X_train.values   # as_matrix() is deprecated in recent pandas; .values is equivalent
    P_test = X_test.values

    q_train = np.array(y_train.values).reshape(-1, 1)
    q_test = np.array(y_test.values).reshape(-1, 1)
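
    Before feeding these arrays to TensorFlow, a quick sanity check of the shapes (an optional addition) confirms that they match what the placeholders in the next step will expect:

    # Feature matrices should be [n_samples, dim]; targets should be [n_samples, 1]
    print(P_train.shape, q_train.shape)
    print(P_test.shape, q_test.shape)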
  4. Creating placeholders for TensorFlow.

    Now that we have the training and test sets, we have to create placeholders for TensorFlow to feed the training samples across the tensors, before initializing the variables:

    P = tf.placeholder(tf.float32, [None, dim])  # feature matrix
    q = tf.placeholder(tf.float32, [None, 1])    # target ratings
    T = tf.Variable(tf.ones([dim, 1]))           # regression coefficients, initialized to ones

    Let's also add a bias term to the linear model (strictly speaking, the extra column of ones already acts as an intercept, so this bias is redundant but harmless), as follows:

    bias = tf.Variable(tf.constant(1.0, shape=[1]))  # scalar bias (the original used the undefined n_dim; a scalar broadcasts correctly)
    q_ = tf.add(tf.matmul(P, T), bias)
  5. Creating an optimizer.

    Let's create an optimizer for the objective function:

    cost = tf.reduce_mean(tf.square(q_ - q))  # mean squared error
    learning_rate = 0.0001
    training_op = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
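
    Since ordinary least squares also has a closed-form solution, you can optionally cross-check what gradient descent should converge to using NumPy (a sanity-check sketch, not part of the original workflow; it reuses P_train and q_train from step 3):

    # Closed-form least-squares solution minimizing ||P_train w - q_train||^2
    w_opt, res, rank, sv = np.linalg.lstsq(P_train, q_train, rcond=None)
    print(w_opt)  # one coefficient per feature column, including the independent one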
  6. Initializing global variables:
    init_op = tf.global_variables_initializer()
    cost_history = np.empty(shape=[1],dtype=float)
  7. Training the LR model.

    Here, we iterate the training 50,000 times and track several quantities, such as the mean squared error, which signifies how good the training is; we also keep the cost history for later visualization, and so on:

    training_epochs = 50000
    with tf.Session() as sess:
        sess.run(init_op)
        cost_history = np.empty(shape=[1], dtype=float)
        t_history = np.empty(shape=[dim, 1], dtype=float)
        for epoch in range(training_epochs):
            sess.run(training_op, feed_dict={P: P_train, q: q_train})
            # Record the training cost and the coefficients at each epoch
            cost_history = np.append(cost_history, sess.run(cost, feed_dict={P: P_train, q: q_train}))
            t_history = np.append(t_history, sess.run(T), axis=1)
        # Predict on the test set; q_pred has shape (n_test,)
        q_pred = sess.run(q_, feed_dict={P: P_test})[:, 0]
        # Compare against q_test[:, 0] so that the shapes match (avoids accidental broadcasting)
        mse = tf.reduce_mean(tf.square(q_pred - q_test[:, 0]))
        mse_temp = mse.eval()

    Finally, we evaluate mse to get the scalar value out of the training evaluation on the test set. Now, let's print the MSE and compute the RMSE, as follows:

    print(mse_temp)
    RMSE = math.sqrt(mse_temp)
    print(RMSE)
    >>> 
    0.425983107542
    0.6526738140461913
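
    As an optional cross-check (not in the original; it reuses q_test and q_pred from the previous step), scikit-learn's metrics module computes the same quantities:

    from sklearn.metrics import mean_squared_error

    mse_sk = mean_squared_error(q_test[:, 0], q_pred)
    print(mse_sk, math.sqrt(mse_sk))  # should closely match the MSE and RMSE above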

    You can also change the feature columns, as follows:

    feature_cols = ['RT_user_norm', 'RT_norm', 'Metacritic_user_norm', 'Metacritic_norm']

    Now that we are not considering Fandango's stars, I obtained the following MSE and RMSE values, respectively:

    0.426362842426
    0.6529646563375979
  8. Observing the training cost throughout iterations:
    fig, axes = plt.subplots()
    plt.plot(range(len(cost_history)), cost_history)
    axes.set_xlim(xmin=0.95)
    axes.set_ylim(ymin=1.e-2)
    axes.set_xscale("log", nonposx='clip')
    axes.set_yscale("log", nonposy='clip')
    axes.set_ylabel('Training cost')
    axes.set_xlabel('Iterations')
    axes.set_title('Learning rate = ' + str(learning_rate))
    plt.show()
    plt.close()

    The output is as follows:


    Figure 8: The training cost becomes saturated after 10,000 iterations

    The preceding graph shows that the training cost becomes saturated after about 10,000 iterations. This means that, even if you iterate the model more than 10,000 times, the cost is not going to decrease significantly.
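
    Given that observation, one practical variant (an illustrative sketch, not the book's code; the tolerance value is made up) is to stop training as soon as the per-epoch improvement in the cost becomes negligible:

    tolerance = 1e-6  # made-up convergence threshold
    prev_cost = float('inf')
    with tf.Session() as sess:
        sess.run(init_op)
        for epoch in range(training_epochs):
            sess.run(training_op, feed_dict={P: P_train, q: q_train})
            curr_cost = sess.run(cost, feed_dict={P: P_train, q: q_train})
            if prev_cost - curr_cost < tolerance:  # negligible improvement
                print('Converged at epoch', epoch)
                break
            prev_cost = curr_cost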

  9. Evaluating the model:
    predictedDF = X_test.copy(deep=True)
    predictedDF.insert(loc=0, column='IMDB_norm_predicted', value=pd.Series(data=q_pred, index=predictedDF.index))
    # Flatten q_test to one dimension so that it can be inserted as a single column
    predictedDF.insert(loc=0, column='IMDB_norm_actual', value=q_test[:, 0])

    print('Predicted vs actual rating using LR with TensorFlow')
    print(predictedDF[['IMDB_norm_actual', 'IMDB_norm_predicted']].head())
    print(predictedDF[['IMDB_norm_actual', 'IMDB_norm_predicted']].tail())
    >>>

    The following shows the predicted versus actual rating using LR:

              IMDB_norm_actual  IMDB_norm_predicted
    45              3.30              3.232061
    50              3.35              3.381659
    98              3.05              2.869175
    119             3.60              3.796200
    133             2.15              2.521702
    140             4.30              4.033006
    143             3.70              3.816177
    42              4.10              3.996275
    90              3.05              3.226954
    40              3.45              3.509809

    We can see that the prediction is a continuous value. Now it's time to see how well the LR model generalizes and fits to the regression line:

    The following code shows how well the LR fit matches the actual data points:

    plt.scatter(q_test, q_pred, color='blue', alpha=0.5)
    plt.plot([q_test.min(), q_test.max()], [q_test.min(), q_test.max()], '--', lw=1)  # the ideal y = x line
    plt.title('Predicted vs Actual')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.show()

    The output is as follows:


    Figure 9: Prediction made by the LR model

    The graph alone does not tell us whether the predictions made by the LR model are good or bad. However, we can still improve the performance of such models using layered architectures such as deep neural networks.
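
    As a minimal sketch of what such a deeper architecture could look like in the same TensorFlow 1.x style (the hidden-layer size is illustrative; it reuses P, q, dim, and learning_rate from the earlier steps and is not the book's implementation):

    # One hidden layer with ReLU activation, followed by a linear output layer
    n_hidden = 16  # illustrative size
    W1 = tf.Variable(tf.random_normal([dim, n_hidden], stddev=0.1))
    b1 = tf.Variable(tf.zeros([n_hidden]))
    hidden = tf.nn.relu(tf.matmul(P, W1) + b1)

    W2 = tf.Variable(tf.random_normal([n_hidden, 1], stddev=0.1))
    b2 = tf.Variable(tf.zeros([1]))
    q_deep = tf.matmul(hidden, W2) + b2

    # Train exactly as before, but on the deeper model's cost
    # (remember to re-run tf.global_variables_initializer() after defining new variables)
    deep_cost = tf.reduce_mean(tf.square(q_deep - q))
    deep_training_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(deep_cost)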

    The next example is about applying other supervised learning algorithms, such as logistic regression, support vector machines, and random forests, for predictive analytics.
