To illustrate why this is more useful, let's run the following scenario on two sample series:
a = pd.Series([10, 10, 10, 10])
b = pd.Series([12, 8, 8, 12])

np.sqrt(np.mean((b - a)**2)) / np.mean(a)
This generates the following output:
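Working through the arithmetic by hand shows what this number means. The following sketch simply decomposes the expression above, treating a as the actual values and b as the predictions:

errors = b - a                  # [2, -2, -2, 2]
mse = np.mean(errors ** 2)      # mean of [4, 4, 4, 4] = 4
rmse = np.sqrt(mse)             # 2
rmse / np.mean(a)               # 2 / 10 = 0.2, an error of about 20% of the typical value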
Now, compare that to the simple mean of the errors:
(b-a).mean()
This generates the following output:
Clearly, the first statistic is the more meaningful one: the simple mean comes out to 0 because the positive and negative errors cancel exactly, even though every prediction is off by 2, while the RMSE divided by the mean correctly reports an error of roughly 20%. Now, let's run it for our model:
np.sqrt(np.mean((y_pred-y_actual)**2))/np.mean(y_actual)
This generates the following output:
Suddenly, our awesome model looks a lot less awesome. Let's take a look at some of the predictions our model made versus the actual values in the data.
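The plotting call below uses a deltas DataFrame that holds the predicted and actual values side by side. It was presumably assembled earlier from the test-set results; a minimal sketch of how it could be built, assuming y_pred and y_actual are the aligned arrays used in the error calculation above, might look like this:

# pair each prediction with its actual value (assumes identical ordering/index)
deltas = pd.DataFrame({'predicted': y_pred,
                       'actual': y_actual})

With deltas in place, we can plot the first 30 predictions against the actuals: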
deltas[['predicted','actual']].iloc[:30,:].plot(kind='bar', figsize=(16,8))
The preceding code generates the following output:
Based on what we can see here, the model (at least for this sample) tends to modestly underpredict the virality of the typical article, but heavily underpredicts it for a small number of articles. Let's see which ones those are:
all_data.loc[test_index[:30],['title', 'fb']].reset_index(drop=True)
The preceding code results in the following output:
From the preceding output, we can see that an article about Malala and an article about a husband complaining about how much his stay-at-home wife costs him far exceeded our model's predictions. Both would seem to have high emotional valence.
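Rather than eyeballing the bar chart, the worst underpredictions can also be pulled out programmatically. A quick sketch using the deltas frame from above (with the column names assumed there):

# articles whose actual share counts most exceed the model's predictions
residuals = deltas['actual'] - deltas['predicted']
residuals.nlargest(10)

Assuming deltas is in the same order as test_index, the resulting positions can then be looked up against the title column of all_data, just as in the snippet above.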