We started getting into the nuances of forecasting in the previous chapter where we saw how to generate multi-step forecasts. While that covers one of the aspects, there is another aspect of forecasting that is as important as it is confusing – how to evaluate forecasts.
In the real world, we generate forecasts to enable downstream processes to plan better and take relevant actions. For instance, the operations manager at a bike rental company has to decide how many bikes to make available at the metro station at 4 p.m. the next day. However, instead of using the forecasts blindly, he may want to know which forecasts he should trust and which ones he shouldn’t. This can only be done by measuring how good a forecast is.
We have been using a few metrics throughout the book, and it is now time to get into the details of those metrics – what they measure, when to use them, and when to avoid them. We will also elucidate a few aspects of these metrics experimentally.
In this chapter, we will be covering these main topics:
You will need to set up the Anaconda environment following the instructions in the Preface of the book to get a working environment with all packages and datasets required for the code in this book.
The associated code for the chapter can be found here: https://github.com/PacktPublishing/Modern-Time-Series-Forecasting-with-Python/tree/main/notebooks/Chapter18.
For this chapter, you need to run the notebooks in the Chapter02 and Chapter04 folders from the book’s GitHub repository.
Traditionally, in regression problems, we have a few general loss functions, such as mean squared error or mean absolute error, but when you step into the world of time series forecasting, you are hit with a myriad of different metrics.
Important note
Since the focus of the book is on point predictions (and not probabilistic predictions), we will stick to reviewing point forecast metrics.
There are a few key factors that distinguish the metrics in time series forecasting:
These factors, along with a few others, have led to an explosion in the number of forecast metrics. In a recent survey paper by Hewamalage et al. (#1 in References), the number of metrics covered stands at 38. Let’s try to unify these metrics under some structure. Figure 18.1 depicts a taxonomy of forecast error measures:
Figure 18.1 – Taxonomy of forecast error measures
We can semantically separate the different forecast metrics into two buckets – intrinsic and extrinsic. Intrinsic metrics measure the generated forecast using nothing but the forecast itself and the corresponding actuals. As the name suggests, these are very inward-looking metrics. Extrinsic metrics, on the other hand, use an external reference or benchmark, in addition to the generated forecast and the ground truth, to measure the quality of the forecast.
Before we start with the metrics, let’s establish some notation to help us along. $y_t$ and $\hat{y}_t$ are the actual observation and the forecast at time $t$. The forecast horizon is denoted by $H$. In cases where we have a dataset of time series, we assume there are $M$ time series, indexed by $m$, and finally, $e_t = y_t - \hat{y}_t$ denotes the error at timestep $t$. Now, let’s start with the intrinsic metrics.
There are four major base errors – absolute error, squared error, percent error, and symmetric error – that are aggregated or summarized in different ways in a variety of metrics. Therefore, any property of these base errors also applies to the aggregate ones, so let’s look at these base errors first.
The error, $e_t = y_t - \hat{y}_t$, can be positive or negative, depending on whether $y_t > \hat{y}_t$ or not, but when we add up this error over the horizon, the positive and negative errors may cancel each other out, painting a rosier picture than reality. Therefore, we apply a function on top of $e_t$ to ensure that the errors do not cancel each other out.
The absolute function is one such function: $AE_t = \left| y_t - \hat{y}_t \right|$. The absolute error is a scale-dependent error, which means that the magnitude of the error depends on the scale of the time series. For instance, an AE of 10 doesn’t mean anything until you put it in context. For a time series with values of around 500 to 1,000, an AE of 10 may be a very good number, but if the time series has values around 50 to 70, it is bad.
Important note
Scale dependence is not a deal breaker when we are looking at individual time series, but when we are aggregating or comparing across multiple time series, scale-dependent errors skew the metric in favor of the large-scale time series. The interesting thing to note here is that this is not necessarily bad. Sometimes, the scale in the time series is meaningful and it makes sense from the business perspective to focus more on the large-scale time series than the smaller ones. For instance, in a retail scenario, one would be more interested in getting the high-selling product forecast right than those of the low-selling ones. In these cases, using a scale-dependent error automatically favors the high-selling products.
You can see this by carrying out an experiment on your own. Generate a random time series, $y$, and, similarly, a random forecast for it, $\hat{y}$. Now, multiply both the time series and the forecast by 100 to get a new pair, $100y$ and $100\hat{y}$. If we calculate the forecast metrics for both these pairs of time series and forecasts, the scale-dependent metrics will give very different values, whereas the scale-independent ones will give the same values.
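This experiment is easy to run as a quick sketch in NumPy (the value ranges, noise level, and the MAE/MAPE helpers below are illustrative choices, not the book’s notebook code):

```python
import numpy as np

rng = np.random.default_rng(42)

# y is a random time series and y_hat a noisy "forecast" for it
y = rng.uniform(50, 70, size=100)
y_hat = y + rng.normal(0, 5, size=100)

def mae(y, y_hat):
    # mean absolute error - scale-dependent
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    # mean absolute percent error - scale-independent
    return np.mean(np.abs((y - y_hat) / y)) * 100

# Scaling both series by 100 scales the MAE by 100 but leaves MAPE untouched
print(mae(y, y_hat), mae(100 * y, 100 * y_hat))
print(mape(y, y_hat), mape(100 * y, 100 * y_hat))
```
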
Many metrics are based on this error:
Here, $w_t$ is the weight of a particular timestep. This can be used to assign more weight to special days (such as weekends or promotion days).
To calculate ND, we sum all the absolute errors across the horizons and time series and scale the total by the actual observations across the horizons and time series:

$$\text{ND} = \frac{\sum_{m=1}^{M}\sum_{t=1}^{H} \left| y_t^m - \hat{y}_t^m \right|}{\sum_{m=1}^{M}\sum_{t=1}^{H} \left| y_t^m \right|}$$
Squaring is another function that makes the error positive and thereby prevents the errors from canceling each other out:

$$SE_t = \left( y_t - \hat{y}_t \right)^2$$
There are many metrics that are based on this error:
While absolute error and squared error are scale-dependent, percent error is a scale-free error measure. In percent error, we scale the error using the actual observation: $PE_t = \frac{y_t - \hat{y}_t}{y_t}$. Some of the metrics that use percent error are as follows:
Percent error has a few problems – it is asymmetric (we will see this in detail later in the chapter), and it breaks down when the actual observation is zero (due to division by zero). Symmetric error was proposed as an alternative to avoid this asymmetry, but as it turns out, symmetric error is itself asymmetric – more on that later. For now, let’s see what symmetric error is:

$$sPE_t = \frac{2\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|}$$
There are only two metrics that are popularly used under this base error:
There are a few other metrics that are intrinsic in nature but don’t conform to the other metrics. Notable among those are three metrics that measure the over- or under-forecasting aspect of forecasts:
Here, $n$ is the past window over which TS is calculated.
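As a rough sketch of such over-/under-forecasting measures, here are minimal NumPy implementations of the cumulative forecast error (CFE), forecast bias, and the tracking signal (exact definitions, sign conventions, and window handling vary across references, so treat these as illustrative):

```python
import numpy as np

def cfe(y, y_hat):
    """Cumulative forecast error: signed errors are allowed to cancel,
    so the sign reveals systematic over- or under-forecasting."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sum(y - y_hat)

def forecast_bias(y, y_hat):
    """CFE scaled by the total actuals, expressed as a percentage
    (one common formulation)."""
    return 100 * cfe(y, y_hat) / np.sum(np.asarray(y, float))

def tracking_signal(y, y_hat, n=12):
    """CFE over the last n steps divided by the mean absolute
    deviation over the same window."""
    y, y_hat = np.asarray(y, float)[-n:], np.asarray(y_hat, float)[-n:]
    return np.sum(y - y_hat) / np.mean(np.abs(y - y_hat))

# A persistently low forecast shows up as a positive bias
print(forecast_bias([10, 10, 10, 10], [8, 8, 8, 8]))  # 20.0
```
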
Now, let’s turn our attention to a few extrinsic metrics.
There are two major buckets of metrics under the extrinsic umbrella – relative error and scaled error.
One of the problems with intrinsic metrics is that they don’t mean a lot unless a benchmark score exists. For instance, hearing that the MAPE is 5% doesn’t tell us much because we don’t know how forecastable that time series is. Maybe 5% is a bad error rate for it. Relative error solves this by including a benchmark forecast in the calculation, so that the forecast we are measuring is evaluated against the benchmark, showing the relative gain over it. For this, we need to extend the notation we have established.
Let $b_t$ be the forecast from the benchmark and $e_t^b = y_t - b_t$ be the benchmark error. There are two ways we can include the benchmark in the metric: we can take the ratio of the two errors at each timestep and aggregate those ratios (as MRAE does), or we can compute an aggregate metric separately for the forecast and the benchmark and take the ratio of the two (as RelMAE and RelRMSE do).
Let’s look at a few metrics which follow these:
Hyndman and Koehler introduced the idea of scaled error in 2006 as an alternative to relative error measures, trying to overcome some of their drawbacks, chiefly the subjectivity of choosing the benchmark forecast. Scaled error scales the forecast error using the in-sample MAE of a benchmark method, such as naïve forecasting. Let the entire training history be of $T$ timesteps, indexed by $t = 1, \dots, T$.
So, the scaled error is defined as follows:

$$q_t = \frac{e_t}{\frac{1}{T-1}\sum_{i=2}^{T} \left| y_i - y_{i-1} \right|}$$

Here, the denominator is the in-sample MAE of the one-step naïve forecast.
There are a couple of metrics that adopt this principle:
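As an illustration of a metric built on scaled error, here is a minimal sketch of MASE (the function name and signature are ours for illustration, not the API of any library):

```python
import numpy as np

def mase(y_train, y_test, y_pred, m=1):
    """Mean absolute scaled error - a sketch of the formula.
    The out-of-sample MAE is scaled by the in-sample MAE of an
    m-step (seasonal) naive forecast."""
    y_train = np.asarray(y_train, float)
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    errors = np.abs(np.asarray(y_test, float) - np.asarray(y_pred, float))
    return np.mean(errors) / naive_mae

# MASE < 1 means the forecast beats the in-sample naive benchmark
print(mase(np.arange(10.0), [10.0, 11.0], [10.5, 11.5]))  # 0.5
```
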
There are other extrinsic metrics that don’t fall into the categorization of errors we have made. One such error measure is the following:
Percent Better (PB) is a method that is based on counts and can be applied to individual time series as well as a dataset of time series. The idea is to use a benchmark method, count how many times a given method beats the benchmark, and report that count as a percentage. Formally, we can define it using MAE as the reference error as follows:

$$\text{PB(MAE)} = \frac{100}{M}\sum_{m=1}^{M} \mathbb{1}\left\{\text{MAE}_m < \text{MAE}_m^b\right\}$$

Here, $\text{MAE}_m^b$ is the MAE of the benchmark method on time series $m$.
Here, $\mathbb{1}\{\cdot\}$ is an indicator function that returns 1 if the condition is true and 0 otherwise.
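A minimal sketch of PB, assuming we already have per-series MAEs for the method and the benchmark (the function name is ours, purely for illustration):

```python
import numpy as np

def percent_better(method_errors, benchmark_errors):
    # Fraction of series where the method's error beats the benchmark's,
    # reported as a percentage
    method_errors = np.asarray(method_errors, float)
    benchmark_errors = np.asarray(benchmark_errors, float)
    return 100 * np.mean(method_errors < benchmark_errors)

# e.g. per-series MAEs of a model vs a naive benchmark on three series:
# the method wins on two of the three series
print(percent_better([1.0, 2.0, 0.5], [1.5, 1.5, 1.0]))
```
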
We have seen a lot of metrics in the previous sections, but now it’s time to understand a bit more about the way they work and what they are suited for.
It’s not enough to know the different metrics; we also need to understand how they work, what they are good for, and what they are not good for. We can start with the base errors and work our way up, because most of the other metrics are derivatives of these primary errors – either aggregating them or comparing them against benchmarks – so understanding the properties of absolute error, squared error, percent error, and symmetric error will help us understand the rest.
Let’s do this investigation using a few experiments and understand them through the results.
Notebook alert
The notebook for running these experiments on your own is 01-Loss Curves and Symmetry.ipynb in the Chapter18 folder.
All these base errors depend on two factors – the forecast and the actual observation. We can examine the behavior of the errors if we fix one and vary the other across a symmetric range of potential errors. The expectation is that the metric behaves the same way on both sides, because a deviation from the actual observation on either side should be penalized equally by an unbiased metric. We can also swap the forecast and the actual observation, which should likewise leave the metric unchanged.
In the notebook, we did exactly these experiments – loss curves and complementary pairs.
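The loss-curve part of the experiment can be sketched in a few lines of NumPy (fixing the actual at 10 is an arbitrary illustrative choice, not the notebook’s exact setup):

```python
import numpy as np

# Hold the actual fixed at 10 and sweep the signed error symmetrically
actual = 10.0
signed_error = np.linspace(-9, 9, 181)
forecast = actual - signed_error  # since error = actual - forecast

abs_err = np.abs(actual - forecast)
sq_err = (actual - forecast) ** 2
pct_err = np.abs(actual - forecast) / np.abs(actual)
sym_err = 2 * np.abs(actual - forecast) / (np.abs(actual) + np.abs(forecast))

# Absolute, squared, and (with the actual held constant) percent errors are
# mirror images around zero error, but the symmetric error is not:
print(sym_err[signed_error == -5.0], sym_err[signed_error == 5.0])
```
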
When we plot these for absolute error, we get Figure 18.2:
Figure 18.2 – The loss curves and complementary pairs for absolute error
The first chart plots the signed error against the absolute error, and the second plots the absolute error for all combinations of actuals and forecasts that add up to 10. The two charts are clearly symmetrical, which means that an equal deviation from the actual observation on either side is penalized equally, and that swapping the actual observation and the forecast leaves the metric unchanged.
Now, let’s look at squared error:
Figure 18.3 – The loss curves and complementary pairs for squared error
These charts also look symmetrical, so the squared error doesn’t have an issue with asymmetry either – but we can notice one thing here. The squared error grows quadratically as the error increases. This points to a property of the squared error – it gives undue weight to outliers. If the forecast is really bad at a few timesteps and excellent everywhere else, the squared error inflates the impact of those outlying errors.
Now, let’s look at percent error:
Figure 18.4 – The loss curves and complementary pairs for percent error
There goes our symmetry. The percent error is symmetrical when you move away from the actuals on either side (mostly because we are keeping the actuals constant), but the complementary pairs tell a whole different story. When the actual is 1 and the forecast is 9, the percent error is 8, but when we swap them, it drops to about 0.89. This kind of asymmetry causes the metric to favor under-forecasting. The right half of the second chart in Figure 18.4 covers the cases where we are under-forecasting, and we can see that the error there is very low compared to the left half.
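We can verify this asymmetry with a couple of lines of Python (the helper below is a plain implementation of absolute percent error as a ratio, not the notebook’s code):

```python
import numpy as np

def pct_error(actual, forecast):
    # absolute percent error, expressed as a ratio
    return np.abs(actual - forecast) / np.abs(actual)

# A complementary pair from Figure 18.4: actual and forecast add up to 10
print(pct_error(1, 9))  # over-forecast:  8.0
print(pct_error(9, 1))  # under-forecast: ~0.89
```

Swapping the roles of actual and forecast changes the error by almost an order of magnitude, which is exactly the asymmetry the chart shows.
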
We will look at under- and over-forecasting in detail in another experiment.
For now, let’s move on and look at the last error we had – symmetric error:
Figure 18.5 – The loss curves and complementary pairs for symmetric error
Symmetric error was proposed mainly because of the asymmetry we saw in the percent error. MAPE, which uses percent error, is one of the most popular metrics used and sMAPE was proposed to directly challenge and replace MAPE – true to its claim, it did resolve the asymmetry that was present in percent error. However, it introduced its own asymmetry. In the first chart, we can see that for a particular actual value, if the forecast moves on either side, it is penalized differently, so in effect, this metric favors over-forecasting (which is in direct contrast to percent error, which favors under-forecasting).
With all the intrinsic measures done, we can also take a look at the extrinsic ones. With extrinsic measures, plotting the loss curves and checking symmetry is not as easy. Instead of two variables, we now have three – the actual observation, the forecast, and the reference forecast; the value of the measure can vary with any of these. We can use a contour plot for this as shown in Figure 18.6:
Figure 18.6 – Contour plot of the loss surface – relative absolute error and absolute scaled error
The contour plot lets us show three dimensions in a 2D plot. Two dimensions (the error and the reference forecast) are on the x and y axes. The third dimension (the relative absolute error or absolute scaled error value) is represented as color, with contour lines bordering same-colored areas. The errors are symmetric about zero along the error (horizontal) axis. This means that if we keep the reference forecast constant and vary the error, both measures respond equally on either side. This is not surprising, since both errors are built on the absolute error, which we know is symmetric.
The interesting observation is the dependency on the reference forecast. For the same error, relative absolute error takes different values for different reference forecasts, but scaled error doesn’t have this problem. This is because it does not depend directly on the reference forecast; instead, it uses the in-sample MAE of a naïve forecast, which is fixed for a given time series and eliminates the task of choosing a reference forecast. Therefore, scaled error inherits the symmetry of absolute error and has a fixed rather than varying dependency on the benchmark.
We have seen indications of bias toward over- or under-forecasting in a few of the metrics. In fact, it looked like the popular MAPE favors under-forecasting. To finally put that to the test, we can perform another experiment with synthetically generated time series, this time including many more metrics, so that we know which are safe to use and which need to be looked at carefully.
Notebook alert
The notebook to run these experiments on your own is 02-Over and Under Forecasting.ipynb in the Chapter18 folder.
The experiment is simple and detailed as follows:
- Generate a random time series of actuals: np.random.randint(2,5,n)
- Generate a baseline forecast from the same range: np.random.randint(2,5,n)
- Generate a forecast shifted downward: np.random.randint(0,4,n) # Underforecast
- Generate a forecast shifted upward: np.random.randint(3,7,n) # Overforecast
- Repeat this 10,000 times and calculate each metric for the three forecasts
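A condensed sketch of the experiment for two of the metrics, MAPE and sMAPE, might look as follows (fewer runs than the chapter’s 10,000, and hand-rolled metric implementations, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, runs = 100, 1000  # the chapter uses 10,000 runs; fewer here for speed

def mape(y, y_hat):
    # mean absolute percent error, in percent
    return np.mean(np.abs((y - y_hat) / y)) * 100

def smape(y, y_hat):
    # symmetric MAPE, in percent
    return np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))) * 100

results = {"mape": {"under": [], "over": []},
           "smape": {"under": [], "over": []}}
for _ in range(runs):
    actuals = rng.integers(2, 5, n)  # same range as np.random.randint(2,5,n)
    under = rng.integers(0, 4, n)    # shifted down -> under-forecast
    over = rng.integers(3, 7, n)     # shifted up   -> over-forecast
    results["mape"]["under"].append(mape(actuals, under))
    results["mape"]["over"].append(mape(actuals, over))
    results["smape"]["under"].append(smape(actuals, under))
    results["smape"]["over"].append(smape(actuals, over))

# Despite the two shifted forecasts having the same expected absolute error,
# MAPE punishes the over-forecast more, while sMAPE punishes the under-forecast
print(np.mean(results["mape"]["under"]), np.mean(results["mape"]["over"]))
print(np.mean(results["smape"]["under"]), np.mean(results["smape"]["over"]))
```
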
After the experiment is done, we can plot a box plot of different metrics so that it shows the distribution of each metric for each of those three forecasts over these 10,000 runs of the experiment. Let’s see the box plot in Figure 18.7:
Figure 18.7 – Over- and under-forecasting experiment
Let’s first go over what we would expect from this experiment: the over-forecast (green) and under-forecast (red) would both have a higher error than the baseline (blue), and the over- and under-forecast errors would be similar to each other.
With that, let’s summarize our major findings:
We have investigated a few properties of different error measures and understood the basic properties of some of them. To further that understanding and move closer to helping us select the right measure for our problem, let’s do one more experiment using the London Smart Meters dataset we have been using through this book.
As we discussed earlier, there are a lot of metrics for forecasting that people have come up with over the years. Although there are many different formulations of these metrics, there can be similarities in what they are measuring. Therefore, if we are going to choose a primary and secondary metric while modeling, we should pick some metrics that are diverse and measure different aspects of the forecast.
Through this experiment, we are going to try and figure out which of these metrics are similar to each other. We are going to use the subset of the London Smart Meters dataset we have been using all through the book and generate some forecasts for each household. I have chosen to do this exercise with the darts library because I wanted multi-step forecasting. I’ve used five different forecasting methods – seasonal naïve, exponential smoothing, Theta, FFT, and LightGBM (local) – and generated forecasts. On top of that, I have also calculated the following metrics on all of these forecasts – MAPE, WAPE, sMAPE, MAE, MdAE, MSE, RMSE, MRAE, MASE, RMSSE, RelMAE, RelRMSE, RelMAPE, CFE, Forecast Bias, and PB(MAE). In addition to this, we also calculated a few aggregate metrics – meanMASE, meanRMSSE, meanWAPE, meanMRAE, AvgRelRMSE, ND, and NRMSE.
The basis of the experiment is that if different metrics measure the same underlying factor, they will also rank the forecasts for different households similarly. For instance, if MAE and MASE measure the same latent property of the forecast, those two metrics will rank the households similarly. At the aggregate level, the same logic applies across the five models: aggregate metrics that measure the same underlying latent factor should rank the models in similar ways.
Let’s look at the aggregate metrics first. We ranked the different forecast methods at the aggregate level using each of the metrics and then calculated the Pearson correlation of those ranks – that is, Spearman’s rank correlation across the forecast methods for each pair of metrics. The heatmap of the correlation matrix is in Figure 18.8:
Figure 18.8 – Spearman’s rank correlation between the forecast methods and aggregate metrics
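The ranking-correlation calculation itself is straightforward; here is a sketch with made-up metric values (the numbers are hypothetical, and Spearman’s correlation is computed by hand as the Pearson correlation of the ranks, exactly as described above):

```python
import numpy as np

def ranks(x):
    # rank the values (1 = smallest); ties are ignored - fine for a sketch
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman(a, b):
    # Spearman's rank correlation is the Pearson correlation of the ranks
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

# Hypothetical metric values for five forecasting methods (made-up numbers;
# in the chapter, these come from the real experiment)
mae_scores = [12.0, 9.5, 10.1, 14.2, 8.7]
rmse_scores = [15.3, 11.2, 12.8, 18.0, 10.1]
mape_scores = [22.0, 35.0, 18.5, 27.0, 30.2]

print(spearman(mae_scores, rmse_scores))  # identical rankings -> 1.0
print(spearman(mae_scores, mape_scores))  # different rankings -> lower
```
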
These are the major observations:
Similarly, we calculated Spearman’s rank correlation between the forecast methods and metrics across all the households (Figure 18.9). This enables us to have the same kind of comparison as before at the item level:
Figure 18.9 – Spearman’s rank correlation between the forecast methods and item-level metrics
The major observations are as follows:
Important note
Spearman’s rank correlation on aggregate metrics is done using a single dataset and has to be taken with a grain of salt. The item-level correlation has a bit more significance because it is made across many households, but there are still a few things in there that warrant further investigation. I urge you to repeat this experiment on some other datasets and check whether we see the same patterns repeated before adopting them as rules.
Now that we have explored the different metrics, it is time to summarize and probably leave you with a few guidelines for choosing a metric.
Throughout this chapter, we have come to understand that it is difficult to choose one forecast metric and apply it universally. There are advantages and disadvantages for each metric and being cognizant of these while selecting a metric is the only rational way to go about it.
Let’s summarize and note a few points we have seen through different experiments in the chapter:
Hewamalage et al. (#1 in References) have proposed a very detailed flowchart to aid in decision-making, but that is also more of a guideline as to what not to use. The selection of a single metric is a very debatable task. There are a lot of conflicting opinions out there and I’m just adding another to that noise. Here are a few guidelines I propose to help you pick a forecasting metric:
Congratulations on finishing a chapter full of new terms and metrics and I hope you have gained the necessary intuition to intelligently select the metric to focus on for your next forecasting assignment!
In this chapter, we looked at the thickly populated and highly controversial area of forecast metrics. We started with a basic taxonomy of forecast measures to help you categorize and organize all the metrics in the field.
Then, we ran a few experiments through which we learned about the different properties of these metrics, slowly building a better understanding of what they measure. By looking at synthetic time series experiments, we learned how MAPE and sMAPE favor under- and over-forecasting, respectively.
We also analyzed the rank correlations between these metrics on real data to see how similar the different metrics are and finally, rounded off by laying out a few guidelines that can help you pick a forecasting metric for your problem.
In the next chapter, we will look at cross-validation strategies for time series.
The following are the references that we used throughout the chapter:
If you wish to read further about forecast metrics, you can check out the blog post Forecast Error Measures: Intermittent Demand by Manu Joseph – https://deep-and-shallow.com/2020/10/07/forecast-error-measures-intermittent-demand/.