The test of all knowledge is experiment. Experiment is the sole judge of scientific truth.
—RICHARD FEYNMAN, FEYNMAN LECTURES ON PHYSICS
When I conduct our webinar and seminar training, I like to divide the content into takeaway and aspirational. The former are methods the attendees should be able to immediately apply by the end of the training without further assistance. The basic one-for-one substitution described in chapters 4 and 11 is the takeaway method. This is the one of the simplest quantitative risk analysis methods possible. It involves just eliciting quantitatively explicit estimates of the likelihood and impact of events, the appetite for risk, and the effects of controls. Only the simplest empirical methods have been introduced and most estimates will rely on calibrated experts. This is the starting point from which we will improve incrementally as you see fit.
This chapter will provide a high-level introduction of just some of the aspirational issues of quantitative risk management. Depending on your needs and your comfort level with these techniques, you may want to introduce these ideas soon. The objective here is to teach you at least enough that you know these methods exist so you can aspire to add them and more in the future. The ideas we will introduce now are as follows:
The survey of Monte Carlo modelers (see chapter 10) found that of the seventy-two models I reviewed, only three (4 percent) actually conducted some original empirical measurement to reduce uncertainty about a particular variable in a model. This is far short of what it should be.
Should more measurements have been attempted? Would such efforts have been justified? Which specific measurements are justified? Those questions can be answered. As previously discussed, there is a method for computing the value of additional measurements and we have been applying that routinely on many models.
Of the well-over 150 models I've completed up to this point, only three have shown that further measurement was not needed. In other words, I'm finding that additional empirical measurement is easily justified at least 97 percent of the time, not 4 percent of the time. Clearly, the lack of empirical methods is a key weakness of many quantitative models. Here, we will quickly discuss the concept behind computing the value of information and some key ideas for conducting further measurements for risk analysis.
As discussed in chapter 10, the expected value of perfect information (EVPI) for some uncertain decisions is equal to the expected opportunity loss (EOL) of the decision. This means that the most you should be willing to pay for uncertainty reduction about a decision is the cost of being wrong times the chance of being wrong.
The EVPI is handy to know, but we don't usually get perfect information. So we could estimate the expected value of information (EVI). This is the EVPI without the perfect. This is equal to how much the EOL will be reduced by a measurement. I won't go into further detail here other than to direct the curious reader to the www.howtomeasureanything.com/riskmanagement site for an example.
The systematic application of this is why I refer to the method I developed as applied information economics. The process is fairly simple to summarize. Just get your Monte Carlo tool ready, calibrate your estimators, and follow these five steps:
This is very different from what I see many modelers do. In step 2, they almost never calibrate, so that is one big difference. Then they hardly ever execute steps 3 and 4. And the results of these steps are what guides so much of my analysis that I don't know what I would do without them. It is for lack of conducting these steps that I find that analysts, managers, and SMEs of all stripes will consistently underestimate both the extent of their current uncertainty and the value of reducing that uncertainty. They would also likely be unaware of the consequences of the measurement inversion described in chapter 10. This means that because they don't have the value of information to guide what should be measured, if they did measure something, it is likely to be the wrong thing.
The reasons for underestimating uncertainty and its effects on decisions are found in the research we discussed previously. JDM researchers observed the extent of our overconfidence in estimates of our uncertainty. And Daniel Kahneman observes of decision-makers that when they get new information, they forget how much they learned. They forget the extent of their prior uncertainty and think that the new information was no big surprise (i.e., “I knew it all along”). Researchers have also found that decision-makers will usually underestimate how much can be inferred from a given amount of information.
I once had a related conversation with a very experienced operations research expert, who had said he had extensive experience with both Monte Carlo models and large law enforcement agencies. We were discussing how to measure the percentage of traffic stops that inadvertently released someone with a warrant for his or her arrest under some other jurisdiction. They needed to quantify a way to justify investments in building information technology that would enable easier communication among all participating law enforcement agencies.
After I explained my approach, he said, “But we don't have enough data to measure that.” I responded, “But you don't know what data you need or how much.” I went on to explain that his assumption that there isn't enough data should really be the result of a specific calculation. Furthermore, he would probably be surprised at how little additional data would be needed to reduce uncertainty about this highly uncertain variable.
I've heard this argument often from a variety of people and then subsequently disproved it by measuring the very item they thought could not be measured. It happened so often I had to write How to Measure Anything: Finding the Value of Intangibles in Business, just to explain all the fallacies I ran into and what the solutions were. Here are a couple of key concepts from that book that risk managers should keep in mind:
In risk analysis, it is often important to assess the likelihood of relatively uncommon but extremely costly events. The rarity of such events—called catastrophes or disasters—is part of the problem in determining their likelihood. The fact that they are rare might make it seem problematic for statistical analysis. But there are solutions even in these cases. For example, the previously mentioned Bayes' theorem is a simple but powerful mathematical tool and should be a basic tool for risk analysts to evaluate such situations. From what I can tell it is used much less frequently by quantitative modelers than it should be.
Bayes' theorem is a way to update prior knowledge with new information. We know from calibration training that we can always express our prior state of uncertainty for just about anything. What seems to surprise many people is how little additional data we need to update prior knowledge to an extent that would be of value.
One area of risk analysis in which we don't get many data points is in the failure rates of new vehicles in aerospace. If we are developing a new rocket, we don't know what the rate of failure might be—that is, the percentage of times the rocket blows up or otherwise fails to get a payload into orbit. If we could launch it a million times, we would have a very precise and accurate value for the failure rate. Obviously, we can't do that because, as with many problems in business, the cost per sample is just too high.
Suppose your existing risk analysis (calibrated experts) determined that a new rocket design had an 80 percent chance of working properly and a 20 percent chance of failure. But you have another battery of tests on components that could reduce your uncertainty. These tests have their own imperfections, of course, so passing them is still no guarantee of success. We know that in the past, other systems that failed in the maiden test flight had also failed this component testing method 95 percent of the time. Systems that succeeded on the launch pad had passed the component testing 90 percent of the time. If you get a good test result on the new rocket, what is the chance of success on the first flight? Follow this using the notation introduced in the previous chapter:
In other words, a good test means we can go from 80 percent certainty of launch success to 98.6 percent certainty. We started out with the probability of a good test result given a good rocket and we get the probability of a good rocket given a good test. That is why this is called an inversion using Bayes' theorem, or a Bayesian inversion.
Now, suppose we don't even get this test. (How did we get all the data for past failures with the test, anyway?) All we get is each actual launch as a test for the next one. And suppose we were extremely uncertain about this underlying failure rate. In fact, let's say that all we know is that the underlying failure rate (what we could measure very accurately if we launched it a million times) is somewhere between 0 percent and 100 percent. Given this extremely high uncertainty, each launch starting with the first launch tells us something about this failure rate.
Our starting uncertainty gives every percentile increment in our range equal likelihood. An 8–9 percent base failure rate is a 1 percent chance, a 98–99 percent failure rate is a 1 percent chance, and so on for every other value between 0 and 100 percent. Of course, we can easily work out the chance of a failure on a given launch if the base failure rate is 77 percent—it's just 77 percent. What we need now is a Bayesian inversion so we can compute the chance of a given base rate given some actual observations. I show a spreadsheet at www.howtomeasureanything.com that does this Bayesian inversion for ranges. In exhibit 12.1, you can see what the distribution of possible base failure rates looks like after just a few launches.
Exhibit 12.1 shows our estimate of the baseline failure rates as a probability density function (pdf). The area under each of these curves has to add up to 1, and the higher probabilities are where the curve is highest. Even though our distribution started out uniform (the flat dashed line), where every baseline failure rate was equally likely, even the first launch told us something. After the first launch, the pdf shifted to the left.
Here's why. If the failure rate were really 99 percent, it would have been unlikely, but not impossible, for the first launch to be a success. If the rate were 95 percent, it would have still been unlikely to have a successful launch on the first try, but a little more likely than if it were 99 percent. At the other end of the range, an actual failure rate of 2 percent would have made a success on the first launch very likely.
We know the chance of launch failure given a particular failure rate. We just applied a Bayesian inversion to get the failure rate given the actual launches and failures. It involves simply dividing up the ranges into increments and computing a Bayesian inversion for each small increment in the range. Then we can see how the distribution changes for each observed launch.
Even after just a few launches, we can compute the chance of seeing that result given a particular failure probability. There are two ways to compute this. One is to compute the probability of seeing a given number of failures given an assumed failure rate and then use a Bayesian inversion to compute the probability of that failure rate given the observed failures.
We compute the first part of this with a calculation in statistics called the binomial distribution. In Excel the binomial distribution is written simply as =binomdist(S,T,P,C), where S = number of successes, T = number of trials (launches), P = probability of a success, and C is an indicator telling Excel whether you want it to tell you the cumulative probability (the chance of every number of successes up to that one) or just the individual probability of that result (we set it to the latter). After five launches we had one failure. If the baseline probability were 50 percent, I would find that one failure after five launches would have a 15.6 percent chance of occurrence. If the baseline failure were 70 percent, this result would have only a 3 percent chance of occurrence. As with the first launch, the Bayesian inversion is applied here to each possible percentile increment in our original range.
Because this starts out with the assumption of maximum possible uncertainty (a failure rate anywhere between 0 and 100 percent), this is called a robust Bayesian method. So, when we have a lot of uncertainty, it doesn't really take many data points to improve it—sometimes just one.
There is an even simpler way to do this using a method we briefly mentioned already: the beta distribution. In the previous chapter we showed how using just the mean of the beta distribution can estimate the chance that the next marble drawn from an urn would be red given the previously observed draws. Recall from chapter 11 that this is what Laplace called the rule of succession. It works even if there are very few draws or if you have many draws but no observed red marbles yet. This approach could serve as a baseline for a widely variety of events where data is scarce.
Now, suppose that instead of estimating the chance that the next draw is a red marble, we want to estimate a population proportion of red marbles given a few draws. The beta distribution uses two parameters with rather opaque names: alpha and beta. But you could think of them as hits and misses. We can also use alpha or beta to represent our prior uncertainty. For example, if we set alpha and beta to 1 and 1, respectively, the beta distribution would have a shape equal to the uniform distribution from 0 percent to 100 percent, where all values in between are equally likely. Again, this is the robust prior or uninformative prior in which we pretend we know absolutely nothing about the true population proportion. That is, all we know is that a population proportion has to be somewhere between 0 percent and 100 percent but other than that we have no idea what it could be. Then if we start sampling from some population, we add hits to alpha and misses to beta.
Suppose we are sampling marbles from an urn and we think of a red marble as a hit and green marbles as a miss. Also, let's say we start with the robust prior for the proportion of marbles that are red—it could be anything between 0 percent and 100 percent with equal probability. As we sample, each red marble adds one to our alpha and each green adds one to our beta. If we wanted to show a 90 percent confidence interval of a population proportion using a beta distribution, we would write the following in Excel:
If we sampled just four marbles and only one of those were red, the 90 percent confidence interval for the proportion of red marbles in the urn is 7.6 percent to 65.7 percent. If after sampling twelve marbles only two were red, our 90 percent confidence interval narrows to just 6 percent to 41 percent.
It can be shown that the beta distribution will produce the same answer as the method using the binomial with the Bayesian inversion. You can see this example in chapter 12 spreadsheet on the website.
For the truly disastrous events, we might need to find a way to use even more data than the disasters themselves. We might want to use near misses, similar to what Robin Dillon-Merrill researched (see chapter 8). Near misses tend to be much more plentiful than actual disasters and, because at least some disasters are caused by the same events that caused the near miss, we can learn something from them. Near misses could be defined in a number of ways, but I will use the term very broadly. A near miss is an event in which the conditional probability of a disaster given the near miss is higher than the conditional probability of a disaster without a near miss. In other words, P(disaster|with near miss) > P(disaster|without near miss).
For example, the failure of a safety inspection of an airplane could have bearing on the odds that a disaster would happen if corrective actions were not taken. Other indicator events could be an alarm that sounds at a nuclear power plant, a middle-aged person experiencing chest pains, a driver getting reckless-driving tickets, or component failures during the launch of the Space Shuttle (e.g., O-rings burning through or foam striking the Orbiter during launch). Each of these could be a necessary, but usually not sufficient, factor in the occurrence of a catastrophe of some kind. An increase in the occurrence of near misses would indicate an increased risk of catastrophe, even though there are insufficient samples of catastrophes to detect an increase based on the number of catastrophes alone.
As in the case of the overall failure of a system, each observation of a near miss or lack thereof tells us something about the rate of near misses. Also, each time a disaster does or does not occur when near misses do or do not occur tells us something about the conditional probability of the disaster given the near miss. To analyze this, I applied a fairly simple robust Bayesian inversion to both the failure rate of the system and the probability of the near miss. When I applied this to the Space Shuttle, I confirmed Dillon-Merrill's findings that it was not rational for NASA managers to perceive a near miss to be evidence of resilience. Remember, when they saw a near miss followed by the safe return of the crew, they concluded that the system must be more robust than they thought.
Exhibit 12.2 shows the probability of a failure on each launch given that every previous launch was a success but a near miss occurred on every launch (as was the case with the foam falling off the external tank). I started with the prior knowledge stated by some engineers that they should expect one launch failure in fifty launches. (This is more pessimistic than what Feynman found in his interviews.) I took this 2 percent failure rate as merely the expected value in a range of possible baseline failure rates. I also started out with maximum initial uncertainty about the rate of the near misses. Finally, I limited the disaster result to those situations when the event that allowed the near miss to occur was a necessary precondition for the disaster. In other words, the examined near miss was the only cause of catastrophe. These assumptions are actually more forgiving to NASA management.
The chart shows the probability of a failure on the first launch (starting out at 2 percent) and then updates it on each observed near miss with an otherwise successful flight. As I stated previously, I started with the realization that the real failure rate could be much higher or lower. For the first ten flights, the adjusted probability of a failure increases, even under these extremely forgiving assumptions. Although the flight was a success, the frequency of the observed near misses makes low-baseline near-miss rates less and less likely (i.e., it would be unlikely to see so many near misses if the chance of a near miss per launch were only 3 percent, for example). In fact, it takes about fifty flights for the number of observed successful flights to bring the adjusted probability of failure back to what it was.
But, again, this is too forgiving. The fact is that until they observed these near misses they acted as if they had no idea they could occur. Their assumed rate of these near misses was effectively 0 percent—and that was conclusively disproved on the first observation.
The use of Bayesian methods is not reserved for special situations. I consider them a tool of first resort for most measurement problems. The fact is that almost all real-world measurement problems are Bayesian. That is, you knew something about the quantity before (a calibrated estimate, if nothing else), and new information updates that prior knowledge. It is used in clinical testing for life-saving drugs for the same reason it applies to most catastrophic risks—it is a way to get the most uncertainty reduction out of just a few observations.
The development of methods for dealing with near misses will greatly expand the data we have available for evaluating various disasters. For many disasters, Bayesian analysis of near misses is going to be the only realistic source of measurement.
In chapter 11 we discussed how decomposition and a little bit of Monte Carlo simulation helps estimates. Bear in mind that we could always add more decomposition detail to a model. Models will never be a perfectly accurate representation of what is being modeled. If they were, we wouldn't call them models; we would call them reality. So the objective of improving a model is not infinite detail but only to find marginal improvements that make the model incrementally more realistic and less wrong. As long as the value of being less wrong (the information value) exceeds the cost of the additional complexity, then it is justified. Otherwise, it is not.
In the one-for-one substitution model, we can think of decompositions in three dimensions: vertical, horizontal, and Z. This classification of decompositions is convenient given the structure of a spreadsheet, and they naturally allow for different levels of effort and complexity.
This is the simplest decomposition. It involves only replacing a single row with more rows. For example, if we have a risk that a data breach is 90 percent likely per year and that the impact is $1,000 to $100 million, we may be putting too much under the umbrella of a single row in the model. Such a wide range probably includes frequent but very small events versus infrequent but serious events. In this case, we may want to represent this risk with more than a single row. Perhaps we differentiate the events by those that involve a legal penalty in one row and those that do not in another. We might say that data breaches resulting in a legal penalty have a likelihood of 5 percent per year with a range of $2 million to $100 million. Smaller events happen almost every year but their impact is, say, $1,000 to $500,000. All we did for this decomposition was add new data. No new calculations are required.
If you were starting with the one-for-one substitution model, a horizontal decomposition involves adding new columns and new calculations. For example, you might compute the impact as a function of several other factors. One type of impact might be related to disruption of operations. For that impact, you might decompose the calculation as follows:
Furthermore, cost per hour could be further decomposed as revenue per hour, profit margin, and unrecoverable revenue. In a similar way, you could decompose the cost of a data breach as being a function of the number of records compromised and a cost per record including legal liabilities.
When adding new variables such as these, you will need to add the following:
Some decompositions will involve turning one row in the original risk register into an entirely separate model, perhaps another tab on the same spreadsheet. The estimates usually directly input into the risk register are now pulled from another model. For example, we may create a detailed model of a data breach due to third-party/vendor errors. This may include a number of inputs and calculations that would be unique to that set of risks. The new worksheet could list all vendors in a table. The table could include columns with information for each vendor, such as what type of services they provide, how long they have been an approved vendor, what systems and data they have access to, their internal staff vetting procedures and so on. That information could then be used to compute a risk of the breach event for each vendor individually. The output of this model is supplied to the original risk register. The simulated losses that would usually be computed on the risk register page are just the total losses from this new model.
The Z decomposition is essentially a new model for a very specific type of risk. This is a more advanced solution, but my staff and I frequently find the need to do just this.
In the simple one-for-one substitution model, we only worried about two kinds of distributions: binary and lognormal. What we are calling the binary distribution (what mathematicians and statisticians may also call the Boolean or Bernoulli distribution) only produces a 0 or 1. Recall from chapter 4 that this is to determine whether or not the event occurred. If the event occurred then we use a lognormal distribution to determine the size of the impact, expressed as a monetary loss. We use the lognormal because it naturally fits fairly well to various losses. Similar to a lognormal, losses can't be negative or zero (if the loss was zero we say the loss event didn't happen). It also has a long upper tail that can go far beyond the upper bound of a 90 percent confidence interval.
But we may have information that indicates that the lognormal isn't the best fit. When that is the case, then we might consider adding one of the following. Note that the random generators for these distributions are on the book website. (See exhibit 12.3 for illustrations of these distribution types.)
In addition to further decomposition and the use of better-fitting distributions, there are a few slightly more advanced concepts you may want to eventually employ. For example, if some events make other events more likely or exacerbate their impact, you may want to model that explicitly. One way this is done is to build correlations between events. This can be done using a correlation coefficient between two variables. For example, your sales are probably correlated to your stock prices. An example at www.howtomeasureanything.com/riskmanagement shows how this can be done.
But unless you have lots of historical data, you probably will have a hard time estimating a correlation. Correlations are not entirely intuitive metrics that calibrated estimators can approximate. You may be better off building these relationships explicitly as we first described in chapter 10. For example, in a major data breach, the legal liabilities and the cost of fraud detection for customers will be correlated, but that is because they are both partly functions of the number of records breached. If you decompose those costs, and they each use the same estimate for number of records breached, they will be correlated.
Furthermore, some events may actually cause other events, leading to cascade failures. These, too, can be explicitly modeled in the logic of a simulation. We can simply make it so that the probability of event A increases if event B occurred. Wildfires caused by downed power lines in California in 2018, for example, involved simultaneous problems exacerbated the impact of having to fight fires as well as dealing with a large population without power. Including correlations and dependencies between events will cause more extreme losses to be much more likely.
In chapter 11 we discussed how calibration training can improve the subjective estimates of subject matter experts. Most experts will start out extremely overconfident, but with training they can learn to overcome that. Observations will confirm that the experts will be right 90 percent of the time they say they are 90 percent confident.
There is more we can add to the expert's estimates. First, experts are not just overconfident, but highly inconsistent. Second, we may need to consider how to aggregate experts because not all experts are equally good.
In chapter 7, we described the work of a researcher in the 1950s, Egon Brunswik, who showed that experts are highly inconsistent in their estimates. For example, if they are estimating the chance of project failures based on descriptive data for each project (duration, project type, project manager experience, etc.) they will give different estimates for duplicates of the same project. If the duplicates are shown far apart enough in the list—say one is the eleventh project and the other is the ninety-second, experts who forget they already answered will usually give at least a slightly different answer the second time. We also explained that Brunswik showed how we can remove this inconsistency by building a model that predicts what the expert's judgment will be and that, by using this model, the estimates improve.
A simple way to reduce inconsistency is to average together multiple experts. JDM researchers Bob Clemen and Robert Winkler found that a simple average of forecasts can be better calibrated than any individual.1 This reduces the effect of a single inconsistent individual, but it doesn't eliminate inconsistency for the group, especially if there are only a few experts.
Another solution is to build a model to try to predict the subjective estimates of the experts. It may seem nonintuitive, but Brunswik showed how such a model of expert judgment is better than the expert from whom the data came, simply by making the estimates more consistent. You might recall from chapter 7 that Brunswik referred to this as his lens method.
Similar to most other methods discussed in this chapter, we will treat the lens method as an aspirational goal and skim over the details. But if you feel comfortable with statistical regressions, you can apply the lens method just by following these steps. We've modified it somewhat from Brunswik's original approach to account for some other methods we've learned about since Brunswik first developed this approach (e.g., calibration of probabilities).
In chapter 4 and in the lens model, we assume we are simply averaging the estimates of multiple experts. This approach alone (even without the lens method) usually does reduce some inconsistency. There are benefits to this approach but there is a better way.
A mathematician focusing on decision analysis who has researched this in detail is Roger Cooke.2 He conducted multiple studies, replicated by others, that showed that performance-weighted expert estimates were better than equally weighted expert estimates.3 and applied these methods to several types of assessments including risk assessments of critical infrastructures.4
Cooke and his colleagues would develop topic-specific calibration questions. Unlike the generic trivia calibration question examples we showed previously, this enabled Cooke to simultaneously measure not only the calibration for confidence of the expert but also to assess the general knowledge of the expert in that field. Consider, for example, two equally well calibrated experts but one provides ranges half as wide as the other. Because both experts were calibrated, both experts provided ranges that contained the answer 90 percent of the time. But because the latter expert provides narrower 90 percent confidence interval, she is assessed to have more knowledge than the other expert and, therefore, less uncertainty.
Cooke uses both calibration performance and the assessment of knowledge to compute a weight for each expert. My staff and I have the opportunity to use the performance-weighted methods with clients. What I find most interesting is how most experts end up with a performance weight of at or near zero with just a few—or maybe one—getting the majority weighting for the group.
I find this astonishing. Can it really be true that we can ignore half or more of experts' opinions in fields as diverse as operational risks, engineering estimates, terrorism risks, and more? Apparently so. This is consistent with the research of Philip Tetlock (see chapter 9). Even though Tetlock uses different scoring methods he still finds that a few “superforecasters” outperform the others by large margins.5 You might improve risk models a lot by finding the superforecasters in your team.
There are plenty of good Monte Carlo tools to pick from (see exhibit 12.4). The total number of users for all of the various Monte Carlo software tools is in the tens of thousands and chances are that somebody is already applying one of these tools to problems very similar to yours.
You probably already have a license for Excel and that will be good enough for starting out. But as you start developing more elaborate decompositions and using more advanced methods, you may find some other tools more suitable. Some of the tools listed in exhibit 12.4 are actually add-ins for Excel that add a lot more functionality. Others are entirely separate software packages, and some may be more suited to enterprise solutions. There are players in this field coming and going on a regular basis so treat this as a tentative list for starting your research.
EXHIBIT 12.4 Some Available Monte Carlo Tools
Tool | Made by | Description |
Excel | Microsoft (Free supporting templates by HDR) | Sufficient for the vast majority of users; HDR has developed free downloadable templates that use the HDR pseudo random number generator (PRNG) www.howtomeasureanything.com/riskmanagement |
SIPMath | Sam Savage, Probabilitymanagement.org | An intuitive and free set of Excel add-ins from the Savage's nonprofit that promotes standards in simulation tools; these tools use the Hubbard PRNG |
RiskLens | RiskLens | Developed primarily for cybersecurity with a decomposition already developed for that use |
Risk Solver Engine | Frontline Systems Incline Village, NV | Unique Excel-based development platform to perform interactive Monte Carlo simulation at unprecedented speed; support SIP and SLURP formats for probability management |
Analytica | Lumina Decision Systems, Los Gatos, CA | Uses an extremely intuitive graphical interface that enables complex systems to be modeled as a kind of flowchart of interactions; has a significant presence in government and environmental policy analysis |
R and RStudio | Open source, supported by R Foundation for Statistical Computing | Not just Monte Carlo; R is a popular open source stats programming language with a large user group; most users of R will use RStudio as a more intuitive interface; requires a bit more technical knowledge |
Crystal Ball | Oracle (Previously Decisioneering, Inc.) Denver, CO | Excel-based; a wide variety of distributions; a fairly sophisticated tool; broad user base and technical support |
@Risk | Palisade Corporation, Ithaca, NY | Excel-based tool; main competitor to Crystal Ball; many users and much technical support |
SAS | SAS Corporation, Raleigh, NC | Goes well beyond the Monte Carlo; extremely sophisticated package used by many professional statisticians |
SPSS | SPSS Inc., Chicago, IL | Goes far beyond the Monte Carlo; tends to be more popular among academics |
Mathematica | Wolfram Research, Champaign, IL | Extremely powerful tool that does much more than Monte Carlo; used primarily by scientists and mathematicians, but has applications in many fields |
Pelican | Vose Software, Ghent, Belgium | A Monte Carlo-based Enterprise Risk Management and risk governance tool |
Too often, a model of reality takes on a reality of its own. The users of the model are likely eventually to adopt it as the truth. The philosophers Plato (an idealist) and Benedict de Spinoza (a rationalist) were similar in that respect. That is, they believed that all knowledge had to come from their models (reason) alone—that everything we know could have been deduced without observation. In a way, they believed, the need to resort to observation merely shows the weaknesses and flaws in our ability to reason.
The philosopher David Hume, by contrast, was an empiricist. Empiricists doubt even the most rigorous, rational models and prefer to get their hands dirty with observation and real-world facts. They insist that reason alone would be insufficient for the basis of our knowledge even if our ability to reason were perfect. In fact, empiricists say that much of what we call reason could not have been known without observation first. But the best combination appears to be skill in theory and in observation. The Nobel Prize–winning physicists Enrico Fermi and Richard Feynman were competent theorists as well as consummate observers. They had a knack for checking just about any claim with observations much simpler than one might expect were necessary.
However, I would say that most modelers, whether they really thought about this or not, are closer to Plato and Spinoza than Hume. They believe that because they are using a sophisticated sounding method such as Monte Carlo that the model should work. They and their clients may feel more confident in their decisions having gone through this process. But because we know about the analysis placebo effect (described in previous chapters) we also know that their confidence in the method is an unreliable indicator of effectiveness.
Constantly testing your model by seeing how well it matches history is absolutely essential. This requires tracking and/or backtesting of models. Tracking is simply recording forecasts and comparing them with eventual outcomes. Look at a set of project cost estimates produced by a model. How well did they match reality? Backtesting is similar to tracking except that with backtesting the real-world events used to test the model may have predated when the model was built. In other words, you test the model against data you already had. For the same model which estimates project costs, you don't have to wait until current projects are done. You can look at past models. These tests are necessary to have any confidence in modeling at all. However, even though this testing is not difficult to implement, the survey of models in chapter 10 showed it is rarely done. There is simply very little incentive for analysts or management to go back and check models against reality.
We know from the research mentioned in the previous chapters that there is evidence that statistical models, decomposition, and the use of Monte Carlos improve estimates. But we know this not because the users merely felt better about their estimates. We know it because estimates were compared to actual outcomes in a large number of trials.
In that spirit, I try to routinely go back and check forecasts against reality. Of all the variables in all the models I've forecasted, I can easily get real data to compare to forecasts for only a small percentage, but it still adds up to more than two hundred variables in which we had initially estimated outcomes and confirmed outcomes to compare them to.
We can validate forecasts of binary probabilities in the same way we validated the results of true/false calibration tests in the previous chapter. That is, of all the times we said some event was about 10 percent likely, it should have occurred about 10 percent of the time. To validate ranges, we can apply a test that is just a bit more detailed than the test we used to validate 90 percent confidence intervals. To get a little bit more data out of the actual observations, I separated all the original range forecasts into multiple buckets:
The buckets are arbitrary and I could have defined a very different set. But, as with the true/false tests, things should be about as frequent as we estimate. In this case, the actual data should fall in these sections of a range about as often as would be implied by the forecasts. And ranges don't have to be all the same type. Some separation of buckets can always be done for any distribution of a range, no matter what its shape. No matter how I define a distribution, only 5 percent of the data should be above the 95 percentile of the forecast.
When I go back and look, I find that, within a statistically allowable error, the forecasts of most types of data were distributed as we expected. For those, the forecasting methods were working. However, in 2001, just as I was reaching thirty projects in my database, I was also finding that there were certain variables the calibrated estimators still didn't forecast well, even though they had shown their ability to assess probabilities in the calibration tests. The two areas where I found that even calibrated persons failed to perform well were the estimation of catastrophic project failures and business volumes.
In the case of project cancellations (stopping a project after beginning it), the assessed probabilities were far too low to account for the observed rate of cancellations. Calibrated estimators never gave a chance of cancellation to be above 10 percent and the average estimate is 5 percent. But observation of those same projects after they got started showed that the cancellation rate was six out of thirty, or 20 percent. According to the binomial distribution we discussed previously, it would be extremely unlikely, if there were only a 5 percent chance of cancellation per project, to see six cancellations out of thirty (less than one in one thousand). One of the cancellations was a project that was thought to have no chance of cancellation, which, of course, should be impossible.
I saw a similar inability for people not directly involved with tracking business volumes to forecast them well. For many of the operational investments I was assessing for companies, the return on investment had to be partly a function of sales. For example, if an insurance company was trying to increase the efficiency of new policy processing, the value had to be, in part, related to how many new policies were sold. At first, we would ask IT managers to estimate this and, if the information value justified it, we would do more research. Again, I found that too many of the actual observations of business volumes were ending up in the extremes of the original forecast.
The good news is that because of this tracking and testing, I found that the historical data for project cancellations and changes in business volumes was more reliable than the calibrated estimators—even if the data came from the industry and not that particular firm. And now we know better than to ask managers to estimate business volumes if tracking them is not in their own area of expertise. Furthermore, for every other kind of forecast I was asking estimators to make (e.g., project duration, productivity improvements, technology adoption rates, etc.), the calibrated estimator did as well as expected—about 90 percent of the observations fell within the originally stated 90 percent CI.
Using historical data to validate models and track forecasts to compare them to actual outcomes is fairly simple. Don't worry that you can't get all the data. Get what you can. Even though I had limited data, I still learned something useful from tracking results. Here are four things to keep in mind (if these items sound a little like what we saw previously, it is because the same principles apply):
The first thing you have to know is yourself. A man who knows himself can step outside himself and watch his own reactions like an observer.
—ADAM SMITH [pseudonym], The Money Game