
Chapter 5
The Ensemble Effect
Netflix, Crowdsourcing, and Supercharging Prediction

To crowdsource predictive analytics—outsource it to the public at large—a company launches its strategy, data, and research discoveries into the public spotlight. How can this possibly help the company compete? What key innovation in predictive analytics has crowdsourcing helped develop? Must supercharging predictive precision involve overwhelming complexity, or is there an elegant solution? Is there wisdom in nonhuman crowds?

Casual Rocket Scientists

A buddy and I are thinking of building a spaceship next year. The thing is, we have absolutely no training or background. But who cares? I want to go to outer space.

This may sound outlandish, but in the realm of predictive analytics (PA), it is essentially what Martin Chabbert and Martin Piotte did. In 2008, this pair of Montrealers launched a mission to win the $1 million Netflix Prize, the most high-profile analytical competition of its time. Incredibly, with no background in analytics, these casual part-timers became a central part of the story.

The movie rental company Netflix launched this competition to improve the movie recommendations it provides to customers. The company challenged the world by requiring that the winner improve upon Netflix's own established recommendation capabilities by 10 percent. Netflix is a prime example of PA in action, as a reported 70 percent of Netflix movie choices arise from its online recommendations. Product recommendations are increasingly important for the retail industry in general. More than a sales ploy, these tailored recommendations provide relevancy and personalization that customers actively seek.

  1. What's predicted: What rating a customer would give to a movie.
  2. What's done about it: Customers are recommended movies that they are predicted to rate highly.

PA contests such as the Netflix Prize leverage competitive spirit to garner scientific advancement. Like a horse race, a competition levels the playing field and unambiguously singles out the best entrant. With few limitations, almost anyone in the world—old or young, tall or short—can participate by downloading the data, forming a predictive model, and submitting.

It's winner take all. To ensure submissions are objectively compared, prediction competitions employ a clever trick: The competitor must submit not a predictive model, but its predictive scores, as generated for an evaluation data set within which the correct answers—the target values that the model is meant to infer—are withheld. Netflix Prize models predict how a customer would rate a movie (based on how he or she has rated other movies). The true ratings are suppressed in the publicly posted evaluation data, so submitters can't know exactly which examples they're getting right and which they're getting wrong at the time of submission. All said, to launch the competition, Netflix released to the public over 100 million ratings from some 480,189 customers (anonymized for privacy considerations, with names suppressed).1
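The blind-scoring trick can be sketched in a few lines. This is a minimal illustration, assuming submissions are scored by root-mean-square error (RMSE, where lower is better); the function name and the toy ratings are made up for the example and are not taken from the contest itself.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between submitted scores and the withheld answers."""
    if len(predicted) != len(actual):
        raise ValueError("a submission must cover every evaluation example")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data: the organizer keeps the true ratings private...
withheld_ratings = [4, 3, 5, 1, 2]
# ...and a competitor uploads only predicted ratings for the same cases.
submission = [3.8, 3.4, 4.5, 1.9, 2.2]

# Only this single number goes back to the competitor (lower is better).
score = rmse(submission, withheld_ratings)
print(f"leaderboard score: {score:.4f}")
```

Because the competitor sees one aggregate score rather than case-by-case feedback, there is no direct way to tell which individual predictions were right or wrong.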

The model's ability to predict is all that matters, not the modeler's background, experience, or academic pedigree. Such a contest is a hard-nosed, objective bake-off—whoever can cook up the solution that best handles the predictive task at hand wins kudos and, usually, cash.

Dark Horses

And so it was with our two Montrealers, Martin and Martin, who took the Netflix Prize by storm despite their lack of experience—or perhaps because of it. Neither had a background in statistics or analytics, let alone recommendation systems in particular. By day, the two worked in the telecommunications industry developing software.

But by night, the two-member team plugged away at home for 10 to 20 hours per week apiece, racing ahead in the contest under the team name PragmaticTheory. The “pragmatic” approach proved groundbreaking. The team wavered in and out of the number one slot; during the final months of the competition, the team was often in the top echelons.

There emerges an uncanny parallel to SpaceShipOne, the first privately funded human spaceflight, which won the $10 million Ansari X Prize. According to some, this small team, short on resources with a spend of only $25 million, put the established, gargantuan NASA to shame by doing more for so much less. PA competitions do for data science what the X Prize did for rocket science.

Mindsourced: Wealth in Diversity

[Crowdsourcing is] a perfect meritocracy, where age, gender, race, education, and job history no longer matter; the quality of the work is all that counts.

—Jeff Howe, Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business

When pursuing a grand challenge, from where will key discoveries appear? If we assume for the moment that one cannot know, there's only one place to look: everywhere. Contests tap the greatest resource, the general public. A common way to enact crowdsourcing, an open competition brings together scientists from far and wide to compete for the win and cooperate for the joy. With crowdsourcing, a company outsources to the world.

The $1 million Netflix Prize attracted a white-hot spotlight and built a new appreciation for the influence crowdsourcing holds to rally an international wealth of bright minds. In total, 5,169 teams formed to compete in this contest, submitting 44,014 entries by the end of the event.

PA crowdsourcing reaps the rewards brought by a diverse brainshare. Chris Volinsky, a member of a leading Netflix Prize team named BellKor from AT&T Research, put it to me this way: “From the beginning, I thought it was awesome how many people in the top of the leaderboard were what could be called ‘amateurs.’ In fact, our group had no experience with [product recommendations] when we started, either.…It just goes to show that sometimes it takes a fresh perspective from outside the field to make progress.”

One mysterious, highly competitive team came out of the woodwork, calling itself “Just a guy in a garage.” The team was anonymous but rose at one point to sixth place on the competition's leaderboard. When it later went public, it turned out to be a one-member team: a former management consultant who went to college for psychology and graduate school for operations research (he lists himself as unemployed and has revealed he was working out of a second bedroom in his house rather than an actual garage).

So too did our pair of dark horse laymen, team PragmaticTheory, circumvent established practices, effortlessly thinking outside a box they knew nothing of in the first place. Unbounded, they could boldly go...in new directions that no one had gone before. As Martin Chabbert told me in an interview, they “figured that a more pragmatic and less dogmatic approach might yield some good results.” Ironically, their competitive edge appeared to hinge less on scientific innovation and more on their actual expertise: adept software engineering. Martin provided this striking lesson:

Many people came up with (often good) ideas…but translating those words into a mathematical formula is the complicated part.…Our background in engineering and software was key. In this contest, there was a fine line between a bad idea and a bug in the code. Often you would think that the model was simply bad because it didn't yield the expected results, but in fact the problem was a bug in the code. Having the ability to write code with few bugs and the skill to actually find the bugs before giving up on the model is something that definitely helped a lot.…Compared to what most people think, this was more of an engineering contest than a mathematical contest.

Cross-discipline competitors thrive, as revealed by many PA contests beyond the Netflix Prize. One competition concerned with educational applications witnessed triumph by a particle physicist (Great Britain), a data analyst for the National Weather Service (Washington, D.C.), and a graduate student (Germany); $100,000 in prize money sponsored by the Hewlett Foundation (established by a founder of Hewlett-Packard) went to these winners, who developed the best means to automatically grade student-written essays. Their resulting system grades essays as accurately as human graders, although none of these three winners had backgrounds in education or text analytics.

And guess what kind of expert excelled at predicting the distribution of dark matter in the universe? Competing in a contest sponsored by NASA and the Royal Astronomical Society, Martin O'Leary, a British PhD student in glaciology, generated a method the White House announced has “outperformed the state-of-the-art algorithms most commonly used in astronomy.” For this contest, O'Leary provided the first major breakthrough (although he was not the eventual winner). As he explains it, aspects of his work mapping the edges of glaciers from satellite photos could extend to mapping galaxies as well.

Crowdsourcing Gone Wild

Given the right set of conditions, the crowd will almost always outperform any number of employees.

—Jeff Howe, Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business

The organizations I've worked with have mostly viewed the competition in business as a race that benefits from sharing, rather than a fight where one's gain can come only from another's loss. The openness of crowdsourcing aligns with this philosophy.

—Stein Kretsinger, Founding Executive, Advertising.com

One small groundbreaking firm, Kaggle, has taken charge and leads the production of PA crowdsourcing. Kaggle has launched more than 175 PA competitions, including the essay-grading and dark matter ones mentioned above. Over 50,000 registered competitors are incentivized by prizes that usually come to around $10,000 to $20,000, but climb as high as $500,000. These diverse minds from more than 200 universities and 100 countries, about half of them academics, have submitted over 144,000 attempts for the win.2

An enterprise turns research and development completely on its head in order to leverage PA crowdsourcing. Instead of protecting strategy, plans, data, and research discoveries as carefully guarded secrets, a company must launch them fully into the public spotlight. And instead of maintaining careful control over its research staff, the organization gets whoever cares to take part in the contest and join in on the fun (for fully public contests, as is the norm). Crowdsourcing must be the most ironic, fantastical way for a business to compete.

Crowdsourcing forms a match made in heaven. Kaggle's founder and CEO, Anthony Goldbloom (a Forbes “30 Under 30: Technology” honoree), spells out the love story: “On one hand, you've got companies with piles and piles of data, but not the ability to get as much out of it as they would like. On the other hand, you've got researchers and data scientists, particularly at university, who are pining for access to real-world data” in order to test and refine their methodologies.

With strong analytics experts increasingly tough to find, seeding your talent pool by reaching out to the masses starts to sound like a pretty good idea. A McKinsey report states, “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” To leave no analytical stone unturned, innovative organizations turn by necessity to the crowd at large. As Kaggle pitches, “There are countless strategies that can be applied to any predictive modeling task, and it is impossible to know at the outset which technique or analyst will be most effective.”

Until a few years ago, most PA competitions were held by academic institutions or research conferences. Kaggle has changed this. With the claim that it “has never failed to outperform a preexisting accuracy benchmark, and to do so resoundingly,” Kaggle has brought commercial credibility to the practice. For example, across this book's Central Tables of 182 PA examples, 14 come from Kaggle competitions—namely:3

Organization                 What Is Predicted                 Central Table to See for More Info
Facebook                     Friendship                        1
dunnhumby                    Supermarket visits                2
Allstate                     Bodily harm from car crashes      3
Heritage Health Prize        Days spent in the hospital        4
Researchers                  HIV progression                   4
New South Wales, Australia   Travel time vis-à-vis traffic     6
University of Melbourne      Awarding of grants                7
Hewlett Foundation           Student grades                    7
Grockit                      Student knowledge                 7
Imperium                     Insults                           8
Ford Motor Co.               Driver inattentiveness            8
Online Privacy Foundation    Psychopathy                       8
Wikipedia                    Editor attrition                  9
CareerBuilder                Job applications                  9

Your Adversary Is Your Amigo

Competition paradoxically breeds cooperation. Kaggle's tagline is “making data science a sport.” But these lab coat competitors don't seem to exhibit the same fierce, cutthroat voraciousness as sweaty athletes out on the field. Despite the cash incentive to come out on top, participants are often driven by the love of science. They display a spirited tendency to collaborate and share. It's the best of coopetition. Netflix Prize leader Martin Chabbert told me the prize's public forum “was also a place where people proposed new ideas; these ideas often inspired us to come up with our own creative innovations.” And Wired magazine wrote, “The prize hunters, even the leaders, are startlingly open about the methods they're using, acting more like academics huddled over a knotty problem than entrepreneurs jostling for a $1 million payday.” When John Elder took part in the competition, he took pause. “It was astonishing how many people were openly sharing and cooperating,” John says. “It comes of what people do out of camaraderie.”

And so a community emerges around each contest, catalyzing a petri dish of great ideas. But John Elder recognizes that disclosure can cost a competitive edge. John and some staff at Elder Research were part of a Netflix Prize team during earlier phases of the contest when much of the major headway was still being made. At one point the team held third place, having employed a key analytical method before any other competitor. The method, you will soon see, was a key ingredient both to winning the Netflix Prize and to building IBM's Watson, the Jeopardy! player. In a collegial spirit, John's team went as far as displaying this choice of method as their very name, thereby revealing their secret weapon. The team was called “Ensemble Experts.”

United Nations

As competitors rounded the final bend of the horse race known as the Netflix Prize (which launched before Kaggle all but took over the field), a handful of key leaders held a tense dead heat. Ironically, the race moved along at a snail's pace, like a sporting event's slow-motion instant replay on TV. The teams faced diminishing returns on their efforts and on their predictive models' complexity: the closer they got to their objective (a 10 percent improvement over Netflix's established method, which would qualify for the $1 million win), the more slowly they progressed.

Despite the glacial pace, it was gripping. The leaders faced dramatic upsets every week as they leapfrogged one another on the contest's public leaderboard. The teams jockeyed for position by way of minuscule improvements.

While nobody, including Netflix, knew if the 10 percent mark was even possible, there was the constant sense that, at any moment, a team could find a breakthrough and catapult into the win zone.

A promising breakthrough popped up in September 2008, temporarily leaving our heroic lay competitors in the dust. Two other teams, BellKor (from AT&T Research) and BigChaos (a strikingly young-looking team from a small analytics start-up in Austria), formed an alliance. They joined forces and blended predictive models to form an über-team. With all the communal cooperation already taking place informally, it was time to make it official.

It was risky to team up. By sharing technology, the teams lost their mutual competitive edge against one another. If they won, they'd need to split the winnings. But if they didn't team up quickly enough, other teams could try the same tactic for the win.

It worked. The teams' predictive models were quite different from one another and, as hoped, the strengths of one model compensated for the weaknesses of the other. By integrating the models, they achieved a performance that enjoyed the best of both models. Only by doing so did the new über-team, BellKor in BigChaos, spring ahead far enough to qualify for—and win—the contest's annual progress prize of $50,000.

Meta-Learning

Here's where the power to advance PA begins. Combining two or more sophisticated predictive models is simple: Just apply predictive modeling to learn how to combine them. Since each model comes about from machine learning, this is an act of “learning on top of learning”—meta-learning.

[Figure: base model 1 and base model 2 each feed into an ensemble model, which is formed by meta-learning.]

An ensemble of two predictive models.

As a result, competitors turned collaborators need not work hard to combine two distinct, intricate models that were developed in very different ways. Instead of digging in and thinking intensely to compare and contrast their theories and techniques, BigChaos team member Andreas Töscher told me, they let predictive modeling do the blending. They trained a new model that sits above the two existing models like a manager. This new ensemble model then considers both models' predictions on a case-by-case basis. For certain cases, it can give more credence to model A rather than B, or the other way around. In this way, the ensemble model is trained to predict which cases are weak points for each component model. There may be many cases where the two models agree, but where they disagree, teaming the models together provides the opportunity to improve performance.
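The "manager" idea can be illustrated with a small simulation. This is only a sketch under loud assumptions: the two base models are simulated stand-ins (true rating plus noise), the "heavy"/"light" segment is a hypothetical case attribute, and the blender is a simple least-squares weight learned per segment. It is not the method the Netflix Prize teams actually used.

```python
import math
import random

random.seed(0)

# Hypothetical cases: (segment, true rating). "segment" stands in for any
# case-level attribute the blender can see, such as how many movies a
# customer has rated.
cases = [(random.choice(["heavy", "light"]), random.uniform(1, 5))
         for _ in range(4000)]

def model_a(segment, truth):
    # Simulated base model A: accurate on "heavy" cases, noisy on "light" ones.
    return truth + random.gauss(0, 0.3 if segment == "heavy" else 1.0)

def model_b(segment, truth):
    # Simulated base model B: the opposite strengths and weaknesses.
    return truth + random.gauss(0, 1.0 if segment == "heavy" else 0.3)

rows = [(seg, model_a(seg, y), model_b(seg, y), y) for seg, y in cases]
train, test = rows[:2000], rows[2000:]

def fit_weight(rows):
    """Least-squares weight w for the blend w*a + (1 - w)*b."""
    numerator = sum((a - b) * (y - b) for _, a, b, y in rows)
    denominator = sum((a - b) ** 2 for _, a, b, y in rows)
    return numerator / denominator

# Meta-learning: learn from data how much to trust model A in each segment.
weights = {seg: fit_weight([r for r in train if r[0] == seg])
           for seg in ("heavy", "light")}

def blend(seg, a, b):
    w = weights[seg]
    return w * a + (1 - w) * b

def rmse(predictions, truths):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truths))
                     / len(truths))

truths = [y for _, _, _, y in test]
rmse_a = rmse([a for _, a, _, _ in test], truths)
rmse_b = rmse([b for _, _, b, _ in test], truths)
rmse_blend = rmse([blend(seg, a, b) for seg, a, b, _ in test], truths)
print(weights, rmse_a, rmse_b, rmse_blend)
```

In this toy setup the blender learns to lean on model A for "heavy" cases and on model B for "light" ones, so the blend's error on held-out cases comes out lower than either base model's. Nobody had to inspect how the base models work; the weights were learned from their predictions alone.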

For the Netflix Prize, the dynamics of the gameplay had now changed, triggering a new flurry of merging and blending as teams consolidated, rolling up into bigger and better competitors. It was like the mergers and acquisitions that take place among companies in a nascent, quickly developing industry.4

This merging and blending outplayed the ingenuity of our heroic lay team PragmaticTheory (the two Montrealers named Martin). But the team's success had gained the attention of its adversaries, and an invitation was extended by über-team BellKor in BigChaos to join and form an über-über-team. And so BellKor's Pragmatic Chaos came to be:

[Figure: BellKor and BigChaos combine by meta-learning into BellKor in BigChaos; BellKor in BigChaos and PragmaticTheory then combine by meta-learning into BellKor's Pragmatic Chaos.]

BellKor's Pragmatic Chaos made the grade. On June 26, 2009, it broke through the 10 percent barrier that qualified the super team for the $1 million Netflix Prize.

A Big Fish at the Big Finish

But it wasn't over yet. Per contest rules, this accomplishment triggered a 30-day countdown, during which all teams could continue to submit entries.

An archnemesis had emerged, called none other than The Ensemble (not to be confused with the team that included John Elder, Ensemble Experts, which employed ensemble methods internally but did not involve combining separate teams). This rival gave BellKor's Pragmatic Chaos a serious run for the money by rolling together teams like mad. By the end, it was an amalgam of over 20 teams, one of which openly absorbed any and all teams that wished to join. By uploading its predictions, a joining team would be rewarded in proportion to the resulting improvement, the bump it contributed to the growing ensemble—but only if the overarching team won, of course. It was like the Borg from Star Trek, an abominable hive-like force that sucks up entire civilizations after declaring menacingly, “You will be assimilated!” A number of teams allowed themselves to be swallowed by this fish that ate the fish that ate the fish. After all, if you can't beat 'em, join 'em.

Although it combined the efforts of only three teams, BellKor's Pragmatic Chaos rallied to compete against this growing force. The 30 days counted down. Neck and neck, the two über-teams madly submitted new entries, tweaking, retweaking, and submitting again, even into the final hours and minutes of this multiple-year contest. Crowdsourcing competitions cultivate a heated push for scientific innovation, engendering focus and drive sometimes compared to that attained during wartime.

Time ran out. The countdown was over and the dust was settling. The contest administrators at Netflix went silent for a few weeks as they assessed and verified. They held yet another undisclosed set of data with which to validate the submissions and determine the final verdict. Here is the top portion of the final leaderboard:

Rank  Team Name                            Best Test Score  Percentage Improvement  Best Submit Time
1     BellKor's Pragmatic Chaos            0.8567           10.06                   2009-07-26 18:18:28
2     The Ensemble                         0.8567           10.06                   2009-07-26 18:38:22
3     Grand Prize Team                     0.8582           9.90                    2009-07-10 21:24:40
4     Opera Solutions and Vandelay United  0.8588           9.84                    2009-07-10 01:12:31
5     Vandelay Industries!                 0.8591           9.81                    2009-07-10 00:32:20
6     PragmaticTheory                      0.8594           9.77                    2009-06-24 12:06:56
7     BellKor in BigChaos                  0.8601           9.70                    2009-05-13 08:14:09

BellKor's Pragmatic Chaos won by a nose. Its performance was so close to The Ensemble's that it was considered a quantitative tie in accord with the contest's posted rules. Because of this, the determining factor was which of the tied entries had been submitted first. At the very end of a multiple-year competition, BellKor's Pragmatic Chaos had uploaded its winning entry just 20 minutes before The Ensemble. The winning team received the cash and the other team received nothing. Netflix CEO Reed Hastings reflected, “That 20 minutes was worth a million dollars.”
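The leaderboard arithmetic and the tiebreak can be reproduced with a short sketch. One assumption is labeled in the code: the baseline score of 0.9525 for Netflix's own system is not given in this chapter; it is the value that reproduces the table's percentages. The function names are illustrative.

```python
from datetime import datetime

# Assumption: baseline score of Netflix's own system on the test set. It is
# not stated in the text; 0.9525 reproduces the percentages in the table.
BASELINE_SCORE = 0.9525

# Top three entries from the final leaderboard, deliberately out of order.
entries = [
    ("The Ensemble", 0.8567, "2009-07-26 18:38:22"),
    ("Grand Prize Team", 0.8582, "2009-07-10 21:24:40"),
    ("BellKor's Pragmatic Chaos", 0.8567, "2009-07-26 18:18:28"),
]

def improvement(score):
    """Percentage improvement over the baseline (lower scores are better)."""
    return round(100 * (BASELINE_SCORE - score) / BASELINE_SCORE, 2)

def submitted(entry):
    return datetime.strptime(entry[2], "%Y-%m-%d %H:%M:%S")

# Rank by score first; among tied scores, the earlier submission wins.
ranked = sorted(entries, key=lambda e: (e[1], submitted(e)))
margin = submitted(ranked[1]) - submitted(ranked[0])

for name, score, _ in ranked:
    print(name, improvement(score))
print("winning margin:", margin)
```

Under that baseline, a score of 0.8567 works out to a 10.06 percent improvement, and the gap between the two tied submissions is 19 minutes and 54 seconds, the "20 minutes" in the quotation above.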

Collective Intelligence

With most things, the average is mediocrity. With decision making, it's often excellence.

—James Surowiecki, The Wisdom of Crowds

Even competitions much simpler than a data mining contest can tap the wisdom held by a crowd. The magic of collective intelligence was lightheartedly demonstrated in 2012 at the Predictive Analytics World (PAW) conference. Charged with drawing attention to his analytics company on the event's exposition floor, Gary Panchoo held a money-guessing contest. Here he is, collecting best guesses as to how many dollar bills are in the container:

[Photograph: a man wearing an ID badge stands behind a jar filled with dollar bills. A sign on the jar reads “Guess how much and keep the cash.”]

The guessers as a group outsmarted every individual guess. The winner was only $10 off the actual amount, $362. But the average of the 61 guesses, $365, was off by just $3.

With no coordinated effort among the guessers, how could this be a common phenomenon? One way to look at it is that all people's overestimations and underestimations even out. If we assume that people guess too high as much as they do too low, averaging cancels out these errors in judgment. No one person can overcome his or her own limited capacity—unless you're a superhero, you can't look at the container of dollars and be superconfident about your estimation. But across a group, the mistakes come out in the wash.
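The cancellation argument can be checked with a quick simulation. This is a sketch, not a reconstruction of the actual contest: the jar total and the number of guessers come from the text, but the guesses themselves are simulated under the stated assumption that people err too high as often as too low.

```python
import random
import statistics

random.seed(42)
TRUE_AMOUNT = 362   # actual dollars in the jar, from the text
N_GUESSERS = 61     # number of guesses, from the text

# Assumption: each guess is off by a random factor, as likely high as low.
guesses = [TRUE_AMOUNT * random.uniform(0.5, 1.5) for _ in range(N_GUESSERS)]

individual_errors = [abs(g - TRUE_AMOUNT) for g in guesses]
crowd_error = abs(statistics.mean(guesses) - TRUE_AMOUNT)

print(f"average individual error:    {statistics.mean(individual_errors):.1f}")
print(f"error of the averaged guess: {crowd_error:.1f}")
```

Whatever the guesses are, the error of the averaged guess can never exceed the average individual error (this follows from the triangle inequality). When the errors are roughly symmetric, as assumed here, the averaged guess tends to land far closer than a typical individual.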

Uniting endows power. By coming together as a group, our limited capacities as individuals are overcome. Moreover, we no longer need to take on the challenging task of identifying the best person for the job. It doesn't matter which person is smartest. A diverse mix best does the trick.

The collective intelligence of a crowd emerges on many occasions, as explored thoroughly by James Surowiecki in his book The Wisdom of Crowds. Examples include:

  • prediction markets, wherein a group of people together estimate the prospects for a horse race, political event, or economic occurrence by way of placing bets (unfortunately, this adept forecasting method cannot usually scale to the domain of PA, in which thousands or millions of predictions are generated by a predictive model);
  • the audience of the TV quiz show Who Wants to Be a Millionaire?, whom contestants may poll to weigh in on questions; and
  • Google's PageRank method, by which a Web page's value and importance are informed by how many links people have created to point to the page.

Human minds aren't the only things that can be effectively merged together. It turns out the aggregate effect emerging from a group extends also to nonhuman crowds—of predictive models.

The Wisdom of Crowds…of Models

The “wisdom of crowds” concept motivates ensembles because it illustrates a key principle of ensembling: Predictions can be improved by averaging the predictions of many.

—Dean Abbott, Abbott Analytics

Like a crowd of people, an ensemble of predictive models benefits from the same “collective intelligence” effect.5 Each model has its strengths and weaknesses. As with guesses made by people, the predictive scores produced by models are imperfect. Some will be too high and some too low. Averaging scores from a mix of models can wipe away much of the error. Why hire the best single employee when you can afford to hire a team whose members compensate for one another's weaknesses? After all, models work for free; a computer uses practically no additional electricity to apply 100 models rather than just one.

Ensemble modeling has taken the PA industry by storm. It's often considered the most important predictive modeling advancement to come to fruition in the first decade of this century. While its success in crowdsourcing competitions has helped bolster its credibility, the craft of ensembling extends well beyond that arena, both in commercial application and in research advancement.

But increasing complexity seems at odds with improved learning. An ensemble of models, which can grow to include thousands, is much more involved than a single model, so it's a move away from the “keep it simple, stupid” (KISS) principle (aka Occam's razor) that's so critical to avoiding overlearning, as discussed in Chapter 4. Before ironing out this irony, let's take a closer look at how ensemble models work.

A Bag of Models

Leo Breiman, one of the Fab Four inventors of CART decision trees (detailed in Chapter 4), developed a leading method for ensemble models called bagging (short for bootstrap aggregating). The way it operates is practically self-evident. Make a bunch of models, a bagful. To predict, have each model make its prediction, and tally up the results. Each model gets to vote (voting is similar to averaging and in some cases is equivalent). The models are endowed with a key characteristic: diversity. Diversity is ensured by building each model on a different subset of the data, in which some examples are randomly duplicated so that they have a stronger influence on the model's learning process, and others are left out completely. Reflecting this random element, one variation on bagging that assembles a number of CART decision trees is dubbed random forests. (Doesn't this make a single tree seem “à la CART”?)

[Figure: many individual models each feed into a single ensemble model.]

A group of models comes together to form an ensemble.
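The two mechanical ingredients of bagging described above, resampling the data and tallying votes, can be sketched as follows. The helper names and the toy votes are hypothetical; any model-building routine could be trained on each sample.

```python
import random
from collections import Counter

random.seed(7)

def bootstrap_sample(examples):
    """Draw len(examples) items at random WITH replacement, so some
    examples appear more than once and others not at all."""
    return [random.choice(examples) for _ in examples]

def vote(predictions):
    """Return the most common prediction among the models' votes."""
    return Counter(predictions).most_common(1)[0][0]

examples = list(range(1000))       # stand-ins for 1,000 training examples
sample = bootstrap_sample(examples)

# Share of the original examples that made it into this sample at all.
share_used = len(set(sample)) / len(examples)
print(f"distinct examples used: {share_used:.0%}")

# Hypothetical votes cast by five models on a single case.
decision = vote(["rent", "skip", "rent", "rent", "skip"])
print("ensemble decision:", decision)
```

On average a sample drawn this way contains roughly 63 percent of the distinct original examples, with the remainder left out and replaced by duplicates. Each model therefore sees a somewhat different view of the data, which is the source of the diversity.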

The idea of collecting models and having them vote is as simple and elegant as it sounds. In fact, other ensemble methods, all variations on the same theme, also sport friendly, self-descriptive names, including bucket of models, bundling, committee of experts, meta-learning, stacked generalization, and TreeNet (some employ voting and others meta-learn as for the Netflix Prize).

The notion of assembling components into a more complex, powerful structure is the very essence of engineering, whether constructing buildings and bridges or programming the operating system that runs your iPhone. Nobody must conceive of the entire massive structure at once—indeed, nobody can anyway. Tiered assembly makes architecting manageable.

An ensemble usually kicks a single model's butt. Check out this attempt by a single decision tree to model a circle:

[Figure: a single CART decision tree's approximation of a circle, plotted with x and y each ranging from -1 to 1. The true circle is dashed; the tree's decision boundary consists only of horizontal and vertical segments.]

This and the following figure are reproduced with permission.6

In this experiment, a CART decision tree was trained over a data set that was manufactured to include positive and negative examples, inside and outside the circle, respectively. Because a decision tree can only compare the predictor variables (in this case, the x and y coordinates) to a fixed value and cannot perform any math on them, the tree's decision boundary consists only of horizontal and vertical lines. No diagonal or curvy boundaries are allowed. The resulting model does correctly label most points as to whether they're inside or outside the circle, but it's clearly a rough, primitive approximation.

Bagging a set of 100 CART trees generates a smoother, more refined model:7

[Figure: the same plot, with x and y each ranging from -1 to 1, for a bagged set of 100 CART trees. The decision boundary is smoother and follows the circle more closely.]

Reproduced with permission.
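A scaled-down version of this kind of experiment can be sketched in plain Python. Everything specific here is an assumption made for illustration: the circle's radius, the sample sizes, the tree depth, and the simplified tree builder (a CART-style learner that splits on Gini impurity over a fixed grid of thresholds). It is not the code behind the figures.

```python
import random

random.seed(1)
RADIUS = 0.7                                   # assumed; not given in the text
THRESHOLDS = [i / 10 for i in range(-9, 10)]   # candidate split values

def make_points(n):
    """Random points in the square; label 1 inside the circle, 0 outside."""
    points = []
    for _ in range(n):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        points.append(((x, y), 1 if x * x + y * y < RADIUS ** 2 else 0))
    return points

def gini(rows):
    """Gini impurity of a list of (point, label) rows."""
    if not rows:
        return 0.0
    p = sum(label for _, label in rows) / len(rows)
    return 2 * p * (1 - p)

def build_tree(rows, depth):
    """Each split compares one coordinate to a fixed value, nothing more."""
    labels = [label for _, label in rows]
    majority = 1 if 2 * sum(labels) > len(labels) else 0
    if depth == 0 or len(set(labels)) <= 1:
        return majority                        # leaf: predict majority label
    best = None
    for axis in (0, 1):
        for threshold in THRESHOLDS:
            below = [row for row in rows if row[0][axis] < threshold]
            above = [row for row in rows if row[0][axis] >= threshold]
            if not below or not above:
                continue
            cost = len(below) * gini(below) + len(above) * gini(above)
            if best is None or cost < best[0]:
                best = (cost, axis, threshold, below, above)
    if best is None:
        return majority
    _, axis, threshold, below, above = best
    return (axis, threshold,
            build_tree(below, depth - 1), build_tree(above, depth - 1))

def predict(tree, point):
    while isinstance(tree, tuple):             # descend until a leaf label
        axis, threshold, below, above = tree
        tree = below if point[axis] < threshold else above
    return tree

def bagged_predict(trees, point):
    votes = sum(predict(tree, point) for tree in trees)
    return 1 if 2 * votes > len(trees) else 0  # majority vote

train, test = make_points(200), make_points(1000)
single_tree = build_tree(train, depth=5)
# 100 trees, each trained on its own bootstrap sample of the training data.
forest = [build_tree([random.choice(train) for _ in train], depth=5)
          for _ in range(100)]

def accuracy(predict_fn):
    return sum(predict_fn(p) == label for p, label in test) / len(test)

single_accuracy = accuracy(lambda p: predict(single_tree, p))
bagged_accuracy = accuracy(lambda p: bagged_predict(forest, p))
print(single_accuracy, bagged_accuracy)
```

Every individual tree still draws only horizontal and vertical boundaries. Because each tree places its boxes slightly differently, though, the majority vote traces a finer staircase around the circle, which is typically where any accuracy gain comes from.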

Ensemble Models in Action

Teams often use an ensemble model to win Kaggle contests.

—Anthony Goldbloom, founder and CEO of Kaggle

Whether assembled by the thousands or pasted together manually (as in the case when Netflix Prize teams joined forces), ensemble models triumph time after time. Research results consistently show that ensembles boost a single model's performance in the general range of 5 to 30 percent, and that integrating more models into an ensemble often continues to improve it further. “The ensemble of a group of models is usually better than most of the individual models it's made up of, and often better than them all,” says Dean Abbott.

Commercial deployment is expanding. Across this book's Central Tables of PA examples, at least eight employed ensemble models: IBM (Jeopardy!-playing Watson computer), the IRS (tax fraud), the Nature Conservancy (donations), Netflix (movie recommendations), Nokia-Siemens (dropped calls), University of California, Berkeley (brain activity, to construct a moving image of what you're seeing), U.S. Department of Defense (fraudulent government invoices), and U.S. Special Forces (job performance).

It seems too good to be true. With ensembles, we are consistently rewarded with better predictive models, often without any new math or formal theory. Is there a catch?

The Generalization Paradox: More Is Less

Ensembles appear to increase complexity…so, their ability to generalize better seems to violate the preference for simplicity summarized by Occam's Razor.

—John Elder, “The Generalization Paradox of Ensembles”

In Chapter 4 we saw that pursuing the heady goal of machine learning, to learn without overlearning, requires striking a careful balance. Building up a predictive model's complexity so that it more closely fits the training data can go only so far. After a certain point, true predictive performance, as measured over a held-aside test set, begins to suffer.

Ensembles remain robust even as they become increasingly complex. They seem to be immune to this limitation, as if soaked in a magic potion against overlearning. John Elder, who humorously calls ensemble models a “secret weapon,” identified this phenomenon in a research paper and dubbed it “the generalization paradox of ensembles.”

John resolves the apparent paradox by redefining complexity, measuring it “by function rather than form.” Ensemble models look more complex—but, he asks, do they act more complex? Instead of considering a model's structural complexity—how big it is or how many components it includes—he measures the complexity of the overall modeling method. He employs a measure called generalized degrees of freedom, which shows how adaptable a modeling method is, how much its resulting predictions change as a result of small experimental changes to the training data. If a small change in the data makes a big difference, the learning method may be brittle, susceptible to the whims of randomness and noise found within any data set. It turns out that this measure of complexity is lower for an ensemble of models than for individual models. Ensembles overadapt less. In this way, ensemble models exhibit less complex behavior, so their success in robustly learning without overlearning isn't paradoxical after all.
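The idea of measuring complexity by behavior can be illustrated with a toy sensitivity test. This is a simplified stand-in in the spirit of that measure, not John Elder's actual procedure: the data are simulated, the single model is a 1-nearest-neighbor predictor (chosen here because it adapts completely to its training data), and the ensemble is 50 such models bagged together.

```python
import random

random.seed(3)
N = 30
xs = [random.uniform(0, 1) for _ in range(N)]
ys = [x + random.gauss(0, 0.2) for x in xs]    # simulated training targets

def nearest_in(sample, i):
    """Index of the training case in `sample` whose x is closest to xs[i]."""
    return min(sample, key=lambda j: abs(xs[j] - xs[i]))

# Single model: 1-nearest-neighbor trained on all the data.
full_sample = list(range(N))
# Ensemble: 50 such models, each trained on its own bootstrap sample.
bootstrap_samples = [[random.randrange(N) for _ in range(N)]
                     for _ in range(50)]

def predict_all(samples, targets):
    """Average, over the models (one per sample), of the prediction for each case."""
    return [sum(targets[nearest_in(s, i)] for s in samples) / len(samples)
            for i in range(N)]

def sensitivity(samples, delta=0.01):
    """Nudge each training target in turn and add up how far that case's
    own prediction moves per unit of nudge."""
    base = predict_all(samples, ys)
    total = 0.0
    for i in range(N):
        nudged = list(ys)
        nudged[i] += delta
        total += (predict_all(samples, nudged)[i] - base[i]) / delta
    return total

single_sensitivity = sensitivity([full_sample])
bagged_sensitivity = sensitivity(bootstrap_samples)
print(single_sensitivity, bagged_sensitivity)
```

The single model's prediction for each training case follows that case's own target exactly, so its total sensitivity equals the number of cases. In the bagged version, each case is missing from some of the bootstrap samples, so its prediction moves less when its target is nudged and the total comes out lower. The ensemble has more parts but, by this measure, behaves less adaptively.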

Enter The Ensemble Effect. By simply joining models together, we enjoy the benefit of cranking up our model's structural complexity while retaining a critical ingredient: robustness against overlearning.

The Sky's the Limit

With the newfound power of ensemble models and the fervor to tackle increasingly grand challenges, what's next? In the following chapter, PA takes on a tremendous one: competing on the TV quiz show Jeopardy!

Notes
