
Chapter 3
The Data Effect
A Glut at the End of the Rainbow

We are up to our ears in data, but how much can this raw material really tell us? What actually makes it predictive? What are the most bizarre discoveries from data? When we find an interesting insight, why are we often better off not asking why? In what way is bigger data more dangerous? How do we avoid being fooled by random noise and ensure scientific discoveries are trustworthy?

Spotting the big data tsunami, analytics enthusiasts exclaim, “Surf's up!”

We've entered the golden age of predictive discoveries. A frenzy of number crunching churns out a bonanza of colorful, valuable, and sometimes surprising insights:1

  • People who “like” curly fries on Facebook are more intelligent.
  • Typing with proper capitalization indicates creditworthiness.
  • Users of the Chrome and Firefox browsers make better employees.
  • Men who skip breakfast are at greater risk for coronary heart disease.
  • The demand for Pop-Tarts spikes before a hurricane.
  • Female-named hurricanes are more deadly.
  • High-crime neighborhoods demand more Uber rides.

A Cautionary Tale: Orange Lemons

Look like fun? Before you dive in, be warned: This spree of data exploration must be tamed with strict quality control. It's easy to get it wrong and end up with egg on your face.

In 2012, a Seattle Times article led with an eye-catching predictive discovery: “An orange used car is least likely to be a lemon.”2 This insight came from a predictive analytics (PA) competition to detect which used cars are bad buys (lemons). While insights also emerged pertaining to other car attributes—such as make, model, year, trim level, and size—the apparent advantage of being orange caught the most attention. Responding to quizzical expressions, data wonks offered creative explanations, such as the idea that owners who select an unusual car color tend to have more of a “connection” to and take better care of their vehicle.

Examined alone, the “orange lemon” discovery appeared sound from a mathematical perspective. Here's the specific result:

[Figure: two shaded bars compare lemon rates: 12.3 percent of all cars are lemons, versus 8.2 percent of orange cars.]

This shows orange cars turn out to be lemons one-third less often than average. Put another way, if you buy a car that's not orange, you increase your risk by about 50 percent, since the overall 12.3 percent rate is roughly one and a half times the orange cars' 8.2 percent.
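
To check the arithmetic yourself, here it is in a few lines of Python (treating the overall lemon rate as a stand-in for the non-orange rate, as above):

overall = 0.123  # lemon rate among all cars
orange = 0.082   # lemon rate among orange cars

print(1 - orange / overall)  # ~0.33: orange cars are lemons about one-third less often
print(overall / orange - 1)  # ~0.50: skipping orange raises your risk by about half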

Well-established statistics appeared to back up this “colorful” discovery. A formal assessment indicated it was statistically significant, meaning that the chances were slim this pattern would have appeared only by random chance. It seemed safe to assume the finding was sound. To be more specific, a standard mathematical test indicated there was less than a 1 percent chance this trend would show up in the data if orange cars weren't actually more reliable.

But something had gone terribly wrong. The “orange car” insight later proved inconclusive. The statistical test had been applied in a flawed manner; the press had run with the finding prematurely. As data gets bigger, so does a potential pitfall in the application of common, established statistical methods. We'll dive into this dilemma later—but for now here's the issue in a nutshell: Testing many predictors means taking many small risks of being fooled by randomness, adding up to one big risk.

This chapter first establishes just how important an opportunity data represents, and then shows how to securely tap it—here's the flow of topics:

  1. The source: where data comes from.
    • Why logs of transactions aren't boring
    • Why social data isn't always an oxymoron
    • Estimating the mass mood of the public
    • The massive recycling effort that supplies data for PA
  2. The enormousness: how much there is and what the big in big data actually means.
  3. The excitement: why data is so predictive—The Data Effect.
  4. The gold rush: what data tells us—46 fascinating discoveries.
  5. Caveat #1: why causality is generally an unknown.
  6. Caveat #2: what went wrong with the “orange lemons” case and how to tap data's potential without drawing false conclusions.

The Source: Otherwise Boring Logs Fuel Prediction

Today's predictive gold mine came about by happy accident. Most data accumulates not to serve analytics, but as the by-product of routine tasks.

Consider all the phone calls you make. Your wireless provider logs your communications for billing and other transactional purposes. Boring! And yet these logs also reveal a wellspring of behavioral trends that characterize you and your contacts (and serve law enforcement activities, as discussed in the previous chapter). Companies leverage the predictive power of such consumer behavior to, for example, keep consumers around. By predicting who's going to leave, companies target offers—such as a free phone—in order to retain would-be defectors.

“Social data” may sound like an oxymoron to many, but data about social behavior predicts like nobody's business. Optus, a leading cell phone carrier in Australia, doubled the precision of predicting whether a customer will cancel by incorporating the behavior of each customer's social contacts: If the people you regularly call defect to another wireless provider, there's an up to sevenfold greater risk you will also do so, as more than one telecom has discovered.3

Beyond the telecom industry, another immense sector of modern society stockpiles records of person-to-person interactions: social media sites like Facebook, Twitter, and an endless assortment of blogs. Seeing the potential, the financial industry taps these sites to help assess the creditworthiness of would-be debtors, and the Internal Revenue Service taps them to check out taxpayers. City health departments predict restaurant health code violations via Yelp reviews.

In short, what you've posted online may help determine whether your application for a credit card is approved, whether your tax return is audited, and whether a restaurant is inspected.

Social Media and Mass Public Mood

Can a population's overall average mood predict mass behavior? Many bet yes. A trending area of research taps social media posts to gauge the aggregate mood of the public. Researchers evaluate these readings of mass mood for their ability to predict all kinds of population-level behaviors, including the stock market, product sales, top music hits, movie box-office revenue, Academy Award and Grammy winners, elections, and unemployment statistics.

Emotions don't usually fall within the domain of PA. Feelings are not concrete things easily tabulated in a spreadsheet as facts and figures. They're ephemeral and subjective. Sure, they may be the most important element of our human condition, but their subtleties place them outside the reach of most hard science. While a good number of neuroscientists are wiring up the noggins of undergraduate students in exchange for free pizza, many data scientists view this work as irrelevant, far removed from common applications of PA.

But social media blares our emotions. Bloggers, tweeters, and posters broadcast their thoughts, thereby transforming from private, introverted “Dear Diary” writers into vocal extroverts. A mass chorus expresses freely, unfettered by any preordained purpose or restriction. Bloggers alone produce an estimated 864,000 posts per day, and in so doing act as an army of volunteers who express sentiment on the public's behalf.

Take a look at how our collective mood moves. Here's sample output of a word-based measure of mood by researchers at Indiana University. Based on a feed from Twitter, it produces daily readings of mass mood for the dimensions calm versus anxious, and happy versus unhappy (shown from October 2008 to December 2008):4

[Figure: daily Twitter-based readings of mass mood along two dimensions, calm versus anxious and happy versus unhappy, October 2008 through December 2008.]

As we oscillate between elation and despair, this jittery movement reveals that we are a moody bunch. The time range shown includes a U.S. presidential election as well as Thanksgiving. Calmness rebounds once the voting of Election Day is complete. Happiness spikes on Thanksgiving.

A tantalizing prospect lingers for black-box trading if the mass mood approach bears fruit predicting the stock market. While there's not yet publicly known proof that it could predict the market well enough to make a killing, optimistic pioneers believe mass mood will become a fundamental component of trading analysis, alongside standard economic gauges. Entrepreneurial quant Randy Saaf said, “We see ‘sentiment’ as a diversified asset class like foreign markets, bonds, [and] gold.”

Recycling the Data Dump

One man's trash is another man's treasure.

By leveraging social media in a new way, researchers discover newfound value in oversharing. People tweet whatever the heck suits their fancy. If someone tweets, “I feel awesome today! Just wanted to share,” you might assume it interests only the tweeter's friends and family, and there's no value for the rest of the world. As with most applications of PA, though, the data at hand is readily repurposed.

This repurposing signifies a mammoth recycling initiative: the discovery of new value in the data avalanche. Like the millions of chicken feet the United States has realized it can sell to China rather than throw away, our phenomenal accumulation of 1's and 0's surprises us over and over with newfound applications. Calamari was originally considered junk, as was the basis for white chocolate. My mom, Lisa Schamberg, makes photographic art of compost, documenting the beauty inherent in organic waste. Mad scientists want to make use of nuclear waste. I can assure you, data scientists are just as mad.

Growing up watching Sesame Street, I got a kick out of the creature Oscar the Grouch, who lives in a garbage can and sings a song about how much he loves trash. It turns out Oscar isn't so crazy after all.

If social media amounts to large-scale, unregulated graffiti, there's a similar phenomenon with the millions of encyclopedias' worth of organizational data scrawled onto magnetic media for miscellaneous operational functions. It's a zillion tons of human refuse that does not smell. What do ScarJo, Iceland, and borscht have in common with data? They're all beautiful things with unwelcoming names.

Most data is not accumulated for the purpose of prediction, but PA can learn from this massive recording of events in the same fashion that you can learn from your accumulation of life experience. As a simple example, take a company's record of your e-mail address and membership status—utilitarian, yet also predictive. During one project, I found that users who signed up with an Earthlink.com (an Internet provider) e-mail address were almost five times more likely to convert from a free trial user level to the premium paid level than those with a Hotmail.com e-mail address. This could be because those who divulged only a temporary e-mail account—which is the intent for some users of free e-mail services like Hotmail—were, on average, less committed to their trial membership. Whatever the reason, this kind of discovery helps a company predict who will be acquired as a paying customer.
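
As a toy sketch of how such a rate comparison is computed (the sign-up records here are made up for illustration, not the project's actual data), in Python:

from collections import defaultdict

# Hypothetical sign-up records: (e-mail domain, converted to paid membership?)
signups = [
    ("earthlink.com", True), ("earthlink.com", True), ("earthlink.com", False),
    ("hotmail.com", False), ("hotmail.com", True), ("hotmail.com", False),
]

totals = defaultdict(int)
paid = defaultdict(int)
for domain, converted in signups:
    totals[domain] += 1
    paid[domain] += int(converted)

# Conversion rate per e-mail domain; the ratio between domains is the lift
for domain in totals:
    print(domain, paid[domain] / totals[domain])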

The Instrumentation of Everything We Do

Count what is countable, measure what is measurable, and what is not measurable, make measurable.

—Galileo

Intangibles that appear to be completely intractable can be measured.

—Douglas Hubbard, How to Measure Anything

Some historians assert that we are now experiencing the information revolution, following the agricultural and industrial revolutions. I buy it. Colin Shearer, a PA leader at IBM, eloquently states that the key to the information revolution is “the instrumentation of everything.” More and more, each move you make, online and offline, is recorded, including transactions conducted, websites visited, movies watched, links clicked, friends called, opinions posted, dental procedures endured, sports games won (if you're a professional athlete), traffic cameras passed, flights taken, Wikipedia articles edited, and earthquakes experienced. Countless sensors deploy daily. Mobile devices, robots, and shipping containers record movement, interactions, inventory counts, and radiation levels. Personal health monitors watch your vital signs and exercise routine. The mass migration of online applications from your desktop to the cloud (aka software as a service) makes even more of your computer use recordable by organizations.

Free public data is also busting out, so a wealth of knowledge sits at your fingertips. Following the open data movement, often embracing a not-for-profit philosophy, many data sets are available online from fields like biodiversity, business, cartography, chemistry, genomics, and medicine. Look at one central index, www.kdnuggets.com/datasets/, and you'll see what amounts to lists of lists of data resources. The Federal Chief Information Officer of the United States launched Data.gov “to increase public access to high value, machine readable datasets generated by…the Government.” Data.gov sports over 390,000 data sets, including data about marine casualties, pollution, active mines, earthquakes, and commercial flights. Its growth is prescribed: A directive in 2009 obliged all U.S. federal agencies to post at least three “high-value” data sets.

Far afield of government activities, a completely different accumulation of data answers the more forbidden question, “Are you having fun yet?” For a dating website, I predicted occurrences of online flirtation. After all, as data shows, you're much more likely to be retained as a customer if you get some positive attention. When it comes to recording and predicting human behavior, what's more fundamental than our mating rituals? For this project, actions such as a virtual “wink,” a message, or a request to connect as “friends” counted as “flirtatious.” Working up a sort of digital tabloid magazine, I produced reports such as the average waiting times before a flirt is reciprocated, depending on the characteristics of the customer. For example:

Sexual orientation       Average hours before reciprocal flirt (if any)
Man seeking man          40
Woman seeking man        33
Man seeking woman        43
Woman seeking woman      55

For your entertainment, here's an actual piece of code from a short 175-line computer program called “Flirtback” that I wrote (in the computer language AWK, an oldie but goodie):

sex = sexuality[flirt_to];        # sexual orientation of the flirt's recipient
sumbysex[sex] += (delta/(60*60)); # accumulate elapsed time, converted from seconds to hours
nPairsSex[sex]++                  # count this pair toward the group's average

Come on, you have to admit that's some exciting stuff—enough to keep any computer programmer awake.

Data expresses the bare essence of human behavior. What it doesn't capture is the full dimension and innuendo of human experience—and that's just fine for PA. Because organizations record the aspects of our actions important to their function, one extraordinarily elusive, daunting task has already been completed in the production of raw materials for PA: abstracting the infinite complexity of everyday life and thereby defining which of its endless details are salient.

A new window on the world has opened. Professor Erik Brynjolfsson, an economist at the Massachusetts Institute of Technology, compares this mass instrumentation of human behavior to another historic breakthrough in scientific observation. “The microscope, invented four centuries ago, allowed people to see and measure things as never before—at the cellular level,” said The New York Times, explaining Brynjolfsson's perspective. “It was a revolution in measurement. Data measurement is the modern equivalent of the microscope.” But rather than viewing things previously too small to see, now we view things previously too big.

Batten Down the Hatches: TMI

There are over 358 million trillion gallons of water on Earth.

—A TV advertisement for Ice Mountain Spring Water

The world now contains more photographs than bricks.

—John Szarkowski, Director of Photography, Museum of Modern Art (back in 1976)

All this tracking dumps upon us a data glut. Six hundred blog posts are published per minute; by 2011, there were over 100 million blogs across WordPress and Tumblr alone. As for Twitter, “Every day, the world writes the equivalent of a 10-million-page book in Tweets or 8,163 copies of Leo Tolstoy's War and Peace,” says the official Twitter blog. Stacking that many copies of the book “would reach the height of about 1,470 feet, nearly the ground-to-roof height of Taiwan's Taipei 101, the second tallest building in the world.”

YouTube gains an hour of video each second. Estimates put the World Wide Web at over 8.32 billion Web pages. Millions of online retail transactions take place every hour. More photos are taken daily than in the first 100 years of photography, more in two minutes than in all of the 1800s, with 200 million uploaded to Facebook every day. Femto-photography takes a trillion frames per second to capture light in motion and “see around corners.” Over 7 billion mobile devices capture usage statistics. More than 100 things per second connect to the Internet, and this rate is increasing; by 2020 the “Internet of Everything” will connect 50 billion things, Cisco projects.

Making all this growth affordable, the cost of data storage is sinking like a rock. The cost per gigabyte on a hard drive has been exponentially decaying since the 1980s, when it approached $1 million. By 2014, it reached 3 cents. We can afford to never delete.5

Government intelligence aims to archive vast portions of all communication. The U.S. National Security Agency's $2 billion Utah Data Center, a facility five times the size of the U.S. Capitol, is designed to store mammoth archives of human interactions, including complete phone conversations and e-mail messages.

Scientific researchers are uncovering and capturing more and more data, and in so doing revolutionizing their own paradigms. Astronomers are building a new array of radio telescopes that will generate an exabyte of data per day (an exabyte is a quintillion bytes; a byte is a single value, an integer between 0 and 255, often representing a single letter, digit, or punctuation mark). Using satellites, wildlife conservationists track manta rays, considered vulnerable to extinction, as the creatures travel as far as 680 miles in search of food. In biology, as famed futurist Ray Kurzweil portends, given that the price to map a human genome has dropped from $1 billion to a few thousand dollars, information technology will prove to be the domain from which this field's greatest advances emerge.

Overall, data is growing at an incomprehensible speed, an estimated 2.5 quintillion bytes (that is, 2.5 exabytes) per day. A quintillion is a 1 with 18 zeros. In 1986, the data stored by computers, printed on double-sided paper, could have covered the Earth's landmasses; by 2011, it could have done so with two layers of books.

The growth is exponential. Data more than doubles every three years. This brought us to an estimated 8 zettabytes in 2015—that's 8,000,000,000,000,000,000,000 (21 zeros) bytes. Welcome to Big Bang 2.0.

The next logical question is: What's the most valuable thing to do with all this stuff? This book's answer: Learn from it how to predict.

Who's Your Data?

Good, better, best, bested. How do you like that for a declension, young man?

—Edward Albee, Who's Afraid of Virginia Woolf?

Bow your head: The hot buzzword big data has ascended to royalty. It's in every news clip, every data science presentation, and every advertisement for analytics solutions. It's a crisis! It's an opportunity! It's a crisis of opportunity!

Big data does not exist. The elephant in the room is that there is no elephant in the room. What's exciting about data isn't how much of it there is, but how quickly it is growing. We're in a persistent state of awe at data's sheer quantity because of one thing that does not change: There's always so much more today than yesterday. Size is relative, not absolute. If we use the word big today, we'll quickly run out of adjectives: “big data,” “bigger data,” “even bigger data,” and “biggest data.” The International Conference on Very Large Databases has been running since 1975. We have a dearth of vocabulary with which to describe a wealth of data.6

“Big data” is also grammatically incorrect. It's like saying “big water.” Rather, it should be “a lot of data” or “plenty of data.”

What's big about data is the excitement—about its rate of growth and about its predictive value.

The Data Effect: It's Predictive

The leg bone connected to the knee bone, and the knee bone connected to the thigh bone, and the thigh bone connected to the hip bone.

—From the song “Dry Bones”

There's a ton of it—so what? What guarantees that all this residual rubbish, this by-product of organizational functions, holds value? It's no more than an extremely long list of observed events, an obsessive-compulsive enumeration of things that have happened.

The answer is simple. Everything in the world is affected by connections to other things—things touch and cause one another in all sorts of ways—and this is reflected in data. For example:

  • Your purchases relate to your shopping history, online behavior, and preferred payment method, and to the actions of your social contacts. Data reveals how to predict consumer behavior from these elements.
  • Your health relates to your life choices and environment, and therefore data captures connections predictive of health based on type of neighborhood and household characteristics.
  • Your job satisfaction relates to your salary, evaluations, and promotions, and data mirrors this reality.

Data always speaks. It always has a story to tell, and there's always something to learn from it. Data scientists see this over and over again across PA projects. Pull some data together and, although you can never be certain what you'll find, you can be sure you'll discover valuable connections by decoding the language it speaks and listening. That's The Data Effect in a nutshell.

This is the assumption behind the leap of faith an organization takes when undertaking PA. Budgeting the staff and tools for a PA project requires this leap, knowing not what specifically will be discovered and yet trusting that something will be. Sitting on an expert panel at Predictive Analytics World, leading UK consultant Tom Khabaza put it this way: “Projects never fail due to lack of patterns.” With The Data Effect in mind, the scientist rests easy, secure the analysis will be fruitful.

Data is the new oil. It's this century's greatest possession and often considered an organization's most important strategic asset. Several thought leaders have dubbed it as such—“the new oil”—including European Consumer Commissioner Meglena Kuneva, who also calls it “the new currency of the digital world.” It's not hyperbole. In 2012, Apple, Inc. overtook Exxon Mobil Corp., the world's largest oil company, as the most valuable publicly traded company in the world. Unlike oil, data is extremely easy to transport and cheap to store. It's a bigger geyser, and this one is never going to run out.

The Building Blocks: Predictors

Prediction starts small. PA's building block is the predictor variable, a single value measured for each individual (known informally as a factor, attribute, feature, or predictor, and more formally as an independent variable). For example, recency, the number of weeks since the last time an individual made a purchase, committed a crime, or exhibited a medical symptom, often reveals the chances that individual will do it again in the near term. In many arenas, it makes sense to begin with the most recently active people first, whether for marketing contact, criminal investigation, or clinical assessment.

Similarly, frequency—the number of times the individual has exhibited the behavior—is also a common, fruitful measure. People who have done something a lot are more likely to do it again.
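
For instance, here is a minimal sketch of deriving both predictors from a transaction log (hypothetical records and field names, in Python; real projects run this over millions of individuals):

from datetime import date

# Hypothetical purchase log: customer ID -> dates on which the behavior occurred
log = {
    "customer_01": [date(2013, 1, 5), date(2013, 3, 2), date(2013, 3, 30)],
    "customer_02": [date(2012, 11, 20)],
}

as_of = date(2013, 4, 15)  # the date on which we make the prediction

for customer, dates in log.items():
    recency = (as_of - max(dates)).days // 7  # whole weeks since the last event
    frequency = len(dates)                    # how many times the behavior occurred
    print(customer, "recency:", recency, "weeks; frequency:", frequency)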

In fact, it is usually what individuals have done that predicts what they will do. And so PA feeds on data that extends past dry yet essential demographics like location and gender to include behavioral predictors such as recency, frequency, purchases, financial activity, and product usage such as calls and Web surfing. These behaviors are often the most valuable—it's always a behavior that we seek to predict, and indeed behavior predicts behavior. As Jean-Paul Sartre put it, “[A man's] true self is dictated by his actions.”

PA builds its power by combining dozens—or even hundreds—of predictors. You give the machine everything you know about each individual, and let 'er rip. The core learning technology to combine these elements is where the real scientific magic takes place. That learning process is the topic of the next chapter; for now, let's look at some interesting individual predictors.

Far Out, Bizarre, and Surprising Insights

Some predictors are more fun to talk about than others.

Are customers more profitable if they don't think? Does crime increase after a sporting event? Does hunger dramatically influence a judge's life-altering decisions? Do online daters more consistently rated as attractive receive less interest? Can promotions increase the chance you'll quit your job? Do vegetarians miss fewer flights? Does your e-mail address reveal your intentions?

Yes, yes, yes, yes, yes, yes, and yes!

Welcome to the Ripley's Believe It or Not! of data science. Poring over a potpourri of prospective predictors, PA's aim isn't only to assess human hunches by testing relationships that seem to make sense, but also to explore a boundless playing field of possible truths beyond the realms of intuition. And so, with The Data Effect in play, PA drops onto your desk connections that seem to defy logic. As strange, mystifying, or unexpected as they may seem, these discoveries help predict.

Here are some colorful discoveries, each pertaining to a single predictor variable (for each example's citation, see the Notes at www.PredictiveNotes.com).

Bizarre and Surprising Insights—Consumer Behavior

Each entry gives the insight, the organization behind it, and a suggested explanation.7

  • Insight: Guys literally drool over sports cars. Male college student subjects produce measurably more saliva when presented with images of sports cars or money.
    Organization: Northwestern University Kellogg School of Management.
    Suggested explanation: Consumer impulses are physiological cousins of hunger.
  • Insight: If you buy diapers, you are more likely to also buy beer. A pharmacy chain found this across 90 days of evening shopping across dozens of outlets (urban myth to some, but based on reported results).
    Organization: Osco Drug.
    Suggested explanation: Daddy needs a beer.
  • Insight: Dolls and candy bars. Sixty percent of customers who buy a Barbie doll buy one of three types of candy bars.
    Organization: Walmart.
    Suggested explanation: Kids come along for errands.
  • Insight: Pop-Tarts before a hurricane. Prehurricane, Strawberry Pop-Tart sales increased about sevenfold.
    Organization: Walmart.
    Suggested explanation: In preparation before an act of nature, people stock up on comfort or nonperishable foods.
  • Insight: Staplers reveal hires. The purchase of a stapler often accompanies the purchase of paper, waste baskets, scissors, paper clips, folders, and so on.
    Organization: A large retailer.
    Suggested explanation: Stapler purchases are often a part of a complete office kit for a new employee.
  • Insight: Higher crime, more Uber rides. In San Francisco, the areas with the most prostitution, alcohol, theft, and burglary are most positively correlated with Uber trips.
    Organization: Uber.
    Suggested explanation: “We hypothesized that crime should be a proxy for nonresidential population.…” Uber riders are not causing more crime. Right, guys?
  • Insight: Mac users book more expensive hotels. Orbitz users on an Apple Mac spend up to 30 percent more than Windows users when booking a hotel reservation. Orbitz applies this insight, altering displayed options according to your operating system.
    Organization: Orbitz.
    Suggested explanation: Macs are often more expensive than Windows computers, so Mac users may on average have greater financial resources.
  • Insight: Your inclination to buy varies by time of day. For retail websites, the peak is 8:00 PM; for dating, late at night; for finance, around 1:00 PM; for travel, just after 10:00 AM. This is not the amount of website traffic, but the propensity to buy of those who are already on the website.
    Organization: Survey of websites.
    Suggested explanation: The impetus to complete certain kinds of transactions is higher during certain times of day.
  • Insight: Your e-mail address reveals your level of commitment. Customers who register for a free account with an Earthlink.com e-mail address are almost five times more likely to convert to a paid, premium-level membership than those with a Hotmail.com e-mail address.
    Organization: An online dating website.
    Suggested explanation: Disclosing permanent or primary e-mail accounts reveals a longer-term intention.
  • Insight: Banner ads affect you more than you think. Although you may feel you've learned to ignore them, people who see a merchant's banner ad are 61 percent more likely to subsequently perform a related search, and this drives a 249 percent increase in clicks on the merchant's paid textual ads in the search results.
    Organization: Yahoo!
    Suggested explanation: Advertising exerts a subconscious effect.
  • Insight: Companies win by not prompting customers to think. Contacting actively engaged customers can backfire—direct mailing financial service customers who have already opened several accounts decreases the chances they will open more accounts (more details in Chapter 7).
    Organization: U.S. Bank.
    Suggested explanation: Customers who have already accumulated many credit accounts are susceptible to impulse buys (e.g., when they walk into a bank branch) but, when contacted at home, will respond by considering the decision and possibly researching competing products online. They would have been more likely to make the purchase if left to their own devices.
  • Insight: Your Web browsing reveals your intentions. Wireless customers who check online when their contract period ends are more likely to defect to a competing cell phone company.
    Organization: A major North American wireless carrier.
    Suggested explanation: Averse to early termination fees, those intending to switch carriers remind themselves when they'll be free to change over.
  • Insight: Friends stick to the same cell phone company (a social effect). If you switch wireless carriers, your contacts are in turn up to seven times more likely to follow suit.
    Organization: A major North American wireless carrier; Optus (Australian telecom) saw a similar effect.
    Suggested explanation: People experience social influence and/or heed financial incentives for in-network calling.

Bizarre and Surprising Insights—Finance and Insurance

  • Insight: Low credit rating, more car accidents. If your credit score is higher, car insurance companies will lower your premium, since you are a lower driving risk. People with poor credit ratings are charged more for car insurance. In fact, a low credit score can increase your premium more than an at-fault car accident; missing two payments can as much as double your premium.
    Organization: Automobile insurers.
    Suggested explanation: “Research indicates that people who manage their personal finances responsibly tend to manage other important aspects of their life with that same level of responsibility, and that would include being responsible behind the wheel of their car,” Donald Hanson of the National Association of Independent Insurers theorizes.
  • Insight: Your shopping habits foretell your reliability as a debtor. If you use your credit card at a drinking establishment, you're a greater risk to miss credit card payments; at the dentist, lower risk; buy cheap, generic rather than name-brand automotive oil, greater risk; buy felt pads that affix to chair legs to protect the floor, lower risk.
    Organization: Canadian Tire (a major retail and financial services company).
    Suggested explanation: More cautionary activity such as seeing the dentist reflects a more conservative or well-planned lifestyle.
  • Insight: Typing with proper capitalization indicates creditworthiness. Online loan applicants who complete the application form with the correct case are more dependable debtors. Those who complete the form with all lowercase letters are slightly less reliable payers; typing in all capitals reveals even less reliability.
    Organization: A financial services startup company.
    Suggested explanation: Adherence to grammatical rules reflects a general propensity to correctly comply.
  • Insight: Small businesses' credit risk depends on the owner's behavior as a consumer. Unlike business loans in general, when it comes to a small business, consumer-level data about the owner is more predictive of credit risk performance than business-level data (and combining both data sources is best of all).
    Organization: Creditors to the leasing industry.
    Suggested explanation: A small business's behavior largely reflects the choices and habits of one individual: the owner.

Bizarre and Surprising Insights—Healthcare

  • Insight: Genetics foretell cheating wives. Within a certain genetic cluster, having more genes shared by a heterosexual couple means more infidelity by the female.
    Organization: University of New Mexico.
    Suggested explanation: We're programmed to avoid inbreeding, since there are benefits to genetic diversity.
  • Insight: Early retirement means earlier death. For a certain working category of males in Austria, each additional year of early retirement decreases life expectancy by 1.8 months.
    Organization: University of Zurich.
    Suggested explanation: Unhealthy habits such as smoking and drinking follow retirement. Voltaire said, “Work spares us from three evils: boredom, vice, and need.” Malcolm Forbes said, “Retirement kills more people than hard work ever did.”
  • Insight: Men who skip breakfast get more coronary heart disease. American men 45 to 82 who skip breakfast showed a 27 percent higher risk of coronary heart disease over a 16-year period.
    Organization: Harvard University medical researchers.
    Suggested explanation: Besides direct health effects—if any—eating breakfast may be a proxy for lifestyle: People who skip breakfast may lead more stressful lives and “were more likely to be smokers, to work full time, to be unmarried, to be less physically active, and to drink more alcohol.”
  • Insight: Google search trends predict disease outbreaks. Certain searches for flu-related information provide insight into current trends in the spread of the influenza virus.
    Organization: Google Flu Trends.
    Suggested explanation: People with symptoms or in the vicinity of others with symptoms seek further information.
  • Insight: Smokers suffer less from repetitive motion disorder. In certain work environments, people who smoke cigarettes are less likely to develop carpal tunnel syndrome.
    Organization: A major metropolitan newspaper, conducting research on its own staff's health.
    Suggested explanation: Smokers take more breaks.
  • Insight: Positive health habits are contagious (a social effect). If you quit smoking, your close contacts become 36 percent less likely to smoke. Your chance of becoming obese increases by 57 percent if you have a friend who becomes obese.
    Organization: Research institutions.
    Suggested explanation: People are strongly influenced by their social environment.
  • Insight: Happiness is contagious (a social effect). Each additional Facebook friend who is happy increases your chances of being happy by roughly 9 percent.
    Organization: Harvard University.
    Suggested explanation: “Waves of happiness…spread throughout the network.”
  • Insight: Knee surgery choices make a big difference. After ACL-reconstruction knee surgery, walking on knees was rated “difficult or impossible” by twice as many patients who donated their own patellar tissue as a graft source rather than hamstring tissue.
    Organization: Medical research institutions in Sweden.
    Suggested explanation: The patellar ligament runs across your kneecap, so grafting from it causes injury in that location.
  • Insight: Music expedites poststroke recovery and improves mood. Stroke patients who listen to music for a couple of hours a day more greatly improve their verbal memory and attention span and improve their mood, as measured by a psychological test.
    Organization: Cognitive Brain Research Unit, Department of Psychology, University of Helsinki, and Helsinki Brain Research Centre, Finland.
    Suggested explanation: “Music listening activates a widespread bilateral network of brain regions related to attention, semantic processing, memory, motor functions, and emotional processing.”
  • Insight: Yoga improves your mood. Long-term yoga practitioners showed benefits in a psychological test for mood in comparison to nonyoga practitioners, including a higher “vigor” score.
    Organization: Research institutions in Japan.
    Suggested explanation: Yoga is designed for, and practiced with the intent of, the attainment of tranquility.

Bizarre and Surprising Insights—Crime and Law Enforcement

  • Insight: Suicide bombers do not buy life insurance. An analysis of bank data of suspected terrorists revealed a propensity to not hold a life insurance policy.
    Organization: A large UK bank.
    Suggested explanation: Suicide nullifies a life insurance policy.
  • Insight: Unlike lightning, crime strikes twice. Crime is more likely to repeat nearby, spreading like earthquake aftershocks.
    Organization: Departments of math, computer science, statistics, criminology, and law in California universities.
    Suggested explanation: Perpetrators “repeatedly attack clusters of nearby targets because local vulnerabilities are well-known to the offenders.”
  • Insight: Crime rises with public sporting events. College football upset losses correspond to a 112 percent increase in assaults.
    Organization: University of Colorado.
    Suggested explanation: Psychological theories of fan aggression are offered.
  • Insight: Crime rises after elections. In India, crime is lower during an election year and rises soon after elections.
    Organization: Researchers in India.
    Suggested explanation: Incumbent politicians crack down on crime more forcefully when running for reelection.
  • Insight: Phone card sales predict danger in the Congo. Impending massacres in the Congo are presaged by spikes in the sale of prepaid phone cards.
    Organization: CellTel (African telecom).
    Suggested explanation: Prepaid cards denominated in U.S. dollars serve as in-pocket security against inflation for people “sensing impending chaos.”
  • Insight: Hungry judges rule negatively. Judicial parole decisions immediately after a food break are about 65 percent favorable, which then drops gradually to almost zero percent before the next break. If the judges are hungry, you are more likely to stay in prison.
    Organization: Columbia University and Ben Gurion University (Israel).
    Suggested explanation: Hunger and/or fatigue leave decision makers feeling less forgiving.

Bizarre and Surprising Insights—Miscellaneous

  • Insight: Music taste predicts political affiliation. Kenny Chesney and George Strait fans are most likely conservative, Rihanna and Jay-Z fans liberal. Republicans can be more accurately predicted by music preferences than Democrats because they display slightly less diversity in music taste. Metal fans can go either way, spanning the political spectrum.
    Organization: The Echo Nest (a music data company).
    Suggested explanation: Personality types entail certain predilections in both musical and political preferences (this is the author's hypothesis; the researchers do not offer a hypothesis).
  • Insight: Online dating: Be cool and unreligious to succeed. Online dating messages that initiate first contact and include the word awesome are more than twice as likely to elicit a response as those with sexy. Messages with “your pretty” get fewer responses than those with “you're pretty.” “Howdy” is better than “Hey.” “Band” does better than “literature” and “video games.” “Atheist” far surpasses most major religions, but “Zeus” is even better.
    Organization: OkCupid (online dating website).
    Suggested explanation: There is value in avoiding the overused or trite; video games are not a strong aphrodisiac.
  • Insight: Hot or not? People consistently considered attractive get less attention. Online daters rated with a higher variance of attractiveness ratings receive more messages than others with the same average rating but less variance. A greater range of opinions—more disagreement on looks—results in receiving more contact.
    Organization: OkCupid.
    Suggested explanation: People often feel they don't have a chance with someone who appears universally attractive. When less competition is expected, there is more incentive to initiate contact.
  • Insight: Users of the Chrome and Firefox browsers make better employees. Among hourly employees engaged in front-line service and sales-based positions, those who use these two custom Web browsers perform better on employment assessment metrics and stay on longer.
    Organization: A human resources professional services firm, over employee data from Xerox and other firms.
    Suggested explanation: “The fact that you took the time to install [another browser] shows…that you are an informed consumer…that you care about your productivity and made an active choice.”
  • Insight: A job promotion can lead to quitting. In one division of HP, promotions increase the risk an employee will leave unless accompanied by sufficient increases in compensation; promotions without raises hurt more than help.
    Organization: Hewlett-Packard.
    Suggested explanation: Increased responsibilities are perceived as burdensome if not financially rewarded.
  • Insight: More engaged employees have fewer accidents. Among oil refinery workers, a one percentage-point increase in team employee engagement is associated with a 4 percent decrease in the number of safety incidents per employee.
    Organization: Shell.
    Suggested explanation: More engaged workers are more attentive and focused.
  • Insight: Higher status, less polite. Editors on Wikipedia who exhibit politeness are more likely to be elected to “administrative” status that grants greater operational authority. However, once elected, an editor's politeness decreases.
    Organization: Researchers examining Wikipedia behavior.
    Suggested explanation: “Politeness theory predicts a negative correlation between politeness and the power of the requester.”
  • Insight: Vegetarians miss fewer flights. Airline customers who preorder a vegetarian meal are more likely to make their flight.
    Organization: An airline.
    Suggested explanation: The knowledge of a personalized or specific meal awaiting the customer provides an incentive or establishes a sense of commitment.
  • Insight: Smart people like curly fries. Liking “Curly Fries” on Facebook is predictive of high intelligence.
    Organization: Researchers at the University of Cambridge and Microsoft Research.
    Suggested explanation: An intelligent person was the first to like this Facebook page, “and his friends saw it, and by homophily, we know that he probably had smart friends, and so it spread to them…,” and so on.
  • Insight: A photo's quality is predictable from its caption. Even without looking at the picture itself, key words from its caption foretell whether a human would subjectively rate the photo as “good.” The words Peru, tombs, trails, and boats corresponded with better photos, whereas the words graduation and CEO tend to appear with lower-quality photos.
    Organization: (Not available.)
    Suggested explanation: Certain events and locations are conducive to or provide incentive for capturing more picturesque photos.
  • Insight: Female-named hurricanes are more deadly. Based on a study of the most damaging hurricanes in the United States during six recent decades, the ones with “relatively feminine” names killed an average of 42 people, almost three times the 15 killed by hurricanes with “relatively male” names.
    Organization: University researchers.
    Suggested explanation: This may result from “a hazardous form of implicit sexism.” Psychological experiments in a related study “suggested that this is because feminine- versus masculine-named hurricanes are perceived as less risky and thus motivate less preparedness.…Individuals systematically underestimate their vulnerability to hurricanes with more feminine names.”
  • Insight: Men on the Titanic faced much greater risk than women. A woman on the Titanic was almost four times as likely to survive as a man. Most men died and most women lived.
    Organization: Miscellaneous researchers.
    Suggested explanation: Priority for access to lifeboats was given to women.
  • Insight: Solo rockers die younger than those in bands. Although all rock stars face higher risk, solo rock stars suffer twice the risk of early death as rock band members.
    Organization: Public health offices in the UK.
    Suggested explanation: Band members benefit from peer support, and solo artists exhibit even riskier behavior.

Caveat #1: Correlation Does Not Imply Causation

Satisfaction came in the chain reaction.

—From the song “Disco Inferno,” by The Trammps

The preceding tables, packed with fun-filled facts, do not explain a single thing.

Take note, the third column is headed “Suggested Explanation.” While the left column's discoveries are validated by data, the reasons behind them are unknown. Every explanation put forth, each entry in the rightmost column, is pure conjecture with absolutely no hard facts to back it up.

The dilemma is, as it is often said, correlation does not imply causation.8 The discovery of a predictive relationship between A and B does not mean one causes the other, not even indirectly. No way, no how.

Consider this: Increased ice cream sales correspond with increased shark attacks. Why do you think that is? A causal explanation could be that eating ice cream makes us taste better to sharks:

[Diagram: eating ice cream → humans taste better → more shark attacks]

But another explanation is that, rather than one being caused by the other, they are both caused by the same thing. On cold days, people eat less ice cream and also swim less; on warm days, they do the opposite:

[Diagram: warm weather → more ice cream eaten by humans; warm weather → more humans eaten by sharks]

Take the example of smokers getting less carpal tunnel syndrome, from the table of healthcare examples. One explanation is that smokers take more breaks:

[Diagram: smoking cigarettes → more breaks → less carpal tunnel syndrome]

But another could be that there's some mysterious chemical in your bloodstream that influences both things:

[Diagram: mysterious chemical in the bloodstream → more prone to cigarette addiction; mysterious chemical → less prone to carpal tunnel syndrome]

I totally made that up. But the truth is that finding the connection between smoking and carpal tunnel syndrome in and of itself provides no evidence that one explanation is more likely than the other. With this in mind, take another look through the tables. The same rule applies to each example. We know the what, but we don't know the why.

When applying PA, we generally don't have firm knowledge about causation, and we often don't necessarily care. For many PA projects, the value comes from prediction, with only an avocational interest in understanding the world and figuring out what makes it tick.

Causality is elusive, tough to nail down. We naturally assume things do influence one another in some way, and we conceive of these effects in physical, chemical, medical, financial, or psychological terms. The noble scientists in these fields have their work cut out for them as they work to establish and characterize causal links.

In this way, data scientists have it easier with PA. It just needs to work; prediction trumps explanation. PA operates with extreme solution-oriented intent. The whole point, the “ka-ching” of value, comes in driving decisions from many individual predictions, one per patient, customer, or person of any kind. And while PA often delivers meaningful insights akin to those of various social sciences, this is usually a side effect, not the primary objective.

This makes PA a kind of “metascience” that transcends the taxonomy of natural and social sciences, abstracting across them by learning from any and all data sources that would typically serve biology, criminology, economics, education, epidemiology, medicine, political science, psychology, or sociology. PA's mission is to engineer solutions. As for the data employed and the insights gained, the tactic in play is: “Whatever works.”

And yet even hard-nosed scientists fight the urge to overexplain. It's human nature, but it's dangerous. It's the difference between good science and bad science.

Stein Kretsinger, founding executive of Advertising.com and a director at Elder Research, tells a classic story of our overly interpretive minds. In the early 1990s, as a graduate student, Stein was leading a medical research meeting, assessing the factors that determine how long it takes to wean off a respirator. As this was before the advent of PowerPoint projection, Stein displayed the factors, one at a time, via graphs on overhead transparencies. The team of healthcare experts nodded their heads, offering one explanation after another for the relationships shown in the data. After going through a few, though, Stein realized he'd been placing the transparencies with the wrong side up, thus projecting mirror images that depicted the opposite of the true relationships. After he flipped them to the correct side, the experts seemed just as comfortable as before, offering new explanations for what was now the very opposite effect of each factor. Our thinking is malleable—people readily find underlying theories to explain just about anything.

In another case, a published medical study discovered that women who happened to be receiving hormone replacement therapy showed a lower incidence of coronary heart disease. Could it be that a new treatment for this disease had been discovered?

[Diagram: hormone replacement therapy received → less likely to develop coronary heart disease]

Later, a proper control experiment disproved this false conclusion. Instead, the currently held explanation is that more affluent women had access to the hormone replacement therapy, and these same women had better health habits overall:

[Diagram: more affluent → more likely to receive hormone replacement therapy; more affluent → better health habits → less likely to develop coronary heart disease]

Prematurely jumping to conclusions about causality is bad science that leads to bad medical treatment. This kind of research snafu is not an isolated case. According to The Wall Street Journal, the number of retracted journal publications has surged in recent years.

But, in this arena, the line between apt and inept sometimes blurs. Twenty years ago, while in graduate school, I befriended a colleague, a chain smoker who was nevertheless brilliant with the mathematics behind probability theory. He would hang you out to dry if you attempted to criticize his bad smoking habit on the basis of clinical studies. “Smoking studies have no control group,” he'd snap.9 He was questioning the common causal conclusion:

[Diagram: smoking cigarettes → health problems]

One day in front of the computer science building, as I kept my distance from his cloud of smoke, he drove this point home. New to the study of probability, I suddenly realized what he was saying and, looking at him incredulously, asked, “You mean to say that it's possible smoking studies actually reflect that stupid people smoke, and that these people also do other stupid things, and only those other things poorly affect their health?” By this logic, I had been stupid for not considering him quite possibly both stupid and healthy.

[Diagram: less intelligent → more likely to smoke cigarettes; less intelligent → more likely to do unhealthy things unrelated to smoking → health problems]

He exhaled a lungful of smoke triumphantly as if he'd won the argument and said with no irony, “Yes!” The same position had also been espoused in the 1950s by an early founder of modern statistics, Ronald Fisher. He was a pipe-smoking curmudgeon who attacked the government-supported publicity about tobacco risks, calling it egregious fearmongering.

In addressing the effects of tobacco, renowned healthcare statistician David Salsburg wrote that the very meaning of cause and effect is “a deep philosophical problem…that gnaws at the heart of scientific thought.” Due in part to our understanding of how inhaled agents actively lead to genetic mutations that create cancerous cells, the scientific community has concluded that cigarettes are causal in their connection to cancer. While I implore scientists not to overinterpret results, I also implore you not to smoke.

Caveat #2: Securing Sound Discoveries

The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.

—Bertrand Russell

Even before suggesting any causal explanation for a correlation observed in data, you had better verify it's actually a real trend rather than misleading noise.

At the beginning of this chapter, we saw that data can lead us astray, tempting us—and several mass media outlets—to believe orange cars last longer. In that data, used cars sporting this flashy color turned out to be lemons 33 percent less often. However, subsequent analysis has severely weakened the confidence in this discovery, relegating it to inconclusive. What went wrong?

Warning! Big data brings big potential—but also big danger. With more data, a unique pitfall often dupes even the brightest of data scientists. This hidden hazard can undermine the process that evaluates for statistical significance, the gold standard of scientific soundness. And what a hazard it is! A bogus discovery can spell disaster. You may buy an orange car—or undergo an ineffective medical procedure—for no good reason. As the aphorisms tell us, bad information is worse than no information at all; misplaced confidence is seldom found again.

This peril seems paradoxical. If data's so valuable, why should we suffer from obtaining more and more of it? Statistics has long advised that having more examples is better. A longer list of cases provides the means to more scrupulously assess a trend. Can you imagine what the downside of more data might be? As you'll see in a moment, it's a thought-provoking, dramatic plot twist.

The fate of science—and sleeping well at night—depends on deterring the danger. The very notion of empirical discovery is at stake. To leverage the extraordinary opportunity of today's data explosion, we need a surefire way to determine whether an observed trend is real, rather than a random artifact of the data.

Statistics approaches this challenge in a very particular way. It tells us the chances the observed trend could randomly appear even if the effect were not real. That is, it answers this question:10

  1. Question that statistics can answer: If orange cars were actually no more reliable than used cars in general, what would be the probability that this strong a trend—depicting orange cars as more reliable—would show in data anyway, just by random chance?

With any discovery in data, there's always some possibility we've been Fooled by Randomness, as Nassim Taleb titled his compelling book. The book reveals the dangerous tendency people have to subscribe to unfounded explanations for their own successes and failures, rather than correctly attributing many happenings to sheer randomness. The scientific antidote to this failing is probability, which Taleb affectionately dubs “a branch of applied skepticism.”

Statistics is the resource we rely on to gauge probability. It answers the orange car question above by calculating the probability that what's been observed in data would occur randomly if orange cars actually held no advantage. The calculation takes data size into account—in this case, there were 72,983 used cars varying across 15 colors, of which 415 were orange.11

Calculated answer to the question: 0.68 percent

Looks like a safe bet. Common practice considers this risk acceptably remote, low enough to at least tentatively believe the data. But don't buy an orange car just yet—or write about the finding in a newspaper for that matter.
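
The exact test applied isn't specified here, but a one-sided binomial test of the orange cars' lemon count against the overall rate lands in the same ballpark; a minimal sketch in Python:

from scipy.stats import binom

n_orange = 415                         # orange cars among the 72,983 in the data
overall_rate = 0.123                   # lemon rate across all cars
lemons_seen = round(0.082 * n_orange)  # about 34 orange lemons observed

# If orange cars were no more reliable than average, how likely is it that
# this few (or fewer) of the 415 would be lemons just by chance?
p_value = binom.cdf(lemons_seen, n_orange, overall_rate)
print(p_value)  # roughly 0.006 to 0.007, in line with the 0.68 percent reported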

What Went Wrong: Accumulating Risk

In China when you're one in a million, there are 1,300 people just like you.

—Bill Gates

So if there had only been a 1 percent long shot that we'd be misled by randomness, what went wrong?

The experimenters' mistake was to not account for running many small risks, which had added up to one big one. In addition to checking whether being orange is predictive of car reliability, they also checked each of the other 14 colors, as well as the make, model, year, trim level, type of transmission, size, and more. For each of these factors, they repeatedly ran the risk of being fooled by randomness.

Probability is relative, affected entirely by context. With additional background information, a seemingly unlikely event turns out to be not so special after all. Imagine your friend calls to tell you, “I won the jackpot at hundred-to-one odds!” You might get a little excited. “Wow!”

Now imagine your friend adds, “By the way, I'm only talking about one of 70 times that I spun the jackpot wheel.” The occurrence that had at first seemed special suddenly has a new context, positioned alongside a number of less remarkable episodes. Instead of exclaiming wow, you might instead do some arithmetic. The probability of losing a spin is 99 percent. If you spin twice, the chance of losing both is 99 percent × 99 percent, which is about 98 percent. Although you'll probably lose both spins, why stop at two? The more times you spin, the lower the chances of never winning once. To figure out the probability of losing 70 times in a row, multiply 99 percent by itself 70 times, aka 0.99 raised to the power of 70. That comes to just under 0.5. Let your friend know that nothing special happened—the odds of winning at least once were about 50/50.
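
The same check, in Python:

p_lose = 0.99          # chance of losing any single spin
spins = 70

p_lose_all = p_lose ** spins
print(p_lose_all)      # ~0.495: losing all 70 spins is just under a coin flip
print(1 - p_lose_all)  # ~0.505: about 50/50 odds of winning at least once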

Special cases aren't so special after all. By the same sort of reasoning, we might be skeptical about the merits of the famed and fortuned. Do the most successful elite hold talents as elevated as their singular status? As Taleb put it in Fooled by Randomness, “I am not saying that Warren Buffett is not skilled; only that a large population of random investors will almost necessarily produce someone with his track records just by luck.”

Play enough and you'll eventually win. Likewise, press your luck repeatedly and you'll eventually lose. Imagine your same well-intentioned friend calls to tell you, “I discovered that orange cars are more reliable, and the stats say there's only a 1 percent chance this phenomenon would appear in the data if it weren't true.” You might get a little impressed. “Interesting discovery!”

Now imagine your friend adds, “By the way, I'm only talking about one among dozens of car factors—my computer program systematically went through and checked each one.” Both of your friend's stories enthusiastically led with a “remarkable” event—a jackpot win or a predictive discovery. But the numerous other less remarkable attempts—that often go unmentioned—are just as pertinent to each story's conclusion.

Wake up and smell the probability. Imagine we test 70 characteristics of cars that in reality are not predictive of lemons. But suppose each test carries a 1 percent risk that the data will falsely show a predictive effect just by random chance. The accumulated risk piles up. As with the jackpot wheel, there's a 50/50 chance the unlikely event will eventually take place—that you will stumble upon a random perturbation that, considered in isolation, is compelling enough to mislead.
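The jackpot arithmetic and the 70-predictor scenario are one and the same computation. A few lines of Python make the accumulation of risk concrete (assuming, as the chapter does for illustration, independent tests at a 1 percent risk apiece):

```python
# Probability of being fooled at least once across many independent tests,
# each carrying a small false-positive risk.
per_test_risk = 0.01   # 1 percent chance a single test is fooled by randomness
num_tests = 70         # number of jackpot spins / predictors checked

prob_never_fooled = (1 - per_test_risk) ** num_tests   # about 0.495
prob_fooled_at_least_once = 1 - prob_never_fooled      # about 0.505

print(f"Chance of at least one bogus 'discovery': {prob_fooled_at_least_once:.1%}")
```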

The Potential and Danger of Automating Science: Vast Search

The most exciting phrase to hear in science, the one that heralds new discoveries, is not “Eureka!” but rather “Hmm…that's funny…”

—Isaac Asimov

A tremendous potential inspires us to face this peril: Predictive modeling automates scientific discovery. Although it may seem like an obvious thing to do in this computer age, trying out each predictor variable is a dramatic departure from the classic scientific method of developing a single hypothesis and then testing it. Your computer essentially acts as hundreds or even thousands of scientists by conducting a broad, exploratory analysis, automatically evaluating an entire batch of predictors. This aggressive hunt for any novel source of predictive information leaves no stone unturned. The process is key to uncovering valuable, unforeseen insights.

Automating this search for valuable predictors empowers science, lessening its dependence on ever-elusive serendipity. Instead of waiting to inadvertently stumble upon revelations or racking our brains for hypotheses, we rely less on luck and hunches by systematically testing many factors. While necessity is the mother of invention, historically speaking, serendipity has long been its daddy. It was only by happy accident that Alexander Fleming happened upon the potent effects of penicillin, by noticing that an old bacteria culture he was about to clean up happened to be contaminated with some mold—which was successfully killing it. Likewise, Minoxidil was inadvertently discovered as a baldness remedy in an unexpected, quizzical moment: “Look, more hair!”

But as exciting a proposition as it is, this automation of data exploration builds up an overall risk of eventually being fooled—at one time or another—by randomness. This inflation of risk comes as a consequence of assessing many characteristics of used cars, for example. The power of automatically testing a batch of predictors may serve us well, but it also exposes us to the very real risk of bogus discoveries.

Let's call this issue vast search—the term that industry leader (and Chapter 1's predictive investor) John Elder coined for this form of automated exploration and its associated peril. Repeatedly identified anew across industries and fields of science, this issue is also called the multiple comparisons problem or multiple comparisons trap. John warns, “The problem is so widespread that it is the chief reason for a crisis in experimental science, where most journal results have been discovered to resist replication; that is, to be wrong!”

Statistics darling Nate Silver jumped straight to the issue of vast search when asked generally about the topic of big data on Freakonomics Radio. With a lot of data, he said, “you're going to find lots and lots of correlations through brute force…but the problem is that a high percentage of those, maybe the vast majority, are false correlations, are false positives.…They [appear] statistically significant, but you have so many lottery tickets when you can run an analysis on a [large data set] that you're going to have some one-in-a-million coincidences just by chance alone.”

The casual “mining” of data—analysis of one sort or another to find interesting tidbits and insights—often involves vast search, making it all too easy to dig up a false claim. With this misstep so commonplace, there's a real possibility that some of the predictive discoveries listed in the tables earlier in this chapter could face debunking, depending on whether the researchers have taken proper care. As we'll see in the next chapter, one mischievous professor illustrated the problem of searching and “re-searching” too far and wide when he unearthed a cockamamie relationship between dairy products in Bangladesh and the U.S. stock market.

Bigger data isn't the problem—more specifically, it's wider data. When prepared for PA, data grows in two dimensions—it's a table:

The figure depicts a table of data for predicting bad buys among used cars, including each car's name, year, type, color, and whether it was a bad or OK buy. Two bi-directional arrows label the table's dimensions: across the top, one column per predictor variable (wider means more predictors—vaster search); along the right, one row per training case (longer means more cases—better).

A small sample of data for predicting bad buys among used cars. The complete data is both wider and longer.

As you accrue more examples of cars, people, or whatever you're predicting, the table grows longer (more rows, aka training cases). That's always a good thing. The more training cases to analyze, the more statistically sound.12 Expanding in the other dimension, each row widens (more columns) as more factors—aka predictor variables—are accrued. A certain factor such as car color may only amount to a single column in the data, but since we look at each possible color individually, it has the virtual effect of adding 15 columns to the width, one per color. Overall, the sample data in the figure above is not nearly as wide as data often gets, but even in this case the vast search effect is at play. With wider and wider data, we can only tap the potential if we can avoid the booby trap set by vast search.
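To see how a single factor widens the table, consider how color would typically be prepared for modeling: each of the 15 colors becomes its own indicator column. A minimal pandas sketch, with toy rows and column values invented for illustration:

```python
import pandas as pd

# A toy slice of used-car data; in the real data, 15 distinct colors
# would expand into 15 indicator columns.
cars = pd.DataFrame({
    "year":  [2006, 2004, 2005],
    "color": ["ORANGE", "SILVER", "BLACK"],
})

# One indicator column per color: the table gets wider, and every new
# column is one more predictor the search can be fooled by.
wide = pd.get_dummies(cars, columns=["color"])
print(wide)
```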

A Failsafe for Sound Results

To understand what sort of failsafe mechanism we need, let's revisit the misleading “orange lemons” discovery.

The figure repeats the two shaded bars shown earlier: 12.3 percent of all cars are lemons, compared to 8.2 percent of orange cars.

This 12.3-versus-8.2 result is calculated from four numbers:

  1. There were 72,983 cars, of which 8,976 were lemons.
  2. There were 415 orange cars, of which 34 were lemons.

The standard method—the one that misled researchers as well as the press—evaluates for statistical significance based only on those four numbers. When fed these as input, the test provides a positive result, calculating there was only a 0.68 percent chance we would witness that extreme a difference in orange cars if they were in actuality no more prone to be lemons than cars of other colors.
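The chapter's note presumably spells out the exact test used. As an illustrative sketch (assuming a one-sided binomial test, with the overall 12.3 percent as the null lemon rate for the 415 orange cars), those four numbers yield a p-value well under 1 percent, in the same ballpark as the reported 0.68:

```python
from scipy.stats import binomtest

total_cars, total_lemons = 72_983, 8_976    # all cars
orange_cars, orange_lemons = 415, 34        # orange cars only

# Null hypothesis: orange cars turn out to be lemons at the overall rate.
# One-sided alternative: orange cars turn out to be lemons less often.
base_rate = total_lemons / total_cars
result = binomtest(orange_lemons, orange_cars, p=base_rate, alternative="less")

print(f"p-value: {result.pvalue:.2%}")  # small, on the order of the reported 0.68%
```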

But these four numbers alone do not tell the whole story—the context of the discovery also matters. How vast was the search for such discoveries? How many other factors were also checked for a correlation with whether a car is a lemon?

In other words, if a data scientist hands you these four numbers as “proof” of a discovery, you should ask what it took to find it. Inquire, “How many other things did you also try that came up dry?”

With the breadth of search taken into account, the “orange lemon” discovery collapses. Confidence diminishes, and the finding proves inconclusive. Even if we assume the other 14 colors were the only other factors examined, statistical methods estimate a much less impressive 7.2 percent probability of stumbling by chance alone upon a bogus finding that appears this compelling.13 Although 7.2 percent is lower odds than a coin toss, it's no long shot; by common standards, this is not a publishable result. Moreover, 7.2 percent is an optimistic estimate. We can assume the risk was even higher than that (i.e., worse) since other factors such as car make, model, and year were also available, rendering the search even wider and the opportunities to be duped even more plentiful.
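The cited note details how the 7.2 percent figure was derived. As a generic, hedged illustration of accounting for breadth of search, a permutation test estimates the same kind of quantity: shuffle the lemon labels so that no color truly carries signal, record the most compelling color-based finding each shuffle, and count how often chance alone produces something as extreme as the orange result. The function names and inputs below are hypothetical (NumPy arrays of color labels and lemon flags):

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

def best_pvalue(colors, is_lemon):
    # The single most compelling color-based "discovery" in this data:
    # the smallest one-sided p-value across all colors.
    base_rate = is_lemon.mean()
    return min(
        binomtest(int(is_lemon[colors == c].sum()),
                  int((colors == c).sum()),
                  p=base_rate, alternative="less").pvalue
        for c in np.unique(colors)
    )

def search_adjusted_risk(colors, is_lemon, observed_p, trials=1_000):
    # Shuffle the lemon labels so color truly carries no signal, then see
    # how often the best finding from a full search across colors looks
    # at least as extreme as the one actually observed.
    hits = sum(
        best_pvalue(colors, rng.permutation(is_lemon)) <= observed_p
        for _ in range(trials)
    )
    return hits / trials
```

Whether a simulation like this reproduces the exact 7.2 percent depends on the method behind the published note; because the 15 color tests share the same pool of cars, they aren't independent, which is one reason corrections that assume independent tests may land elsewhere.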

Inconclusive results are no results at all. It may still be true that orange cars are less likely to be lemons, but the likelihood this appeared in the data by chance alone is too high to put a lot of faith in it. In other words, there's not enough evidence to rigorously support the hypothesis. It is, at least for now, relegated to “a fascinating possibility,” only provisionally distinct from any untested theories one might think up.

Want conclusive results? Then get longer data, i.e., more rows of examples. Adequately rigorous failsafe methods that account for the breadth of search set a higher bar. They serve as a more scrupulous filter to eliminate inconclusive findings before they get applied or published. To compensate for this strictness and increase the opportunity to nonetheless attain conclusive results, the best recourse is elongating the list of cases. If the search is vast—that is, if the data is wide—then findings will need to be more compelling in order to pass through the filter. To that end, if there are ample examples with which to confirm findings—in other words, if the data makes up for its width by also being longer—then legitimate findings will have the empirical support they need to be validated.

The Data Effect will prevail so long as there are enough training examples to correctly discern which predictive discoveries are authentic.

A Prevalent Mistake

Despite the seriousness of this mistake, the vast search pitfall regularly trips up even the most well-intentioned data scientists, statisticians, and other researchers. A perfect storm of influences leads to its prevalence:

  • It's elusive. You have to think outside a certain box. The classic application of statistical methods has traditionally focused on evaluating for significance based entirely on the result itself. There's a conceptual leap in moving beyond that to also account for the breadth of search, the full suite of other predictors also considered.
  • It's new. The advent of big data—to be specific, wide data—has only recently intensified this problem, so awareness across the data science community still needs to catch up.
  • Simplicity can deceive. Ironically, although bite-sized anecdotes are more likely to make compelling headlines and draw public attention, they're less likely to be properly screened against failure. It's widely understood that a predictive model, whose job is to combine variables in order to fit the data, can go too far and overfit—a primary topic of the next chapter. Since single-variable insights—such as the “orange lemons” claim and the many examples listed earlier in this chapter's tables—are so much simpler than multivariate models, their potential to hold a spurious aberration is underestimated and so they're often subjected to less rigorous scrutiny.
  • Falsehoods don't look wrong. Without realizing that a pattern may only in actuality be random noise, people creatively formulate compelling causal explanations. This is human nature, but on many occasions it only increases one's attachment to a false discovery.
  • It's a buzzkill. Given the strong incentives to make predictive discoveries, the temptation is there to be less than scrupulous, either intentionally—or, more commonly, with a certain convenient forgetfulness—neglecting to account for the full scope of the search that led to the discovery.

In the big data tsunami, you've got to either sharpen your skills or get out of the water.

Putting All the Predictors Together

There ought to be a rock band named after this chapter's explosive topic, “The Predictors.”14

The number of predictors at our disposal grows along with an unbridled trend: Exploding quantities of increasingly diverse data are springing forth, and organizations are innovating to turn all this unprocessed sap into maple syrup.

The next step is a doozy. To fully leverage predictor variables, we must deftly and intricately combine them with a predictive model. To this end, you can't just stir the bowl with a big spoon. You need an apparatus that learns from the data itself how best to mix and combine it.

Holy combinatorial explosion, Batman! This will make the vast search problem worse—much worse. By combining two predictors—asking, for example, “Are cars with the color black and the make Audi liable to be lemons?”—or even more than two, we build up a much larger batch of relationships to evaluate. This also means a much greater number of opportunities to be fooled by randomness.
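To put rough numbers on the explosion, count the combinations. The predictor tally of 50 below is an arbitrary illustration, not a figure from this chapter:

```python
from math import comb

num_predictors = 50  # arbitrary illustration

print(comb(num_predictors, 1))  # 50 single-predictor checks
print(comb(num_predictors, 2))  # 1,225 pairs to evaluate
print(comb(num_predictors, 3))  # 19,600 triples to evaluate
```

Each additional way of combining predictors multiplies the number of relationships examined—and, with it, the opportunities for randomness to masquerade as discovery.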

Concerned? Overwhelmed? What if I told you there's an intuitive, elegant method for building a predictive model, as well as a simple way to confirm a model's soundness, without the need to mathematically account for the vastness of search? The next chapter shows you how it's done—20 pages on how machine learning works, plus another 12 covering the most practical yet philosophically intriguing question of data science: How can we ensure that what the machine has learned is real, that the predictive model is sound?

Notes
