Chapter 8. The Road Ahead

Data Science Today

Kaggle is a marketplace for hosting data science competitions. Companies post their questions and data scientists from all over the world compete to produce the best answers. When a company posts a challenge, it also posts how much it’s willing to pay to anyone who can find an acceptable answer. If you take the questions posted to Kaggle and plot them by value in descending order, the graph looks like Figure 8-1.

Figure 8-1. The value of questions posted to Kaggle matches a long-tail distribution

This is a classic long-tail distribution. Half the value of the Kaggle market is concentrated in about 6% of the questions, while the other half is spread out among the remaining 94%. This distribution gets skewed even more if you consider all the questions with no direct monetary value—questions that offer incentives like jobs or kudos.
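The concentration figure above is the kind of number you can compute from any list of question values. Here is a minimal sketch in Python, assuming the values are available as a plain list of prize amounts. The function name and the sample prizes are hypothetical and used only for illustration; they are not Kaggle data.

```python
def share_of_questions_for_half_value(prizes):
    """Fraction of questions, counted from the largest prize down,
    needed to account for at least half of the total prize value."""
    if not prizes:
        raise ValueError("prizes must not be empty")
    ranked = sorted(prizes, reverse=True)  # head of the distribution first
    total = sum(ranked)
    running = 0
    for count, prize in enumerate(ranked, start=1):
        running += prize
        if 2 * running >= total:  # reached half of the total value
            return count / len(ranked)
    return 1.0


# Hypothetical prize amounts, made up for this example (not real Kaggle data).
sample_prizes = [100, 50, 10, 10, 10, 5, 5, 5, 3, 2]
share = share_of_questions_for_half_value(sample_prizes)
print(f"{share:.0%} of the questions hold half of the value")
```

The smaller the result, the more value is concentrated in the head of the market; a flat distribution, where every question is worth the same, gives 50%.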

I strongly suspect that the wider data science market has the same long-tail shape. If I could get every company to declare every question that could be answered using data science, and what they would offer to have those questions answered, I believe that the concentration of value would look very similar to that of the Kaggle market.

Today, the prevailing wisdom for making money in data science is to go after the head of the market using centralized capabilities. Companies collect expensive resources (like specialized expertise and advanced technologies) to go after a small number of high-profile questions (like how to diagnose an illness or predict the failure of a machine). It reminds me of the early days of computing when, if you needed computation, you had to find an institution with enough resources to support a mainframe.

The computing industry changed. The steady progress of Moore’s law decentralized the market. Computing became cheaper and diffused outward from specialized institutions to everyday corporations to everyday people. I think that data science is poised for a similar progression. But instead of Moore’s law, I think the catalyst for change will be the rise of collaboration among data scientists.

Data Science Tomorrow

I believe that the future of data science is in collaborations like outside-in innovation and open research. It means putting a hypothesis out in a public forum, writing openly with collaborators from other companies and holding open peer reviews. I used to think that this would all require expensive and exotic social business platforms, but, so far, it hasn’t.

Take, for example, work I did in business model simulation. Entirely new industries can form as the result of business model innovations, but testing out new ideas is still largely a matter of trial and error. I started looking into faster, more effective ways of finding solid business model innovations. We held an open collaboration between business strategists and data scientists using only a Google Hangout. I spent 8 weeks collaboratively writing a paper in Google Docs. We held an open peer review of the paper using a CrowdChat that generated 164 comments and reached over 45,000 people (Figure 8-2).

Figure 8-2. We peer reviewed research in business model simulation as an open collaboration between business strategists and data scientists

This kind of collaboration is a small part of what is ultimately possible. It’s possible to build entire virtual communities adept at extracting value from data, but it will take time. A community would have to go well beyond the kinds of data and analytics centers of excellence in many corporations today. It would have to evolve into a self-sustaining hub for increasing data literacy, curating and sharing data, doing research and peer review.

At first, these collaborations would only be capable of tackling problems in the skinniest parts of the data science long tail. They would be limited to solving general problems that don’t require much specialized expertise or data. For example, this was exactly how Chapter 5 was born. It started out as a discussion of occasional productivity problems we were having on my team. We eventually decided to hold an open conversation on the matter. In a 30-minute CrowdChat session, we got 179 posts, 600 views, and reached over 28,000 people (Figure 8-3). I summarized the findings based on the most influential comments, then I took the summary and used it as the basis for Chapter 5.

Figure 8-3. Chapter 5 was born as an open collaboration between data scientists and software engineers

But eventually, open data science collaborations will mature until they are trusted enough to take on even our most important business questions. Data science tools will become smarter, cheaper, and easier to use. Data transfer will become more secure and reliable. Data owners will become much less paranoid about what they are willing to share. Open collaboration could be especially beneficial to companies experiencing difficulties finding qualified staff.

I believe that in the not-so-distant future, the most important questions in business will be answered by self-selecting teams of data scientists and business change agents from different companies. I’m looking forward to the next wave, when business leaders turn first to open data science communities when they want to hammer out plans for the next big thing.
