Because big data and data science are here to stay, this chapter explores key features of data scientists, types of data scientists, and how to become a data scientist, including training programs available and the different types of data scientist career paths.
There are a few key features of data scientists you may have already noticed. These key features are discussed in this section, along with the type of expertise they should have or acquire, and why horizontal knowledge is important. Finally, statistics are presented on the demographics of data scientists.
Data scientists are not statisticians, nor data analysts, nor computer scientists, nor software engineers, nor business analysts. They have some knowledge in each of these areas but also some outside of these areas.
One of the reasons why the gap between statisticians and data scientists grew over the last 15 years is that academic statisticians, who publish theoretical articles (sometimes not based on data analysis) and train statisticians, are… not statisticians anymore. Also, many statisticians think that data science is about analyzing data, but it is more than that. It also involves implementing algorithms that process data automatically to provide automated predictions and actions, for example:
All this involves both statistical science and terabytes of data. People doing this stuff do not call themselves statisticians, in general. They call themselves data scientists. Over time, as statisticians catch up with big data and modern applications, the gap between data science and statistics will shrink.
Also, data scientists are not statisticians because statisticians know a lot of things that are not needed in the data scientist's job: generalized inverse matrices, eigenvalues, stochastic differential equations, and so on. When data scientists do use techniques that rely on this type of mathematics, it is usually via high-level tools (software), where these mathematics are used as black boxes — just like a software engineer who no longer codes in assembly language. Conversely, data scientists need new statistical knowledge, typically data-driven, model-free, robust techniques that apply to modern, large, fast-flowing, sometimes unstructured data. This also includes structuring unstructured data and becoming an expert in taxonomies, NLP (natural language processing, also known as text mining), and tag management systems.
Data scientists are not computer scientists either — first, because they don't need the entire theoretical knowledge computer scientists have, and second, because they need a much better understanding of random processes, experimental design, and sampling, typically areas in which statisticians are experts. Yet data scientists need to be familiar with computational complexity, algorithm design, distributed architecture, and programming (R, SQL, NoSQL, Python, Java, and C++). Independent data scientists can use Perl instead of Python, though Python has become the standard scripting language. Data scientists developing production code and working in teams need to be familiar with the software development life cycle and lean architectures.
Data scientists also need to be domain experts in one or two applied domains (for instance, the author of this book is an expert in both digital analytics and fraud detection), have success stories to share (with metrics used to quantify success), have strong business acumen, and be able to assess the ROI that data science solutions bring to their clients or their boss. Many of these skills can be acquired in a short time period if you already have several years of industry experience and training in a lateral domain (operations research, applied statistics working on large data sets, computer science, engineering, or an MBA with a strong analytic component).
Thus, data scientists also need to be good communicators to understand, and many times guess, what problems their client, boss, or executive management is trying to solve. Translating high-level English into simple, efficient, scalable, replicable, robust, flexible, platform-independent solutions is critical.
Finally, the term analyst is used for a more junior role: people who typically analyze data but do not build systems or architectures that automatically analyze and process data, and that perform automated actions based on automatically detected patterns or insights.
To summarize, data science = Some (computer science) + Some (statistical science) + Some (business management) + Some (software engineering) + Domain expertise + New (statistical science), where
Vertical data scientists have deep technical knowledge in some narrow field. For instance, they might be any of the following:
The key here is that by “vertical data scientist” I mean those with a narrower range of technical skills, such as expertise in all sorts of Lasso-related regressions but with limited knowledge of time series, much less of any computer science. However, deep domain expertise is absolutely necessary to succeed as a data scientist. In some fields, such as stock market arbitrage, online advertising arbitrage, or clinical trials, a lack of understanding of very complex business models, ecosystems, cycles and bubbles, or complex regulations is a guarantee of failure. In fraud detection, you need to keep up with whatever new tricks criminals throw at you (see the section on fraud detection in Chapter 6 for details).
Vertical data scientists are the by-product of a university system that trains a person to become a computer scientist, a statistician, an operations researcher, or an MBA — but not all four at the same time.
In contrast, horizontal data scientists are a blend of business analysts, statisticians, computer scientists, and domain experts. They combine vision with technical knowledge. They might not be experts in eigenvalues, generalized linear models, and other semi-obsolete statistical techniques, but they know about more modern, data-driven techniques applicable to unstructured, streaming, and big data, such as the simple and applied first Analyticbridge theorem to build confidence intervals. They can design robust, efficient, simple, replicable, and scalable code and algorithms.
So by “horizontal data scientist,” I mean that you need cross-discipline knowledge, including some of computer science, statistics, databases, machine learning, Python, and, of course, domain expertise. This is technical knowledge that a typical statistician usually lacks.
Horizontal data scientists also have the following characteristics:
Another part of the data scientist's job is to participate in the database design and data gathering process, and to identify metrics and external data sources useful to maximize value discovery. This includes the ability to distinguish between the deeply buried signal and huge amounts of noise present in any typical big data set. For more details, read the section The Curse of Big Data in Chapter 2 which describes correlogram matching versus correlation comparisons to detect true signal when comparing millions of time series. Also see the section on predictive power in Chapter 6, which discusses a new type of entropy used to identify features that capture signal over those that are “signal-blind.”
Data scientists also need to be experts in robust cross-validation, confidence intervals, and variance reduction techniques to improve accuracy or at least measure and quantify uncertainty. Even numerical analysis can be useful, especially for those involved in data science research. This and the importance of business expertise become clearer in the section on stock trading Chapter 6.
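As a concrete illustration, model-free confidence intervals of the kind mentioned above can be built by resampling alone, with no distributional assumptions. The following is a minimal percentile-bootstrap sketch in Python; the function name, sample data, and confidence level are illustrative assumptions, not material from this book:

```python
import random

def bootstrap_ci(data, stat=lambda x: sum(x) / len(x),
                 n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for any statistic.

    Resamples the data with replacement n_boot times, computes the
    statistic on each resample, and returns the alpha/2 and
    1 - alpha/2 percentiles of the resulting distribution.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Example: 95% CI for the mean of a small (made-up) sample.
sample = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 11.7]
low, high = bootstrap_ci(sample)
```

Increasing `n_boot` reduces the Monte Carlo error of the percentile estimates; it does not by itself tighten the interval, which is driven by the data.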
Finally, there are three main types of analytics that data scientists deal with: descriptive, predictive, and prescriptive. Each type brings business value to the organization in a different way. I would also add “automated analytics” and “root cause analyses” to this list, although they overlap with predictive and prescriptive analytics in some ways.
The broad yet deep knowledge required of a true horizontal data scientist is one of the reasons why recruiters can't find them (so they find and recruit mostly vertical data scientists). Companies are not yet familiar with identifying horizontal data scientists—the true money makers and ROI generators among analytics professionals. The reasons are twofold:
EXERCISE
Can you name a few horizontal data scientists? D.J. Patil, previously a chief data scientist at LinkedIn, is an example of a horizontal data scientist. See if you can research and find a few others.
There are several different types of data scientists, including fake, self-made, amateur, and extreme. The following sections discuss each of these types of data scientists and include examples of amateur and extreme data science.
Fake data science, and what makes a fake data scientist, has been discussed previously: a professional with narrow expertise, possibly because she is at the beginning of her career, who should not be called a data scientist. In the previous section, vertical data scientists were considered not to be real data scientists. Likewise, a statistician with R and SQL programming skills who has never processed a 10-million-row data set is a statistician, not a data scientist. A few books have been written with titles that emphasize data science while being nothing more than relabeled statistics books.
A self-made data scientist is someone without the formal training (a degree in data science) but with all the knowledge and expertise to qualify as a data scientist. Currently, most data scientists are self-made because few programs exist that produce real data scientists. However, over time they will become a minority and eventually disappear. The author of this book is a self-made data scientist, formally educated in computational statistics at the PhD level. Just like business executives existed long before MBA programs were created, data scientists existed before official programs were created. Some of them are the professionals who are:
Another kind of data scientist worth mentioning is the amateur data scientist. The amateur data scientist is the equivalent of the amateur astronomer: well equipped, she has the right tools (such as R, a copy of SAS, or RapidMiner), has access to large data sets, is educated, and has real expertise. She earns money by participating in data science competitions, such as those published on Kaggle, CrowdANALYTIX, and Data Science Central, or by doing freelancing.
With so much data available for free everywhere, and so many open tools, a new kind of analytics practitioner is emerging: the amateur data scientist. Just like the amateur astronomer, the amateur data scientist will significantly contribute to the art and science, and will eventually solve mysteries. Could someone like the Boston Marathon bomber be found because of thousands of amateurs analyzing publicly available data (images, videos, tweets, and so on) with open source tools? After all, amateur astronomers have been detecting exo-planets and much more.
Also, just like the amateur astronomer needs only one expensive tool (a good telescope with data recording capabilities), the amateur data scientist needs only one expensive tool (a good laptop and possibly a subscription to some cloud storage/computing services).
Amateur data scientists might earn money from winning Kaggle contests, working on problems such as identifying a Botnet, explaining the stock market flash crash, defeating Google page ranking algorithms, helping find new complex molecules to fight cancer (analytical chemistry), and predicting solar flares and their intensity. Interested in becoming an amateur data scientist? The next section gives you a first project to get started.
Question: Do large meteors cause multiple small craters or a big one? If meteors usually break up into multiple fragments, or approach the solar system already broken down into several pieces, they might be less dangerous than if they hit with a single, huge punch. That's the idea, although I'm not sure whether this assumption is correct. Even if the opposite is true, it is still worth asking the question about the frequency of binary impact craters.
Eventually, knowing that meteorites arrive in pieces rather than intact could change government policies and priorities, and maybe allow us to stop spending money on projects to detect and blow up meteors (or the other way around).
So how would you go about estimating the chance that a large meteor (hitting Earth) creates multiple small impacts, and how many impacts on average? One idea consists of looking at moon craters and checking how many of them are aligned. Yet what causes meteors to explode before hitting (and thus create multiple craters) is Earth's thick atmosphere, so the moon would not provide good data. And Earth's crust is so geologically active that all crater traces disappear after a few million years. Maybe Venus would be a good source of data? No, it is even worse than Earth. Maybe Mars? No, it's just like the moon. Maybe some moons of Jupiter or Saturn would be good candidates.
After a data source is identified and the questions answered, deeper questions can be asked, such as, When you see a binary crater (two craters, same meteor), what is the average distance between the two craters? This can also help better assess population risks and how many billion dollars NASA should spend on meteor tracking programs.
In any case, as a starter, I did a bit of research and found data at http://www.stecf.org/~ralbrech/amico/intabs/koeberlc.html, along with a map showing impact craters on Earth.
Visually, with the naked eye, it looks like multiple impacts (for example, binary craters) and crater alignment are the norm, not the exception. But the brain can be lousy at estimating probabilities, so a statistical analysis is needed. Note that the first step consists of processing the image to detect craters and extract their coordinates, using some software or writing your own code. But this is still something a good amateur data scientist could do; you can find the right tools on the Internet.
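Once crater coordinates are extracted, the statistical step can be sketched as a simple Monte Carlo test: count the nearly aligned crater triples on the map, then compare that count to what randomly scattered points produce. Everything below (the area-based alignment criterion, the tolerance, the simulation size) is an illustrative assumption, not an analysis of the real crater data:

```python
import random

def collinear_triples(points, tol=0.01):
    """Count triples of points that are nearly aligned.

    A triple counts as 'aligned' when the triangle it forms has an
    area below tol (coordinates assumed scaled to the unit square).
    """
    n = len(points)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                (x1, y1), (x2, y2), (x3, y3) = points[i], points[j], points[k]
                area = abs((x2 - x1) * (y3 - y1)
                           - (x3 - x1) * (y2 - y1)) / 2
                if area < tol:
                    count += 1
    return count

def alignment_p_value(craters, n_sim=500, tol=0.01, seed=1):
    """Monte Carlo p-value: fraction of random point configurations
    showing at least as many aligned triples as the observed map."""
    rng = random.Random(seed)
    observed = collinear_triples(craters, tol)
    n = len(craters)
    hits = sum(
        1 for _ in range(n_sim)
        if collinear_triples(
            [(rng.random(), rng.random()) for _ in range(n)], tol
        ) >= observed
    )
    return hits / n_sim
```

A small p-value would suggest the alignments are real rather than a trick of the eye; on hundreds of craters the brute-force triple loop would need to be replaced with something faster.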
Another interesting type of data scientist is the extreme data scientist: Just like extreme mountain climbing or extreme programming, extreme data science means developing powerful, robust predictive solutions without any statistical models.
Extreme data science is a way of doing data science to deliver high value quickly. It can best be compared to extreme mountain climbing, where individuals manage to climb challenging peaks in uncharted territory, in a minimum amount of time, with a small team and small budget, sometimes successfully using controversial climbing routes or techniques. For instance, Reinhold Messner climbing Mount Everest in winter, alone, in three days, with no oxygen and no Sherpa — yes, he really did it!
Extreme data science borrows from lean six sigma, agile development and analytics, extreme programming, and in many cases advanced statistical analyses on big data performed without statistical models, yet providing superior ROI. (Part of it is due to saving money that would otherwise have been spent on a large team of expensive statisticians.)
Just like extreme mountain climbing, it is not for the faint-hearted: few statisticians can design sound predictive techniques and confidence intervals without mathematical modeling and heavy artillery such as general linear models.
The only data scientists who can succeed with extreme data science are those with good intuition, good judgment, and vision. They combine deep analytics and advanced statistical knowledge with consistently correct gut feelings and the ability to quickly drill down to the essence of any problem. They can offer a better solution to business problems, in one month of work, than a team of “regular,” model-oriented data scientists working six months on the same project. Better also means simpler, more scalable, and more robust. These data scientists tend to be polyvalent. Sometimes their success comes from extensive automation of semi-mundane tasks.
One of the problems today is that many so-called data scientists do extreme data science without knowing it, but without the ability to do it successfully, and the results are miserable: they introduce biases and errors at every step when designing and implementing algorithms. Job interview questions don't test the ability to work as an extreme data scientist. The result? If you had 50 respected professional mountain climbers climbing Everest in winter with no Sherpa, no oxygen, and little money, 48 would fail to reach the summit: 25 would achieve a few milestones (5 almost succeeding), 5 would die, 10 would refuse to do it, and 8 would fail miserably; the remaining 2 would win. Expect the same with extreme data science.
But those who win will do better than anyone equipped with dozens of Sherpas, several million dollars, and all the bells and whistles. Note that because (by design) the extreme data scientists work on small teams, they must be polyvalent — just like Reinhold Messner in his extreme mountain climbing expeditions, where he is a mountain climber, a medical doctor, a cameraman, a meteorologist, and so on, all at once.
The information discussed in this section is based on web traffic statistics and demographics for the most popular data science websites, drawn from Quantcast.com data as well as from internal traffic statistics and other sources.
The United States leads in the number of visitors to data science websites. Typically, visitors from India are the second most frequent, yet are approximately four to five times less frequent than those from the United States. U.S. visitors tend to be in more senior and management roles, whereas visitors from India are more junior, entry-level analysts, with fewer executives. A few places that have experienced recent growth in data science professionals include Ireland, Singapore, and London.
Good sources of such data include:
Data science websites attract highly educated, wealthy males, predominantly of Asian origin, living mostly in the United States.
A few universities and other organizations have started to offer data science degrees, training, and certificates. The following sections discuss a sampling of these.
Following is a list of some of the U.S. universities offering data science programs. Included are descriptions of their programs as listed on their websites.
Data science overlaps multiple traditional disciplines at NYU such as mathematics (pure and applied), statistics, computer science, and an increasingly large number of application domains. It also stands to impact a wide range of spheres—from healthcare to business to government—in which NYU's schools and departments are engaged. (http://datascience.nyu.edu/)
With the Mining Massive Data Sets graduate certificate, you will master efficient, powerful techniques and algorithms for extracting information from large datasets such as the web, social-network graphs, and large document repositories. Take your career to the next level with skills that will give your company the power to gain a competitive advantage.
The Data Mining and Applications graduate certificate introduces many of the important new ideas in data mining and machine learning, explains them in a statistical framework and describes some of their applications to business, science, and technology. (http://scpd.stanford.edu/public/category/courseCategoryCertificateProfile.do?method=load&certificateId=10555807)
Many other institutions have strong analytics programs, including Carnegie Mellon, Harvard, MIT, Georgia Tech, and Wharton (Customer Analytics Initiative). For a map of academic data science programs, visit the following websites:
A typical academic program at universities such as the ones listed here features the following courses:
A few other types of organizations — both private companies and professional organizations — offer certifications or training. Here are a few of them:
Fees range from below $100 for a certification with no exam, to a few thousand dollars for a full program.
It is also possible to get data science training at many professional conferences that focus on analytics, big data, or data science, such as the following:
Vendors such as EMC, SAS, and Teradata also offer valuable training. Websites such as Kaggle.com enable you to participate in data science competitions and get access to real data, and sometimes the award winner is hired by a company such as Facebook or Netflix.
You can get some data science training without having to mortgage your house. Here are a couple of the top free training sites and programs.
Coursera.com offers online training at no cost. The instructors are respected university professors from around the globe, so the material can sometimes feel a bit academic. Here are a few of the offerings, at the time of this writing:
The data science apprenticeship offered at https://www.datasciencecentral.com is online, on demand, self-service, and free. The curriculum contains the following components:
The apprenticeship covers the following basic subjects:
Special tutorials are included focusing on various data science topics such as big data, statistics, visualization, machine learning, and computer science. Several sample data sets are also provided for use in connection with the tutorials.
Students complete several real-life projects using what they learn in the apprenticeship. Examples of projects that are completed during the apprenticeship include the following:
Sample source code is provided with the apprenticeship, along with links to external code such as Perl, Python, R, and C++, and occasionally XML, JavaScript, and other languages. Even “code” to perform logistic regression and predictive modeling with Excel is included, as is a system to develop automated news feeds.
This section discusses career paths for individuals interested in entrepreneurial initiatives. Resources for professionals interested in working for small or large employers and organizations are found in Chapter 8, “Data Science Resources.”
One career option for the data scientist is to become an independent consultant, either full-time or part-time to complement traditional income (salary) with consulting revenue. Consultants typically have years of experience and deep domain expertise, use software and programming languages, and have developed sales and presentation skills. They know how to provide value, how to measure lift, and how to plan a consulting project. The first step is to write a proposal. Consultants can spend as much as 50 percent of their time chasing clients and doing marketing, especially if they charge well above $100/hour.
Attending conferences, staying in touch with former bosses and colleagues, and building a strong network on LinkedIn (with recommendations and endorsements and a few thousand relevant connections) are critical and can take several months to build. To grow your online network, post relevant, noncommercial articles or contribute to forums (answer questions) on niche websites:
For instance, the LinkedIn group “Advanced Business Analytics, Data Mining and Predictive Modeling” has more than 100,000 members. It is the leading data science group, growing by 3,000 members per month, with many subgroups such as big data, visualization, operations research, analytic executives, training, and so on. (See www.linkedin.com/groups/Advanced-Business-Analytics-Data-Mining-35222.)
You can earn money and fame (and new clients) by participating in Kaggle contests (or in our own contests — for instance, see http://bit.ly/1arClAn), or being listed as a consultant on Kaggle. Its “best” consultants bill at $300 per hour (see http://bit.ly/12Qqqnm). However, you compete with thousands of other data scientists from all over the world, the projects tend to be rather small (a few thousand dollars each), and Kaggle takes a big commission on your earnings.
I started without a business plan (15 years later, I still don't have one), no lawyer, no office (still don't have one), no employees, with my wife taking care of our Internet connection and accounting. Now we have a bookkeeper (40 hours per month) and an accountant; they cost about $10,000 per year, but they help with invoicing and collecting money. They are well worth their wages, which represent less than 1 percent of our gross revenue.
You can indeed start with little expense. All entrepreneurs are risk takers, and a little bit of risk now means better rewards in the future, as the money saved on expensive services can help you finance your business and propel its growth. When we incorporated as an LLC in 2000, we used a service called MyCorporation.com (it still exists), which charged us only a few hundred dollars and took care of everything. To this day, we still have no lawyer, though we subscribed to some kind of legal insurance that costs $50/month. And our business model, by design, is not likely to create expensive lawsuits: we stay away from what could cause them. The same applies to training: you can learn a lot for free on the Internet, or from this book, or by attending free classes at Coursera.com.
One big expense is marketing and advertising. One way to minimize marketing costs is to create your own community and grow a large list of people interested in your services, to in essence “own” your market. Posting relevant discussions on LinkedIn groups and creating your own LinkedIn groups is a way to start growing your network. Indeed, when we started growing our community, we eventually became so large—the leading community for data scientists—that we just stopped there, and we now make money by accepting advertising and sending sponsored eBlasts and messages, blended with high-value, fresh noncommercial content, to our subscribers. This also shows that being a digital publisher is a viable option for data scientists interested in a nontraditional career path.
A final note about working with employees: You might work remotely with people in Romania or India, pay a fraction of the cost of an employee in Europe or the United States, not be subject to employment laws, and experience a win-win relationship. Indeed, as your consultancy grows but the number of hours you can work each month stays fixed, one way to scale is to take on more projects and outsource some of the work, maybe the more mundane tasks, abroad. Another way to grow is to automate many tasks such as report production, analyses, data cleaning, and exploratory analysis. If you can run a script that, in one night, works on five projects while you sleep and produces the same amount of work as five consultants, you have found a way to multiply your revenue by three to five times. You can then slash your fee structure by 50 percent and still make more money than your competitors.
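The scaling arithmetic in this paragraph is easy to sanity-check; the dollar figures below are hypothetical, chosen only to illustrate the point:

```python
def automated_revenue(rate_per_project, projects,
                      automation_factor=5, fee_discount=0.5):
    """Monthly revenue after automation: handle automation_factor
    times more projects, at a fee reduced by fee_discount."""
    return rate_per_project * projects * automation_factor * (1 - fee_discount)

# Hypothetical: $2,000 per project, 4 projects per month done by hand.
manual = 2000 * 4                        # $8,000 per month
automated = automated_revenue(2000, 4)   # 5x throughput at half the fee
```

Even after halving the fee, a fivefold throughput gain leaves revenue 2.5 times higher, which is the point the paragraph makes.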
Many websites provide information about salaries, though usually it applies to salaried employees rather than consultants. Job ads on LinkedIn typically show a salary range. Indeed.com and PayScale.com are also good resources. Glassdoor.com lists actual salaries as reported by employees. Data Science Central regularly publishes salary surveys, including for consultants, at http://bit.ly/19vA0n1.
Kaggle is known to charge $300/hour for data science projects, but typically the rate in the United States is between $60/hour and $250/hour based on project complexity and the level of expertise required. The field of statistical litigation tends to command higher rates, but you need to be a well-known data scientist to land a gig.
Here are a few hourly rates advertised by statistical consultants, published on statistics.com in 2013:
The starting point for any project of $3,000 or more is a proposal, which outlines the scope of the project, the analyses to be performed, the data sources, the deliverables and timelines, and the fees. Here are two sample proposals.
This project was about click fraud detection. Take note of the following information:
The proposal included the following work items.
The data mining methodology will mostly rely on robust, efficient, simple, hidden decision tree methods that are easy to implement in Perl or Python.
This project was about web analytics, keyword bidding, sophisticated A/B testing, and keyword scoring. The proposal included the following components:
The bidding algorithm will use a hybrid formula, with strong weight on conversion rate for keywords with reliable conversion rates, and strong weight on score for keywords with no historical data. If the client already uses a bidding algorithm based on conversion rate, we will simply use the client's algorithm, substituting the conversion rate with an appropriate function of the score for keywords with little history. The score is used as a proxy for conversion rate.
If the client does not use a bidding algorithm, we will provide a parametric bidding algorithm depending on a parameter a (possibly multivariate), with a randomly chosen value for each keyword, to eventually detect the optimum a. For instance, a simplified version of the algorithm could be: Multiply bid by a if predicted ROI (for a given keyword) is higher than some threshold, a being a random number between 0.85 and 1.25. In this type of multivariate testing, the control case corresponds to a=1.
The predicted ROI is a simple function of the current bid, the score (which in turn is a predictor of the conversion rate, by design) and the revenue per conversion.
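A minimal sketch of this parametric bidding rule in Python. Treating the score directly as a conversion-rate proxy, and the ROI threshold of 0.2, are illustrative assumptions; as in the proposal, keywords that draw a factor a close to 1 play the role of the control case:

```python
import random

def predicted_roi(bid, score, revenue_per_conversion):
    """Predicted ROI; the score serves as a proxy for conversion rate."""
    return (score * revenue_per_conversion - bid) / bid

def adjusted_bid(bid, score, revenue_per_conversion,
                 threshold=0.2, rng=random.random):
    """Parametric bid: multiply the bid by a random factor a drawn
    uniformly from [0.85, 1.25] when the predicted ROI clears the
    threshold; otherwise leave the bid unchanged."""
    a = 0.85 + 0.40 * rng()
    if predicted_roi(bid, score, revenue_per_conversion) > threshold:
        return bid * a, a
    return bid, a
```

Logging the drawn a for each keyword lets you later estimate the optimum a by comparing realized ROI across values, which is the multivariate-testing step the proposal describes.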
You need Customer Relationship Management (CRM) software that tracks leads (potential clients) and logs all your interactions with them (including contact information, typically more than one contact per company). Log entries look like this:
Another useful tool is FreshBooks.com, an online bookkeeping system that allows you to send invoices (and track whether they were opened) and is great at measuring money flows per month. I use it as a financial CRM tool, and some of the reports from its dashboard are useful for budgeting purposes and for getting a big picture of your finances.
Finally, processing credit cards and having a credit card authorization form can prove useful.
NOTE The section Managing Your Finances in this chapter is also applicable to the entrepreneur career path.
Entrepreneurship is more complex than managing a consultancy, as you want to scale, aim at higher revenues, and have an exit strategy. Raising venture capital (VC) money was popular in the late ‘90s but has become both more difficult and less necessary. The time you will spend chasing funding is substantial, like a six-month full-time job with no guarantee of success. We did it in 2006, raising $6 million to create a company that would score all Internet clicks, keywords, or impressions, in order to detect fake clicks and bad publishers, and also to help advertisers determine a bidding price for Google keywords with little or no history (the long tail). It lasted a few years and was fun. The one mistake we made was having a bunch of investors who, at one point, started to fight against each other.
VC money is less necessary now because you can launch a new company with little expense, especially if you purposely choose a business model that can easily be scaled and deployed using a lean approach. For instance, our Data Science Central platform relies on Ning.com. We don't have engineers, web developers, sys admins, and security gurus—all these functions are outsourced to Ning. We did not buy any servers: we are co-hosted on Ning for less than $200/month. The drawback is that you have more competitors as such barriers evaporate.
If you really need money for small projects associated with your business, crowdfunding could be a solution. We tried to raise money to pay a professor to write the training material for our data science apprenticeship but were not able to sign up on the crowdfunding platforms. If we do raise money through crowdfunding, we will do it entirely ourselves without using brokers, by advertising our offer to our mailing list.
Another issue with raising money is that you will have to share your original idea with investors and their agents (analysts doing due diligence on your intellectual property—IP). Your intellectual property might end up being stolen in the process. One way to make your IP theft-proof is to make it highly public so that it is not patentable. I call this an open source patent. If you do so (and it definitely fits with the idea of having a lean business with little cost), you need to make sure to control your market; otherwise, competitors with better funding will market your same product better than you can do it. If you own your market—that is, you own the best, largest opt-in mailing list of potential clients—anyone who wants to advertise the product in question must do it through you. It is then easy for you to turn down competitors. Owning your market is also a big component of a lean business because it reduces your marketing costs to almost zero.
Finally, you can use cheap services such as MyCorporation.com to incorporate (also a great lean business principle) and not have a business plan (another lean business idea), as long as you are highly focused yet with a clear vision, and don't get distracted by all sorts of new ideas and side projects. If you are a creator, associate yourself with a pragmatic partner. And think bottom line every day.
In my case, as a data scientist, I generate leads for marketers. A good-quality lead is worth $40. The cost associated with producing a lead is $10. It requires data science to efficiently generate a large number of highly relevant leads (by detecting and purchasing the right traffic, optimizing organic growth, and using other analytic techniques). If I can't generate at least 10,000 leads a year, nobody will buy due to low volume. If my leads don't convert to actual revenue and produce ROI for the client, nobody will buy.
Also, because of data science, I can sell leads for a lower price than competitors—much less than $40. For instance, our newsletter open rate went from 8 percent to 24 percent, significantly boosting revenue and lowering costs. We also reduced churn to a point where we actually grow, all this because of data science. Among the techniques used to achieve these results are:
For those interested in becoming a data science entrepreneur, here are a few ideas. Hopefully they will also spark other ideas for you.
Create a platform allowing users to enter R commands on Chrome or Firefox in a browser-embedded console. Wondering how easy it would be to run R from a browser on your iPad? I'm not sure how you would import data files, but I suppose R offers the possibility to open a file located on a web or FTP server, rather than a local file stored on your desktop. Or does it? Also, it would be cool to have Python in a browser.
See the example at http://www.datasciencecentral.com/profiles/blogs/r-in-your-browser.
This could be a great opportunity for mathematicians and data scientists: creating a startup that offers encrypted e-mail that no government or entity could ever decrypt, and offering safe solutions to corporations who don't want their secrets stolen by competitors, criminals, or the government.
Here's an example of an e-mail platform:
TECHNICAL NOTE
The encryption algorithm which adds semi-random text to your message prior to encryption, has an encrypted timestamp, and won't work if no semi-random text is added first, It is such that (i) the message cannot be decrypted after 48 hours (if the encrypted version is intercepted) because a self-destruction mechanism is embedded into the encrypted message and the executable file itself, and (ii) if you encrypt the same message twice (even an empty message or one consisting of just one character), the two encrypted versions will be very different (of random length and at least 1 KB in size) to make reverse engineering next to impossible.
Indeed, the government could create such an app and disguise it as a private enterprise: it would in this case be a honeypot app. Some people worry that the government is tracking everyone and that you could get in trouble (your Internet connection shut down and bank account frozen) because you posted stuff that the government algorithms deem extremely dangerous, maybe a comment about pressure cookers. At the same time, I believe the threat is somewhat exaggerated.
Anyone interested in building this encryption app? Note that no system is perfectly safe. If there's an invisible camera behind you, filming everything you do on your computer, then my system offers no protection for you—though it would still be safe for the recipient, unless he also has a camera tracking all his computer activity. But the link between you and the recipient (the fact that both of you are connected) would be invisible to any third party. And increased security can be achieved if you use the web app from an anonymous computer—maybe from a public computer in some hotel lobby.
Averaging numbers is easy, but what about averaging text? I ask the question because the Oxford dictionary recently added a new term: big data.
So how do you define big data? Do you pick up a few experts (how do you select them?) and ask them to create the definition? What about asking thousands of practitioners (crowdsourcing) and somehow average their responses? How would you proceed to automatically average their opinions, possibly after clustering their responses into a small number of groups? It looks like this would require
On a different note, have you ever heard of software that automatically summarizes text? This would be a fantastic tool!
What if the traditional login/password to your favorite websites was replaced by a web app asking you to talk for one second, to get your picture taken from your laptop camera, or to check your fingerprint on your laptop touch screen? Your one-second voice message or image of your fingerprint could then be encoded using a combination of several metrics and stored in a database for further login validation.
The idea is to replace the traditional login/password (not a secure way to connect to a website) with more secure technology, which would apply everywhere on any computer, laptop, cell phone, or device where this ID checking app would be installed.
Also, the old-fashioned login/password system could still coexist as an option. The aim of the new system would be to allow anyone to log on with just one click, from any location/device on any website, without having to remember a bunch of (often) weak passwords. This methodology requires large bandwidth and considerable computer resources.
It should handle ten questions, each with multiple choices, and display the results in real time on a world map, using colors. Geolocation of each respondent is detected in real time based on an IP address. The color displayed (for each respondent) would represent the answer to a question. For instance, male=red, female=yellow, and other=green. I used to work with Vizu; the poll service was available as a web widget, offering all these features for free, but it's now gone. It would be great if this type of poll/survey tool could handle 2,000 actual responses, and if you could zoom in on the map.
Here's a new idea for Google to make money and cover the costs of processing or filtering billions of messages per day. This is a solution to eliminate spam as well, without as many false positives as currently. The solution would be for Google to create its own newsletter management system. Or at least, Google would work with major providers (VerticalResponse, Constant Contact, iContact, MailChimp, and so on) to allow their clients (the companies sending billions of messages each day, such as LinkedIn) to pay a fee based on volume. The fee would help the sender to not end up in a Gmail spam box, as long as it complies with Google policies. Even better: let Google offer this newsletter management service directly to clients who want to reach Gmail more effectively, under Google's controls and conditions.
I believe Google is now in a position to offer this service because more than 50 percent of new personal e-mail accounts currently created are Gmail, and they last much longer than any corporate e-mail accounts. (You don't lose your Gmail account when you lose your job.) Google could reasonably charge $100 per 20,000 messages sent to Gmail accounts: the potential revenue is huge.
If Google would offer this service internally (rather than through a third party such as Constant Contact), it would make more money and have more control, and the task of eliminating spam would be easier and less costly.
Currently, because Google offers none of these services, you will have to face the following issues:
Newsletters are delivered too quickly: 100,000 messages are typically delivered in five minutes by newsletter management companies. If Gmail were delivering these newsletters via its own system (say, Gmail Contact), then it could deliver more slowly, and thus do a better job at controlling spam without creating tons of false positives.
In the meanwhile, a solution for companies regularly sending newsletters to a large number of subscribers is to do the following:
Problem:
The cost of healthcare in the United States is extremely high and volatile. Even standard procedures have high price volatility. If you ask how much your hospital will charge for what is typically a $5,000 or more procedure by explicitly asking how much it charged for the last five patients, it will have no answer (as if it doesn't keep financial records).
Would you buy a car if the car dealer has no idea how much it will charge you, until two months after the purchase? If you did, and then you hear your neighbor got exactly the same car from a different dealer for one-half the price, what would you do? Any other type of business would quickly go bankrupt if using such business practices.
Causes:
An Internet start-up offering prices for 20 top medical procedures, across 5,000 hospitals and 20 patient segments. Prices would come from patients filling a web form or sharing their invoice details, and partners such as cost-efficient hospitals willing to attract new patients. Statistical inferences would be made to estimate all these 20×20×500=200,000 prices (and their volatility) every month based on maybe as little as 8,000 data points. The statistical engine would be similar to Zillow (estimating the value of all houses based on a limited volume of sales data), but it would also have a patient review component (like Yelp), together with fake review detection.
Revenue would be coming from hospital partners, from selling data to agencies, from advertising to pharmaceutical companies, and from membership fees (premium members having access to more granular data).
Checks sent by e-mail, rather than snail mail—the idea works as follows:
Note that this start-up could charge the payee 20 cents per check (it's less than 50 percent of the cost of a stamp) and have a higher successful delivery rate than snail mail, which is easy to beat.
The analytic part of this business model is in the security component to prevent fraud from happening. To a lesser extent, there are also analytics involved to guarantee that e-mail is delivered to the payee at least 99.99 percent of the time. Information must travel safely through the Internet, and in particular, the e-mail sent to payees must contain encrypted information in the URL used to print the check.
PayPal already offers this type of functionality (online payments), but not the check printing feature. For some people or some businesses, a physical check is still necessary. I believe checks will eventually disappear, so I'm not sure if such a start-up would have a bright future.
Several digital currencies already exist and are currently being used (think about PayPal and Bitcoin), but as far as I know, they are not anonymous.
An anonymous digital currency would be available as encrypted numbers that can be used only once; it would not be tied to any paper currency or e-mail address. Initially, it could be used between partners interested in bartering. (For example, I purchase advertising on your website in exchange for you helping me with web design—no U.S. dollars exchanged). The big advantages are:
What are the challenges about designing this type of currency? The biggest ones are:
Based on tweets and blog posts, identify new scams before they become widespread.
Is such a public system already in place? Sure, the FBI must have one, but it's not shared with the public. Also, such a scam alert system is quite similar to systems based on crowd sourcing to detect diseases spreading around the world. The spreading mechanism in both cases is similar: scam/disease agents use viruses to spread and contaminate. In the case of scams, computer viruses infect computers using Botnets and turn them into spam machines (zombies) to send scam e-mail to millions of recipients.
Will analytics help amusement parks survive the 21st century? My recent experience at Disneyland in California makes me think that a lot of simple things can be done to improve revenue and user experience. Here are a few starting points:
When you book five nights in a hotel, usually each night (for the same room) will have a different price. The price is based on a number of factors, including these:
I am wondering which other metrics are included in the pricing models. For instance, do they include weather forecasts, special events, price elasticity? Is the price user-specific (for example, based on IP addresses) or based on how the purchase is made (online, over the phone, using an American Express card versus a Visa card), how long in advance the purchase is made, and so on?
For instance, a purchase performed long in advance results in lower prices, but it is because inventory is high. So you might be fine by just using inventory available.
Finally, how often are these pricing models updated, and how is model performance assessed? Is the price changing in real time?
What are your chances of being audited? And how can data science help answer this question, for each of us individually?
Some factors increase the chance of an audit, including
Companies such as Deloitte and KPMG probably compute tax audit risks quite accurately (including the penalties in case of an audit) for their large clients because they have access to large internal databases of audited and non-audited clients.
But for you and me, how could data science help predict our risk of an audit? Is there data publicly available to build a predictive model? A good start-up idea would be to create a website to help you compute your risk of tax audit. Its risk scoring engine would be based on the following three pieces:
Indeed, this would amount to reverse-engineering the IRS algorithms.
Which metrics should be used to assess tax audit risk? And how accurate can the prediction be? Could such a system attain 99.5 percent accuracy — that is, wrong predictions for only 0.5 percent of taxpayers? Right now, if you tell someone “You won't be audited this year,” you are correct 98 percent of the time. More interesting, what level of accuracy can be achieved for higher-risk taxpayers?
Finally, if your model predicts both the risk of audit and the penalty if audited, then you can make smarter decisions and decide which risks you can take and what to avoid. This is pure decision science to recoup dollars from the IRS, not via exploiting tax law loopholes (dangerous), but via honest, fair, and smart analytics—outsmarting the IRS data science algorithms rather than outsmarting tax laws.
This chapter explored how to become a data scientist, from university degrees currently available, certificate training programs, and the online apprenticeship provided by Data Science Central.
Different types of data science career paths were also discussed, including entrepreneur, consultant, individual contributor, and leader. A few ideas for data science start-ups were also provided, in case you want to become an entrepreneur.
The next chapter describes selected data science techniques in detail without introducing unnecessary mathematics. It includes information on visualization, statistical techniques, and metrics relevant to big data and new types of data.