CHAPTER
3

Becoming a Data Scientist

Because big data and data science are here to stay, this chapter explores key features of data scientists, types of data scientists, and how to become a data scientist, including training programs available and the different types of data scientist career paths.

Key Features of Data Scientists

There are a few key features of data scientists you may have already noticed. These key features are discussed in this section, along with the type of expertise they should have or acquire, and why horizontal knowledge is important. Finally, statistics are presented on the demographics of data scientists.

Data Scientist Roles

Data scientists are not statisticians, nor data analysts, nor computer scientists, nor software engineers, nor business analysts. They have some knowledge in each of these areas but also some outside of these areas.

One of the reasons why the gap between statisticians and data scientists grew over the last 15 years is that academic statisticians, who publish theoretical articles (sometimes not based on data analysis) and train statisticians, are… not statisticians anymore. Also, many statisticians think that data science is about analyzing data, but it is more than that. It also involves implementing algorithms that process data automatically to provide automated predictions and actions, for example:

  • Automated bidding systems
  • Estimating (in real time) the value of all houses in the United States (Zillow.com)
  • High-frequency trading
  • Matching a Google ad with a user and a web page to maximize chances of conversion
  • Returning highly relevant results from any Google search
  • Book and friend recommendations on Amazon.com or Facebook
  • Tax fraud detection and detection of terrorism
  • Scoring all credit card transactions (fraud detection)
  • Computational chemistry to simulate new molecules for cancer treatment
  • Early detection of an epidemic
  • Analyzing NASA pictures to find new planets or asteroids
  • Weather forecasts
  • Automated piloting (planes and cars)
  • Client-customized pricing system (in real time) for all hotel rooms

All this involves both statistical science and terabytes of data. People doing this stuff do not call themselves statisticians, in general. They call themselves data scientists. Over time, as statisticians catch up with big data and modern applications, the gap between data science and statistics will shrink.

Also, data scientists are not statisticians because statisticians know a lot of things that are not needed in the data scientist's job: generalized inverse matrices, eigenvalues, stochastic differential equations, and so on. When data scientists do use techniques that rely on this type of mathematics, it is usually via high-level tools (software), where the mathematics is used as a black box — just like a software engineer who no longer codes in assembly language. Conversely, data scientists need new statistical knowledge, typically data-driven, model-free, robust techniques that apply to modern, large, fast-flowing, sometimes unstructured data. This also includes structuring unstructured data and becoming an expert in taxonomies, NLP (natural language processing, also known as text mining), and tag management systems.

Data scientists are not computer scientists either — first, because they don't need the entire theoretical knowledge computer scientists have, and second, because they need to have a much better understanding of random processes, experimental design, and sampling, typically areas in which statisticians are expert. Yet data scientists need to be familiar with computational complexity, algorithm design, distributed architecture, and programming (R, SQL, NoSQL, Python, Java, and C++). Independent data scientists can use Perl instead of Python, though Python has become the standard scripting language. Data scientists developing production code and working in teams need to be familiar with the software development life cycle and lean architectures.

Data scientists also need to be domain experts in one or two applied domains (for instance, the author of this book is an expert in both digital analytics and fraud detection), have success stories to share (with metrics used to quantify success), have strong business acumen, and be able to assess the ROI that data science solutions bring to their clients or their boss. Many of these skills can be acquired in a short time period if you already have several years of industry experience and training in a lateral domain (operations research, applied statistics working on large data sets, computer science, engineering, or an MBA with a strong analytic component).

Thus, data scientists also need to be good communicators to understand, and many times guess, what problems their client, boss, or executive management is trying to solve. Translating high-level English into simple, efficient, scalable, replicable, robust, flexible, platform-independent solutions is critical.

Finally, the term analyst is used for a more junior role, for people who typically analyze data but do not create systems or architectures that automatically analyze and process data and take automated actions based on automatically detected patterns or insights.

To summarize, data science = Some (computer science) + Some (statistical science) + Some (business management) + Some (software engineering) + Domain expertise + New (statistical science), where

  • Some () means the entire field is not part of data science.
  • New () means new stuff from the field in question is needed.

Horizontal Versus Vertical Data Scientist

Vertical data scientists have deep technical knowledge in some narrow field. For instance, they might be any of the following:

  • Computer scientists familiar with computational complexity of all sorting algorithms
  • Statisticians who know everything about eigenvalues, singular value decomposition and its numerical stability, and asymptotic convergence of maximum pseudo-likelihood estimators
  • Software engineers with years of experience writing Python code (including graphic libraries) applied to API development and web crawling technology
  • Database specialists with strong data modeling, data warehousing, graph database, Hadoop, and NoSQL expertise
  • Predictive modelers with expertise in Bayesian networks, SAS, and SVM

The key here is that by “vertical data scientist” I mean those with a more narrow range of technical skills, such as expertise in all sorts of Lasso-related regressions but with limited knowledge of time series, much less of any computer science. However, deep domain expertise is absolutely necessary to succeed as a data scientist. In some fields, such as stock market arbitrage, online advertising arbitrage, or clinical trials, a lack of understanding of very complex business models, ecosystems, cycles and bubbles, or complex regulations is a guarantee of failure. In fraud detection, you need to keep up with what new tricks criminals throw at you (see the section on fraud detection in Chapter 6 for details).

Vertical data scientists are the by-product of a university system that trains a person to become a computer scientist, a statistician, an operations researcher, or an MBA — but not all four at the same time.

In contrast, horizontal data scientists are a blend of business analysts, statisticians, computer scientists, and domain experts. They combine vision with technical knowledge. They might not be experts in eigenvalues, generalized linear models, and other semi-obsolete statistical techniques, but they know about more modern, data-driven techniques applicable to unstructured, streaming, and big data, such as the simple and applied first Analyticbridge theorem to build confidence intervals. They can design robust, efficient, simple, replicable, and scalable code and algorithms.

So by “horizontal data scientist,” I mean that you need cross-discipline knowledge, including some of computer science, statistics, databases, machine learning, Python, and, of course, domain expertise. This is technical knowledge that a typical statistician usually lacks.

Horizontal data scientists also have the following characteristics:

  • They have some familiarity with Six Sigma concepts (80/20 rule in particular), even if they don't know the terminology. In essence, speed is more important than perfection for these analytic practitioners.
  • They have experience in producing success stories out of large, complicated, messy data sets, including measuring the success.
  • They have experience in identifying the real problem to be solved, the data sets (external and internal) needed, the database structures needed, and the metrics needed, rather than being passive consumers of data sets produced or gathered by third parties lacking the skills to collect or create the right data.
  • They know rules of thumb and pitfalls to avoid, more than theoretical concepts. However, they have a bit more than just basic knowledge of computational complexity, good sampling and design of experiment, robust statistics and cross-validation, modern database design, and programming languages (R, scripting languages, MapReduce concepts, and SQL).
  • They have advanced Excel and visualization skills.
  • They can help produce useful dashboards (the ones that people use on a daily basis to make decisions) or alternative tools to communicate insights found in data (orally, by e-mail, or automatically and sometimes in real-time machine-to-machine mode).
  • They think outside the box. For instance, when they create a recommendation engine, they know that it will be gamed by spammers and competing users; thus they put an efficient mechanism in place to detect fake reviews.
  • They are innovators who create truly useful stuff. Ironically, this can scare away potential employers, who, despite claims to the contrary and for obvious reasons, prefer the good soldier to the disruptive creator.

Another part of the data scientist's job is to participate in the database design and data gathering process, and to identify metrics and external data sources useful to maximize value discovery. This includes the ability to distinguish between the deeply buried signal and huge amounts of noise present in any typical big data set. For more details, read the section The Curse of Big Data in Chapter 2, which describes correlogram matching versus correlation comparisons to detect true signal when comparing millions of time series. Also see the section on predictive power in Chapter 6, which discusses a new type of entropy used to identify features that capture signal over those that are "signal-blind."

Data scientists also need to be experts in robust cross-validation, confidence intervals, and variance reduction techniques to improve accuracy or at least measure and quantify uncertainty. Even numerical analysis can be useful, especially for those involved in data science research. This and the importance of business expertise become clearer in the section on stock trading in Chapter 6.
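For illustration only, here is a minimal sketch of how cross-validation can be used to quantify that uncertainty. It assumes the scikit-learn package is installed and uses a synthetic data set standing in for a real project; the model, the number of folds, and the data are arbitrary choices, not a prescription.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real project data set
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 10-fold cross-validation: each fold yields an out-of-sample accuracy estimate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)

# Rough 95 percent confidence interval for the accuracy, based on the fold scores
mean = scores.mean()
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print(f"accuracy = {mean:.3f} +/- {half_width:.3f}")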

Finally, there are three main types of analytics that data scientists deal with: descriptive, predictive, and prescriptive. Each type brings business value to the organization in a different way. I would also add “automated analytics” and “root cause analyses” to this list, although they overlap with predictive and prescriptive analytics in some ways.

The broad yet deep knowledge required of a true horizontal data scientist is one of the reasons why recruiters can't find them (so they find and recruit mostly vertical data scientists). Companies are not yet familiar with identifying horizontal data scientists—the true money makers and ROI generators among analytics professionals. The reasons are twofold:

  • Untrained recruiters quickly notice that horizontal data scientists lack some of the traditional knowledge that a true computer scientist, statistician, or MBA must have—eliminating horizontal data scientists from the pool of applicants. You need a recruiter familiar with software engineers, business analysts, statisticians, and computer scientists, who can identify qualities not summarized by typical résumé keywords and identify which skills are critical and which ones can be overlooked to detect these pure gems.
  • Horizontal data scientists, faced with the prospect of few job opportunities, and having the real knowledge to generate significant ROI, end up creating their own startups, working independently, and sometimes competing directly against the companies that are in need of real (supposedly rare) data scientists. After having failed more than once to get a job interview with Microsoft, eBay, Amazon.com, or Google, they never apply again, further reducing the pool of qualified talent.

EXERCISE

Can you name a few horizontal data scientists? D.J. Patil, previously a chief data scientist at LinkedIn, is an example of a horizontal data scientist. See if you can research and find a few others.

Types of Data Scientists

There are several different types of data scientists, including fake, self-made, amateur, and extreme. The following sections discuss each of these types of data scientists and include examples of amateur and extreme data science.

Fake Data Scientist

Fake data science, and what a fake data scientist is, was discussed previously: a professional with narrow expertise, possibly because she is at the beginning of her career, who should not be called a data scientist. In the previous section, vertical data scientists were considered not to be real data scientists. Likewise, a statistician with R and SQL programming skills who has never processed a 10-million-row data set is a statistician, not a data scientist. A few books have been written with titles that emphasize data science while being nothing more than relabeled statistics books.

Self-Made Data Scientist

A self-made data scientist is someone without the formal training (a degree in data science) but with all the knowledge and expertise to qualify as a data scientist. Currently, most data scientists are self-made because few programs exist that produce real data scientists. However, over time they will become a minority and even disappear. The author of this book is a self-made data scientist, formally educated in computational statistics at the PhD level. Just as business executives existed long before MBA programs were created, data scientists existed before official data science programs were created. Some of them are the professionals who are:

  • Putting together the foundations of data science
  • Helping businesses define their data science needs and strategies
  • Helping universities create and tailor programs
  • Launching startups to solve data science problems
  • Developing software and services (Data Science or Analytics as a Service)
  • Creating independent training, certifications, or an apprenticeship in data science (including privately funded data science research)

Amateur Data Scientist

Another kind of data scientist worth mentioning is the amateur data scientist. The amateur data scientist is the equivalent of the amateur astronomer: she is well equipped, with the right tools (such as R, a copy of SAS, or RapidMiner), has access to large data sets, is educated, and has real expertise. She earns money by participating in data science competitions, such as those published on Kaggle, CrowdANALYTIX, and Data Science Central, or by freelancing.

Example of Amateur Data Science

With so much data available for free everywhere, and so many open tools, a new kind of analytics practitioner is emerging: the amateur data scientist. Just like the amateur astronomer, the amateur data scientist will significantly contribute to the art and science, and will eventually solve mysteries. Could someone like the Boston Marathon bomber be found because of thousands of amateurs analyzing publicly available data (images, videos, tweets, and so on) with open source tools? After all, amateur astronomers have been detecting exo-planets and much more.

Also, just like the amateur astronomer needs only one expensive tool (a good telescope with data recording capabilities), the amateur data scientist needs only one expensive tool (a good laptop and possibly a subscription to some cloud storage/computing services).

Amateur data scientists might earn money from winning Kaggle contests, working on problems such as identifying a Botnet, explaining the stock market flash crash, defeating Google page ranking algorithms, helping find new complex molecules to fight cancer (analytical chemistry), and predicting solar flares and their intensity. Interested in becoming an amateur data scientist? The next section gives you a first project to get started.

Example of an Amateur Data Science Project

Question: Do large meteors cause multiple small craters or a big one? If meteors usually break up into multiple fragments, or approach the solar system already broken down into several pieces, they might be less dangerous than if they hit with a single, huge punch. That's the idea, although I'm not sure whether this assumption is correct. Even if the opposite is true, it is still worth asking the question about the frequency of binary impact craters.

Eventually, knowing that meteorites arrive in pieces rather than intact could change government policies and priorities, and maybe allow us to stop spending money on projects to detect and blow up meteors (or the other way around).

So how would you go about estimating the chance that a large meteor (hitting Earth) creates multiple small impacts, and how many impacts on average? One idea consists of looking at moon craters and checking how many of them are aligned. Yet what causes meteors to explode before hitting the ground (and thus create multiple craters) is Earth's thick atmosphere, which the moon lacks, so the moon would not provide good data. And Earth's crust is so geologically active that all crater traces disappear after a few million years. Maybe Venus would be a good source of data? No, even worse than Earth. Maybe Mars? No, it's just like the moon. Maybe some moons of Jupiter or Saturn would be great candidates.

After a data source is identified and the questions answered, deeper questions can be asked, such as, When you see a binary crater (two craters, same meteor), what is the average distance between the two craters? This can also help better assess population risks and how many billion dollars NASA should spend on meteor tracking programs.

In any case, as a starter, I did a bit of research and found data at http://www.stecf.org/~ralbrech/amico/intabs/koeberlc.html, along with a map showing impact craters on Earth.

Visually, with the naked eye, it looks like multiple impacts (for example, binary craters) and crater alignment are the norm, not the exception. But the brain can be lousy at detecting probabilities, so a statistical analysis is needed. Note that the first step consists of processing the image to detect craters and extract coordinates, using some software or writing your own code. This is still something a good amateur data scientist could do; you can find the right tools on the Internet.
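As a starting point, here is a minimal sketch of that first step, assuming the OpenCV (cv2) and NumPy packages are installed and that craters.png is a grayscale surface image you supply yourself; it detects circular, crater-like features and extracts their coordinates, the raw material for any alignment or proximity analysis. The detection parameters are guesses to be tuned on real images.

import cv2
import numpy as np

# Load and denoise the surface image (craters.png is a placeholder file name)
img = cv2.imread("craters.png", cv2.IMREAD_GRAYSCALE)
img = cv2.medianBlur(img, 5)

# Detect circular features; parameters (dp, minDist, param1, param2, radii) need tuning
circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, 1.2, 20,
                           param1=100, param2=40, minRadius=5, maxRadius=80)

if circles is not None:
    coords = np.round(circles[0]).astype(int)   # rows of (x, y, radius)
    print(f"{len(coords)} candidate craters detected")
    # Next step: test whether pairs or triplets of centers are aligned or unusually
    # close together, compared with what purely random placement would produce.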

Extreme Data Scientist

Another interesting type of data scientist is the extreme data scientist: Just like extreme mountain climbing or extreme programming, extreme data science means developing powerful, robust predictive solutions without any statistical models.

Example of Extreme Data Science

Extreme data science is a way of doing data science to deliver high value quickly. It can best be compared to extreme mountain climbing, where individuals manage to climb challenging peaks in uncharted territory, in a minimum amount of time, with a small team and small budget, sometimes successfully using controversial climbing routes or techniques. For instance, Reinhold Messner climbed Mount Everest alone, in three days, with no oxygen and no Sherpa — yes, he really did it!

Extreme data science borrows from lean six sigma, agile development and analytics, extreme programming, and in many cases advanced statistical analyses on big data performed without statistical models, yet providing superior ROI. (Part of it is due to saving money that would otherwise have been spent on a large team of expensive statisticians.)

Just like extreme mountain climbing, it is not for the faint-hearted: few statisticians can design sound predictive techniques and confidence intervals without mathematical modeling and heavy artillery such as general linear models.

The only data scientists who can succeed with extreme data science are those with good intuition, good judgment, and vision. They combine deep analytics and advanced statistical knowledge with consistently correct gut feelings and the ability to quickly drill down to the essence of any problem. They can offer a better solution to business problems, in one month of work, than a team of "regular," model-oriented data scientists working six months on the same project. Better also means simpler, more scalable, and more robust. These data scientists tend to be polyvalent. Sometimes their success comes from extensive automation of semi-mundane tasks.

One of the problems today is that many so-called data scientists do extreme data science without knowing it. But they don't have the ability to successfully do it, and the results are miserable as they introduce biases and errors at every step when designing and implementing algorithms. Job interview questions don't test the ability to work as an extreme data scientist. Result? If you had 50 respected professional mountain climbers, climbing Everest in winter with no Sherpa, no oxygen, and little money, 48 would fail (not reach the summit). However, 25 would achieve a few milestones (5 almost succeeding), 5 would die, 10 would refuse to do it, 8 would fail miserably, and 2 would win. Expect the same with extreme data science.

But those who win will do better than anyone equipped with dozens of Sherpas, several million dollars, and all the bells and whistles. Note that because (by design) the extreme data scientists work on small teams, they must be polyvalent — just like Reinhold Messner in his extreme mountain climbing expeditions, where he is a mountain climber, a medical doctor, a cameraman, a meteorologist, and so on, all at once.

Data Scientist Demographics

The information discussed in this section is based on web traffic statistics and demographics for the most popular data science websites, based on Quantcast.com data, as well as on internal traffic statistics and other sources.

The United States leads in the number of visitors to data science websites. Typically, visitors from India are the second most frequent, yet are approximately four to five times less frequent than those from the United States. U.S. visitors tend to be in more senior and management roles, whereas visitors from India are more junior, entry-level analysts, with fewer executives. A few places that have experienced recent growth in data science professionals include Ireland, Singapore, and London.

Good sources of such data include:

  • Quantcast.com. Data about gender, income, education level, and race is produced using demographic information gathered by ZIP codes. ZIP code data is derived from IP addresses.
  • User information from ISPs (Internet service providers). Several big ISPs sell information about user accounts, broken down by IP address, to Quantcast. Then the Quantcast data science engine performs statistical inferences based on this sample ISP data.
  • Surveys or other sources — for example, when you install some toolbar on your browser, allowing your web browsing activity to be monitored to produce summary statistics. (This methodology is subject to big biases, as people installing toolbars are different from those who don't.) This was Alexa.com's favorite way to produce the traffic statistics a while back.

Data science websites attract highly educated, wealthy males, predominantly of Asian origin, living mostly in the United States.

Training for Data Science

A few universities and other organizations have started to offer data science degrees, training, and certificates. The following sections discuss a sampling of these.

University Programs

Following is a list of some of the U.S. universities offering data science and data scientist programs. Included are descriptions of their programs as listed on their websites.

  • University of Washington (Seattle, WA): Develop the computer science, mathematics, and analytical skills needed to enter the field of data science. Use data science techniques to analyze and extract meaning from extremely large data sets, or big data. Become familiar with relational and nonrelational databases. Apply statistics, machine learning, text retrieval, and natural language processing to analyze data and interpret results. Learn to apply data science in fields such as marketing, business intelligence, scientific research, and more. (http://www.pce.uw.edu/certificates/data-science.html)
  • Northwestern University (Evanston, IL): As businesses seek to maximize the value of vast new stores of available data, Northwestern University's Master of Science degree in predictive analytics program prepares students to meet the growing demand in virtually every industry for data-driven leadership and problem solving. Advanced data analysis, predictive modeling, computer-based data mining, and marketing, web, text, and risk analytics are just some of the areas of study offered in the program. (http://www.scs.northwestern.edu/info/predictive-analytics.php)
  • UC Berkeley (Berkeley, CA): The University of California at Berkeley School of Information offers the only professional Master of Information and Data Science (MIDS) degree delivered fully online. This exciting program offers the following:
    • A multidisciplinary curriculum that prepares you to solve real-world problems using complex and unstructured data.
    • A web-based learning environment that blends live, face-to-face classes with interactive, online course work designed and delivered by UC Berkeley faculty.
    • A project-based approach to learning that integrates the latest tools and methods for identifying patterns, gaining insights from data, and effectively communicating those findings.
    • Full access to the I School community including personalized technical and academic support.
    • The chance to build connections in the Bay Area—the heart of the data science revolution—and through the UC Berkeley alumni network. (http://requestinfo.datascience.berkeley.edu/index.html)
  • CUNY (New York, NY): The Online Master's degree in data analytics (M.S.) prepares graduates to make sense of real-world phenomena and everyday activities by synthesizing and mining big data with the intention of uncovering patterns, relationships, and trends. Big data has emerged as the driving force behind critical business decisions. Advances in the ability to collect, store, and process different kinds of data from traditionally unconnected sources enables you to answer complex, data-driven questions in ways that have never been possible before. (http://sps.cuny.edu/programs/ms_dataanalytics)
  • New York University (New York, NY): Its initiative is university-wide because data science has already started to revolutionize most areas of intellectual endeavor found at NYU and in the world. This revolution is just beginning. Data science is becoming a necessary tool to answer some of the big scientific questions and technological challenges of our times: How does the brain work? How can we build intelligent machines? How do we better find cures for diseases?

    Data science overlaps multiple traditional disciplines at NYU such as mathematics (pure and applied), statistics, computer science, and an increasingly large number of application domains. It also stands to impact a wide range of spheres—from healthcare to business to government—in which NYU's schools and departments are engaged. (http://datascience.nyu.edu/)

  • Columbia University (New York, NY): The Institute for Data Sciences and Engineering at Columbia University strives to be the single world-leading institution in research and education in the theory and practice of the emerging field of data science broadly defined. Equally important in this mission is supporting and encouraging entrepreneurial ventures emerging from the research the Institute conducts. To accomplish this goal, the Institute seeks to forge closer relationships between faculty already at the University, to hire new faculty, to attract interdisciplinary graduate students interested in problems relating to big data, and to build strong and mutually beneficial relationships with industry partners. The Institute seeks to attract external funding from both federal and industrial sources to support its research and educational mission. (http://idse.columbia.edu/)
  • Stanford University (Palo Alto, CA): With the rise of user-web interaction and networking, as well as technological advances in processing power and storage capability, the demand for effective and sophisticated knowledge discovery techniques has grown exponentially. Businesses need to transform large quantities of information into intelligence that can be used to make smart business decisions.

    With the Mining Massive Data Sets graduate certificate, you will master efficient, powerful techniques and algorithms for extracting information from large datasets such as the web, social-network graphs, and large document repositories. Take your career to the next level with skills that will give your company the power to gain a competitive advantage.

    The Data Mining and Applications graduate certificate introduces many of the important new ideas in data mining and machine learning, explains them in a statistical framework and describes some of their applications to business, science, and technology. (http://scpd.stanford.edu/public/category/courseCategoryCertificateProfile.do?method=load&certificateId=10555807)

    (http://scpd.stanford.edu/public/category/courseCategoryCertificateProfile.do?method=load&certificateId=1209602)

  • North Carolina State University (Raleigh, NC): If you have a mind for mathematics and statistical programming, and a passion for working with data to solve challenging problems, this is your program. The MSA is uniquely designed to equip individuals like you for the task of deriving and effectively communicating actionable insights from a vast quantity and variety of data—in 10 months. (http://analytics.ncsu.edu/?page_id=1799)

Many other institutions have strong analytics programs, including Carnegie Mellon, Harvard, MIT, Georgia Tech, and Wharton (Customer Analytics Initiative). For a map of academic data science programs, visit the following websites:

A typical academic program at universities such as the ones listed here features the following courses:

  • Introduction to Data (data types, data movement, terminology, and so on)
  • Storage and Concurrency Preliminaries
  • Files and File-Based Data Systems
  • Relational Database Management Systems
  • Hadoop Introduction
  • NoSQL — MapReduce Versus Parallel RDBMS
  • Search and Text Analysis
  • Entity Resolution
  • Inferential Statistics
  • Gaussian Distributions, Other Distributions, and The Central Limit Theorem
  • Testing and Experimental Design
  • Bayesian Versus Classical Statistics
  • Probabilistic Interpretation of Linear Regression and Maximum Likelihood
  • Graph Algorithms
  • Raw Data to Inference Model
  • Motivation and Applications of Machine Learning
  • Supervised Learning
  • Linear and Non-Linear Learning Models
  • Classification, Clustering, and Dimensionality Reduction
  • Advanced Non-Linear Models
  • Collaborative Filtering and Recommendation
  • Models That are Robust
  • Data Sciences with Text and Language
  • Data Sciences with Location
  • Social Network Analysis

Corporate and Association Training Programs

A few other types of organizations — both private companies and professional organizations — offer certifications or training. Here are a few of them:

  • INFORMS (Operations Research Society): Analytics certificate
  • Digital Analytics Association: certificate
  • TDWI (The Data Warehousing Institute): Courses; focus is on database architecture
  • American Statistical Association: Chartered statistician certificate
  • Data Science Central: Data science apprenticeship
  • International Institute for Analytics: More like a think tank. Founded by the famous Tom Davenport (visiting Harvard Professor and one of the fathers of data science).
  • Statistics.com: Statistics courses

Fees range from below $100 for a certification with no exam, to a few thousand dollars for a full program.

It is also possible to get data science training at many professional conferences that focus on analytics, big data, or data science, such as the following:

  • Predictive Analytics World
  • GoPivotal Data Science
  • SAS data mining and analytics
  • ACM — Association for Computing Machinery
  • IEEE — Institute of Electrical and Electronics Engineers analytics/big data/data science
  • IE Group — Innovation Enterprise Group
  • Text Analytics News
  • IQPC — International Quality and Productivity Center
  • Whitehall Media

Vendors such as EMC, SAS, and Teradata also offer valuable training. Websites such as Kaggle.com enable you to participate in data science competitions and get access to real data, and sometimes the award winner is hired by a company such as Facebook or Netflix.

Free Training Programs

You can get some data science training without having to mortgage your house. Here are a couple of the top free training sites and programs.

Coursera.com

Coursera.com offers online training at no cost. The instructors are respected university professors from around the globe, so the material can sometimes feel a bit academic. Here are a few of the offerings, at the time of this writing:

  • Machine Learning (Stanford University)
  • Data Structures and Algorithms (Peking University)
  • Web Intelligence and Big Data (Indian Institute of Technology)
  • Introduction to Data Science (University of Washington)
  • Maps and Geospatial Revolution (Penn State)
  • Introduction to Databases (Stanford University, self-study)
  • Computing for Data Analysis (Johns Hopkins University)
  • Initiation à la Programmation (EPFL, Switzerland, in French)
  • Statistics One (Princeton University)

Data Science Central

The data science apprenticeship offered at https://www.datasciencecentral.com is online, on demand, self-service, and free. The curriculum contains the following components:

  • Training basics
  • Tutorials
  • Data sets
  • Real-life projects
  • Sample source code

The apprenticeship covers the following basic subjects:

  • How to download Python, Perl, Java, and R, get sample programs, get started with writing an efficient web crawler (a minimal crawler sketch appears after this list), and get started with Linux, Cygwin, and Excel (including logistic regression)
  • Hadoop, MapReduce, NoSQL, their limitations, and more modern technologies
  • How to find data sets or download large data sets for free on the web
  • How to analyze data, from understanding business requirements to maintaining an automated (machine talking to machine) web/database application in production mode: a 12-step process
  • How to develop your first “Analytics as a Service” application and scale it
  • Big data algorithms, and how to make them more efficient and more robust (application in computational optimization: how to efficiently test trillions of trillions of multivariate vectors to design good scores)
  • Basics about statistics, Monte Carlo, cross-validation, robustness, sampling, and design of experiments
  • Start-up ideas for analytic people
  • Basics of Perl, Python, real-time analytics, distributed architecture, and general programming practices
  • Data visualization, dashboards, and how to communicate like a management consultant
  • Tips for future consultants
  • Tips for future entrepreneurs
  • Rules of thumb, best practices, craftsmanship secrets, and why data science is also an art
  • Additional online resources
  • Lift and other metrics to measure success, metrics selection, use of external data, making data silos communicate via fuzzy merging, and statistical techniques

Special tutorials are included focusing on various data science topics such as big data, statistics, visualization, machine learning, and computer science. Several sample data sets are also provided for use in connection with the tutorials.

Students complete several real-life projects using what they learn in the apprenticeship. Examples of projects that are completed during the apprenticeship include the following:

  • Hacking and reverse-engineering projects.
  • Web crawling to browse Facebook or Twitter webpages, then extract and analyze harvested content to estimate the proportion of inactive or duplicate accounts on Facebook, or to categorize tweets.
  • Taxonomy creation or improving an existing taxonomy.
  • Optimal pricing for bid keywords on Google.
  • Web app creation to provide (in real time) better-than-average trading signals.
  • Identification of low-frequency and Botnet fraud cases in a sea of data.
  • Internship in computational marketing with a data science start-up.
  • Automated plagiarism detection.
  • Web crawling to assess whether Google Search favors its own products and services, such as Google Analytics, over competitors.

Sample source code is provided with the apprenticeship, along with links to external code such as Perl, Python, R, and C++, and occasionally XML, JavaScript, and other languages. Even “code” to perform logistic regression and predictive modeling with Excel is included, as is a system to develop automated news feeds.

Data Scientist Career Paths

This section discusses career paths for individuals interested in entrepreneurial initiatives. Resources for professionals interested in working for small or large employers and organizations are found in Chapter 8, “Data Science Resources.”

The Independent Consultant

One career option for the data scientist is to become an independent consultant, either full-time or part-time to complement traditional income (salary) with consulting revenue. Consultants typically have years of experience and deep domain expertise, are proficient with the relevant software and programming languages, and have developed sales and presentation skills. They know how to provide value, how to measure lift, and how to plan a consulting project. The first step is to write a proposal. Consultants can spend as much as 50 percent of their time chasing clients and doing marketing, especially if they charge well above $100/hour.

Finding Clients

Attending conferences, staying in touch with former bosses and colleagues, and building a strong network on LinkedIn (with recommendations and endorsements and a few thousand relevant connections) are critical and can take several months to build. To grow your online network, post relevant, noncommercial articles or contribute to forums (answer questions) on niche websites:

For instance, the LinkedIn group “Advanced Business Analytics, Data Mining and Predictive Modeling” has more than 100,000 members. It is the leading data science group, growing by 3,000 members per month, with many subgroups such as big data, visualization, operations research, analytic executives, training, and so on. (See www.linkedin.com/groups/Advanced-Business-Analytics-Data-Mining-35222.)

You can earn money and fame (and new clients) by participating in Kaggle contests (or in our own contests — for instance, see http://bit.ly/1arClAn), or being listed as a consultant on Kaggle. Its “best” consultants bill at $300 per hour (see http://bit.ly/12Qqqnm). However, you compete with thousands of other data scientists from all over the world, the projects tend to be rather small (a few thousand dollars each), and Kaggle takes a big commission on your earnings.

Managing Your Finances

I started without a business plan (15 years later, I still don't have one), no lawyer, no office (still don't have one), no employees, with my wife taking care of our Internet connection and accounting. Now we have a bookkeeper (40 hours per month) and an accountant; they cost about $10,000 per year, but they help with invoicing and collecting money. They are well worth their wages, which represent less than 1 percent of our gross revenue.

You can indeed start with little expense. All entrepreneurs are risk takers, and a little bit of risk now means better rewards in the future, as the money saved on expensive services can help you finance your business and propel its growth. When we incorporated as an LLC in 2000, we used a service called MyCorporation.com (it still exists), which charged us only a few hundred dollars and took care of everything. To this date, we still have no lawyer, though we subscribed to some kind of legal insurance that costs $50/month. And our business model, by design, is not likely to create expensive lawsuits: we stay away from what could cause lawsuits. The same applies to training: you can learn a lot for free on the Internet, or from this book, or by attending free classes at Coursera.com.

One big expense is marketing and advertising. One way to minimize marketing costs is to create your own community and grow a large list of people interested in your services, to in essence “own” your market. Posting relevant discussions on LinkedIn groups and creating your own LinkedIn groups is a way to start growing your network. Indeed, when we started growing our community, we eventually became so large—the leading community for data scientists—that we just stopped there, and we now make money by accepting advertising and sending sponsored eBlasts and messages, blended with high-value, fresh noncommercial content, to our subscribers. This also shows that being a digital publisher is a viable option for data scientists interested in a nontraditional career path.

A final note about working with employees: You might work remotely with people in Romania or India, pay a fraction of the cost of an employee in Europe or the United States, not be subject to employment laws, and experience a win-win relationship. Indeed, as your consultancy grows but the number of hours you can work each month is fixed, one way to scale is to take on more projects but outsource some of the work, maybe the more mundane tasks, abroad. Another way to grow is to automate many tasks such as report production, analyses, data cleaning, and exploratory analysis. If you can run a script that, in one night while you sleep, works on five projects and produces the same amount of work as five consultants, you have found a way to multiply your revenue by three to five times. You can then slash your fee structure by 50 percent and still make more money than your competitors.

Salary Surveys

Many websites provide information about salaries, though usually it applies to salaried employees rather than consultants. Job ads on LinkedIn typically show a salary range. Indeed.com and PayScale.com are also good resources. Glassdoor.com lists actual salaries as reported by employees. Data Science Central regularly publishes salary surveys, including for consultants, at http://bit.ly/19vA0n1.

Kaggle is known to charge $300/hour for data science projects, but typically the rate in the United States is between $60/hour and $250/hour based on project complexity and the level of expertise required. The field of statistical litigation tends to command higher rates, but you need to be a well-known data scientist to land a gig.

Here are a few hourly rates advertised by statistical consultants, published on statistics.com in 2013:

  • Michael Chernick: $150
  • Joseph Hilbe: $200
  • Robert A. LaBudde: $150
  • Bryan Manly: $130
  • James Rutledge: $200
  • Tom Ryan: $125
  • Randall E. Schumacker: $200

Sample Proposals

The starting point in any project of $3,000 or more is a proposal, which outlines the scope of the project, the analyses to be performed, the data sources, deliverables and timelines, and fees. Here are two sample proposals.

Sample Proposal #1

This project was about click fraud detection. Take note of the following information:

  • How much and how I charge (I'm based in Issaquah, WA)
  • How I split the project into multiple phases
  • How to present deliverables and optional analysis, as well as milestones

The proposal included the following work items.

  1. Proof of Concept (5 days of work)
    • Process test data set: 7 or 14 most recent days of click data; includes fields such as user KW (keyword), referral ID (or better, referral domain), feed ID, affiliate ID, CPC (cost-per-click), session ID (if available), UA (user agent), time, advertiser ID or landing page, and IP address (fields TBD)
    • Identification of fraud cases or invalid clicks in test data set via my base scoring system
    • Comparing my scores with your internal scores or metrics
    • Explain discrepancies: false positives/false negatives in both methodologies. Do I bring a lift?
    • Cost: $4,000
  2. Creation of Rule System
    • Creation of four rules per week, for four weeks; focus on most urgent/powerful rules first
    • Work with your production team to make sure implementation is correct (QA tests)
    • Cost: $200/rule
  3. Creation of Scoring System
    • From day one, have your team add a flag vector associated with each click in production code and databases:
      • The flag vector is computed on the fly for each click and/or updated every hour or day.
      • The click score will be a simple function of the flag vector.
      • Discuss elements of database architecture, including lookup tables.
    • The flag vector stores for each click:
      • Which rules are triggered
      • The value (if not binary) associated with each rule
      • Whether the rule is active
    • Build score based on flag vector: that is, assign weights to each rule and flag vector.
    • Group rules by rule clusters and further refine the system (optional, extra cost).
    • Cost: $5,000 (mostly spent on creating the scoring system)
  4. Maintenance
    • Machine learning: Test the system again after three months to discover new rules, fine-tune rule parameters, or learn how to automatically update/discover new rules.
    • Assess how frequently lookup tables (for example, the bad domain table) need updating (every three months).
    • Train data analyst to perform ad-hoc analyses to detect fraud cases/false positives.
    • Perform cluster analysis to assign a label to each fraud segment (optional; will provide a reason code for each bad click if implemented).
    • Impression files: Should we build new rules based on impression data (for example, clicks with 0 impressions)?
    • Make sure scores are consistent over time across all affiliates.
    • Dashboard/high-level summary/reporting capabilities.
    • Integration with financial engine.
    • Integrate conversion data if available, and detect bogus conversions.
    • Cost: TBD

The data mining methodology will mostly rely on robust, efficient, simple, hidden decision tree methods that are easy to implement in Perl or Python.
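To make the flag-vector idea from this proposal concrete, here is a minimal sketch, not the actual production system: the rules, click fields, and weights below are invented for illustration. Each click gets a vector recording which rules fired, and the score is a simple weighted function of that vector.

# Invented rules keyed on hypothetical click fields; a real system would have many more
RULES = {
    "many_clicks_same_ip": lambda c: c["clicks_from_ip_last_hour"] > 20,
    "missing_user_agent":  lambda c: not c["user_agent"],
    "blacklisted_domain":  lambda c: c["referral_domain"] in {"spam.example.com"},
}
WEIGHTS = {"many_clicks_same_ip": 0.5, "missing_user_agent": 0.2, "blacklisted_domain": 0.3}

def flag_vector(click):
    # 1 if the rule is triggered by this click, 0 otherwise
    return {name: int(rule(click)) for name, rule in RULES.items()}

def score(click):
    # Simple weighted sum of the flag vector; higher means more suspicious
    flags = flag_vector(click)
    return sum(WEIGHTS[name] * value for name, value in flags.items())

click = {"clicks_from_ip_last_hour": 35, "user_agent": "", "referral_domain": "spam.example.com"}
print(flag_vector(click), score(click))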

Sample Proposal #2

This project was about web analytics, keyword bidding, sophisticated A/B testing, and keyword scoring. The proposal included the following components:

  1. Purpose: The purpose of this project is to predict the chance of converting, in pay-per-click programs, for keywords with little or no historical data. Using text mining techniques, a rule engine, and predictive modeling (logistic regression, decision trees, Naive Bayes classifiers, or hybrid models), a keyword score will be built for each client. Also, a simple parametric keyword bidding algorithm will be implemented. The bidding algorithm will rely on keyword scores.
  2. Methodology: The scores will initially be based on a training set of 250,000 keywords from one client. For each bid keyword/match type/source (search or content), the following fields will be provided: clicks, conversions, conversion type, and revenue per conversion (if available), collected over the last eight weeks.

    The bidding algorithm will use a hybrid formula, with strong weight on conversion rate for keywords with reliable conversion rates, and strong weight on score for keywords with no historical data. If the client already uses a bidding algorithm based on conversion rate, we will simply use the client's algorithm, substituting the conversion rate with an appropriate function of the score for keywords with little history. The score is used as a proxy for conversion rate.

    If the client does not use a bidding algorithm, we will provide a parametric bidding algorithm depending on a parameter a (possibly multivariate), with a randomly chosen value for each keyword, to eventually detect the optimum a. For instance, a simplified version of the algorithm could be: multiply the bid by a if the predicted ROI (for a given keyword) is higher than some threshold, a being a random number between 0.85 and 1.25. In this type of multivariate testing, the control case corresponds to a = 1. (A minimal sketch of this rule appears after the proposal.)

    The predicted ROI is a simple function of the current bid, the score (which in turn is a predictor of the conversion rate, by design) and the revenue per conversion.

  3. Deliverables and Timeline:
    • Step 1: 250,000 keywords scored (include cross-validation). Algorithm and source code provided to score new keywords. Formula or table provided to compute predicted conversion rate as a function of the score. Four weeks of work, $4,000. Monthly updates to adjust the scores (machine learning): $500/month.
    • Step 2: Implementation of bidding algorithm with client: Four weeks of work, $4,000.
    • Payment: For each step, 50 percent upfront; 50 percent after completion.
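Here is a minimal sketch of the simplified parametric bidding rule described in the methodology. The keyword fields, the ROI formula, the threshold, and the share of control keywords are assumptions for illustration only; the real engine would plug in the client's own ROI definition and the keyword score produced in Step 1.

import random

def predicted_roi(bid, score, revenue_per_conversion):
    # The score acts as a proxy for conversion rate for keywords with little history
    return score * revenue_per_conversion / bid

def adjusted_bid(bid, score, revenue_per_conversion, a, roi_threshold=1.5):
    # Multiply the bid by a when predicted ROI exceeds the threshold
    if predicted_roi(bid, score, revenue_per_conversion) > roi_threshold:
        return bid * a
    return bid

# Assign each keyword a random multiplier; keep some keywords at a = 1 as the control group
keywords = [
    {"keyword": "data science training", "bid": 0.40, "score": 0.030, "rev": 30.0},
    {"keyword": "analytics certificate",  "bid": 0.55, "score": 0.012, "rev": 25.0},
]
for kw in keywords:
    a = 1.0 if random.random() < 0.2 else random.uniform(0.85, 1.25)
    kw["new_bid"] = round(adjusted_bid(kw["bid"], kw["score"], kw["rev"], a), 2)
    print(kw["keyword"], kw["new_bid"], round(a, 2))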

CRM Tools

You need Customer Relationship Management (CRM) software that tracks leads (potential clients) and logs all your interactions with them (including contact information, typically more than one contact per company). Log entries look like this:

  • My partner met with VP of engineering at the INFORMS meeting on June 5, 2013, and discussed reverse clustering. This prospect is a hot lead, very interested.
  • I sent a follow-up e-mail on June 8.
  • Phone call with IT director scheduled on July 23.
  • Need to request sample data in connection with its project xyz.
  • Discussed performing a proof of concept on simulated data. Waiting for data.

Another useful tool is FreshBooks.com, an online bookkeeping system that allows you to send invoices (and track whether they were opened) and is great at measuring money flows per month. I use it as a financial CRM tool, and some of the reports from its dashboard are useful for budgeting purposes and for getting a big picture of your finances.

Finally, processing credit cards and having a credit card authorization form can prove useful.

NOTE The section Managing Your Finances in this chapter is also applicable to the entrepreneur career path.

The Entrepreneur

Entrepreneurship is more complex than managing a consultancy, as you want to scale, aim at higher revenues, and have an exit strategy. Raising venture capital (VC) money was popular in the late ‘90s but has become both more difficult and less necessary. The time you will spend chasing funding is substantial, like a six-month full-time job with no guarantee of success. We did it in 2006, raising $6 million to create a company that would score all Internet clicks, keywords, or impressions, in order to detect fake clicks and bad publishers, and also to help advertisers determine a bidding price for Google keywords with little or no history (the long tail). It lasted a few years and was fun. The one mistake we made was having a bunch of investors who, at one point, started to fight against each other.

VC money is less necessary now because you can launch a new company with little expense, especially if you purposely choose a business model that can easily be scaled and deployed using a lean approach. For instance, our Data Science Central platform relies on Ning.com. We don't have engineers, web developers, sys admins, and security gurus—all these functions are outsourced to Ning. We did not buy any servers: we are co-hosted on Ning for less than $200/month. The drawback is that you have more competitors as such barriers evaporate.

If you really need money for small projects associated with your business, crowdfunding could be a solution. We tried to raise money to pay a professor to write our training material for our data science apprenticeship but were not able to sign up on the crowdfunding platforms. If we do raise money through crowdfunding, we will do it entirely ourselves without using brokers, by advertising our offer in our mailing list.

Another issue with raising money is that you will have to share your original idea with investors and their agents (analysts doing due diligence on your intellectual property—IP). Your intellectual property might end up being stolen in the process. One way to make your IP theft-proof is to make it highly public so that it is not patentable. I call this an open source patent. If you do so (and it definitely fits with the idea of having a lean business with little cost), you need to make sure to control your market; otherwise, competitors with better funding will market your same product better than you can do it. If you own your market—that is, you own the best, largest opt-in mailing list of potential clients—anyone who wants to advertise the product in question must do it through you. It is then easy for you to turn down competitors. Owning your market is also a big component of a lean business because it reduces your marketing costs to almost zero.

Finally, you can use cheap services such as MyCorporation.com to incorporate (also a great lean business principle) and not have a business plan (another lean business idea), as long as you are highly focused yet with a clear vision, and don't get distracted by all sorts of new ideas and side projects. If you are a creator, associate yourself with a pragmatic partner. And think bottom line every day.

My Story: Data Science Publisher

In my case, as a data scientist, I generate leads for marketers. A good-quality lead is worth $40. The cost associated with producing a lead is $10. It requires data science to efficiently generate a large number of highly relevant leads (by detecting and purchasing the right traffic, optimizing organic growth, and using other analytic techniques). If I can't generate at least 10,000 leads a year, nobody will buy due to low volume. If my leads don't convert to actual revenue and produce ROI for the client, nobody will buy.

Also, because of data science, I can sell leads for a lower price than competitors—much less than $40. For instance, our newsletter open rate went from 8 percent to 24 percent, significantly boosting revenue and lowering costs. We also reduced churn to a point where we actually grow, all this because of data science. Among the techniques used to achieve these results are:

  • Improving user, client, and content segmentation
  • Outsourcing and efficiently recruiting abroad
  • Using automation
  • Using multiple vendor testing in parallel (A/B testing)
  • Using competitive intelligence
  • Using true computational marketing
  • Optimizing delivery rate from an engineering point of view
  • Eliminating inactive members
  • Detecting and blocking spammers
  • Optimizing an extremely diversified mix of newsletter metrics (keywords in subject line, HTML code, content blend, ratio of commercial versus organic content, keyword variance to avoid burnout, first sentence in each message, levers associated with retweets, word-of-mouth and going viral, and so on) to increase total clicks, leads, and conversions delivered to clients. Also, you need to predict sales and revenues—another data science exercise.

Startup Ideas for Data Scientists

For those interested in becoming a data science entrepreneur, here are a few ideas. Hopefully they will also spark other ideas for you.

R in Your Browser

Create a platform allowing users to enter R commands in Chrome or Firefox in a browser-embedded console. Wondering how easy it would be to run R from a browser on your iPad? I'm not sure how you would import data files, but R does offer the possibility to open a file located on a web or FTP server rather than a local file stored on your desktop: functions such as read.csv() accept URLs. Also, it would be cool to have Python in a browser.

See the example at http://www.datasciencecentral.com/profiles/blogs/r-in-your-browser.

A New Type of Weapons-Grade Secure E-mail

This could be a great opportunity for mathematicians and data scientists: creating a startup that offers encrypted e-mail that no government or entity could ever decrypt, and offering safe solutions to corporations that don't want their secrets stolen by competitors, criminals, or the government.

Here's an example of an e-mail platform:

  • It is offered as a web app, for text-only messages limited to 100 KB. You copy and paste your text on some web form hosted on some web server (referred to as A). You also create a password for retrieval, maybe using a different app that creates long, random, secure passwords. When you click Submit, the text is encrypted and made accessible on some other web server (referred to as B). A shortened URL displays on your screen; that's where you or the recipient can read the encrypted text.
  • You call (or fax) the recipient, possibly from and to a public phone, and provide the shortened URL and password necessary to retrieve and decrypt the message.
  • The recipient visits the shortened URL, enters the password, and can read the unencrypted message online (on server B). The encrypted text is deleted after the recipient has read it, or 48 hours after the encrypted message was created, whichever comes first.

TECHNICAL NOTE

The encryption algorithm adds semi-random text to your message prior to encryption, embeds an encrypted timestamp, and refuses to run if no semi-random text has been added. It is designed so that (i) the message cannot be decrypted after 48 hours (if the encrypted version is intercepted) because a self-destruction mechanism is embedded into the encrypted message and the executable file itself, and (ii) if you encrypt the same message twice (even an empty message or one consisting of just one character), the two encrypted versions will be very different (of random length and at least 1 KB in size), making reverse engineering next to impossible. A minimal sketch of this padding-plus-expiration idea appears after the following list.

  • Maybe the executable file that performs the encryption would change every three to four days for increased security and to make sure a previously encrypted message can no longer be decrypted. (You would have the old version and the new version simultaneously available on B for just 48 hours.)
  • The executable file (on A) tests if it sits on the right IP address before doing any encryption, to prevent it from being run on, for example, a government server. This feature is encrypted within the executable code. The same feature is incorporated into the executable file used to decrypt the message, on B.
  • A crime detection system is embedded in the encryption algorithm to prevent criminals from using the system by detecting and refusing to encrypt messages that seem suspicious (child pornography, terrorism, fraud, hate speech, and so on).
  • The platform is monetized via paid advertising, for instance from antivirus software vendors and other advertisers.
  • The URL associated with B can be anywhere, change all the time, or be based on the password provided by the user and located outside the United States.
  • The URL associated with A must be more static. This is a weakness because it can be taken down by the government. However, a workaround consists of using several specific keywords for this app, such as ArmuredMail, so that if A is down, a new website based on the same keywords will emerge elsewhere, allowing for uninterrupted service. (The user would have to do a Google search for ArmuredMail to find one website—a mirror of A—that works.)
  • Finally, no unencrypted text is stored anywhere.
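
As a back-of-the-envelope illustration of the technical note above, here is a minimal sketch in Python. It assumes the Fernet symmetric scheme from the open source cryptography package as a stand-in for the custom algorithm: Fernet tokens already embed a timestamp (so a time-to-live can be enforced at decryption) and a random IV (so the same message encrypts differently every time). The random-padding helper and the 48-hour limit are illustrative assumptions, not a specification.

import os
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()   # shared secret held by servers A and B
f = Fernet(key)

def encrypt_message(plaintext: str) -> bytes:
    # Prepend random padding so that identical messages yield tokens of
    # different lengths (Fernet also randomizes the IV on every call).
    padding = os.urandom(16 + os.urandom(1)[0] % 48).hex()
    return f.encrypt(f"{padding}|{plaintext}".encode())

def decrypt_message(token: bytes, max_age_seconds: int = 48 * 3600) -> str:
    # Fernet embeds a timestamp in every token; ttl= rejects tokens older
    # than max_age_seconds, approximating the 48-hour self-destruction.
    try:
        padded = f.decrypt(token, ttl=max_age_seconds).decode()
    except InvalidToken:
        return "<message expired or tampered with>"
    return padded.split("|", 1)[1]

print(decrypt_message(encrypt_message("meet me at noon")))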

Indeed, the government could create such an app and disguise it as a private enterprise: it would in this case be a honeypot app. Some people worry that the government is tracking everyone and that you could get in trouble (your Internet connection shut down and bank account frozen) because you posted stuff that the government algorithms deem extremely dangerous, maybe a comment about pressure cookers. At the same time, I believe the threat is somewhat exaggerated.

Anyone interested in building this encryption app? Note that no system is perfectly safe. If there's an invisible camera behind you, filming everything you do on your computer, then my system offers no protection for you—though it would still be safe for the recipient, unless he also has a camera tracking all his computer activity. But the link between you and the recipient (the fact that both of you are connected) would be invisible to any third party. And increased security can be achieved if you use the web app from an anonymous computer—maybe from a public computer in some hotel lobby.

Averaging Text, Not Numbers

Averaging numbers is easy, but what about averaging text? I ask the question because the Oxford dictionary recently added a new term: big data.

So how do you define big data? Do you pick up a few experts (how do you select them?) and ask them to create the definition? What about asking thousands of practitioners (crowdsourcing) and somehow average their responses? How would you proceed to automatically average their opinions, possibly after clustering their responses into a small number of groups? It looks like this would require

  • Using taxonomies.
  • Standardizing, cleaning, and stemming all keywords, very smartly (using stop lists and exception lists — for instance, booking is not the same as book).
  • Using n-grams, but carefully. (Some keyword permutations are not equivalent—use a list of these exceptions. For example, data mining is not the same as mining data.)
  • Matching the cleaned and normalized keyword sequences found in the responses against taxonomy entries, to assign categories to each response and further simplify the text-averaging process.
  • Maybe a rule set associated with the taxonomy, such as average(entry A, entry C)=xyz.
  • A synonym dictionary.
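
One way to read the preceding list is as a clustering-plus-summarization pipeline: group similar responses, then keep the dominant terms of each group as its "average." Here is a minimal sketch, assuming scikit-learn and a handful of made-up responses; it stands in for the stemming, taxonomy, and n-gram steps rather than implementing them.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

responses = [
    "Big data means data sets too large for a single machine",
    "Data too big for traditional databases, measured in terabytes",
    "Applying machine learning and statistics to massive data streams",
    "Statistics at scale, on fast-flowing unstructured data",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(responses)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()

for c in range(kmeans.n_clusters):
    # The top terms of each cluster centroid act as that cluster's "average."
    top = kmeans.cluster_centers_[c].argsort()[::-1][:5]
    print(f"Cluster {c}:", ", ".join(terms[i] for i in top))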

On a different note, have you ever heard of software that automatically summarizes text? This would be a fantastic tool!

Typed Passwords Replaced by Biometrics

What if the traditional login/password to your favorite websites was replaced by a web app asking you to talk for one second, to get your picture taken from your laptop camera, or to check your fingerprint on your laptop touch screen? Your one-second voice message or image of your fingerprint could then be encoded using a combination of several metrics and stored in a database for further login validation.

The idea is to replace the traditional login/password (not a secure way to connect to a website) with more secure technology, which would apply everywhere on any computer, laptop, cell phone, or device where this ID checking app would be installed.

Also, the old-fashioned login/password system could still coexist as an option. The aim of the new system would be to allow anyone to log on with just one click, from any location/device on any website, without having to remember a bunch of (often) weak passwords. This methodology requires large bandwidth and considerable computer resources.
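
A minimal sketch of the encode-and-compare step might look like the following. The extract_features() function is a placeholder I made up; a real system would use a dedicated voice, face, or fingerprint model to produce the stored template.

import numpy as np

def extract_features(sample: np.ndarray) -> np.ndarray:
    # Placeholder embedding: normalize the raw signal to unit length.
    v = sample.astype(float)
    return v / (np.linalg.norm(v) + 1e-12)

def verify(stored_template: np.ndarray, new_sample: np.ndarray,
           threshold: float = 0.95) -> bool:
    # Cosine similarity between the stored template and the fresh sample.
    score = float(np.dot(stored_template, extract_features(new_sample)))
    return score >= threshold

enrolled = extract_features(np.random.rand(256))          # stored at sign-up
attempt = enrolled * 0.98 + 0.02 * np.random.rand(256)    # later login attempt
print(verify(enrolled, attempt))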

Web App to Run Polls and Display Real-Time Results

It should handle ten questions, each with multiple choices, and display the results in real time on a world map, using colors. Geolocation of each respondent is detected in real time based on an IP address. The color displayed (for each respondent) would represent the answer to a question. For instance, male=red, female=yellow, and other=green. I used to work with Vizu; the poll service was available as a web widget, offering all these features for free, but it's now gone. It would be great if this type of poll/survey tool could handle 2,000 actual responses, and if you could zoom in on the map.

Inbox Delivery and Management System for Bulk E-mail

Here's a new idea for Google to make money and cover the costs of processing or filtering billions of messages per day. It would also eliminate spam, with fewer false positives than today. The solution would be for Google to create its own newsletter management system. Or, at least, Google would work with major providers (VerticalResponse, Constant Contact, iContact, MailChimp, and so on) to allow their clients (the companies sending billions of messages each day, such as LinkedIn) to pay a fee based on volume. The fee would help the sender stay out of the Gmail spam folder, as long as it complies with Google's policies. Even better: let Google offer this newsletter management service directly to clients who want to reach Gmail more effectively, under Google's controls and conditions.

I believe Google is now in a position to offer this service because more than 50 percent of new personal e-mail accounts currently created are Gmail, and they last much longer than any corporate e-mail accounts. (You don't lose your Gmail account when you lose your job.) Google could reasonably charge $100 per 20,000 messages sent to Gmail accounts: the potential revenue is huge.

If Google offered this service internally (rather than through a third party such as Constant Contact), it would make more money and have more control, and the task of eliminating spam would be easier and less costly.

Currently, because Google offers none of these services, you will have to face the following issues:

  • A big component in Gmail antispam technology is collaborative filtering algorithms. Your newsletter quickly ends up in the spam box, a few milliseconds after the delivery process has started, if too many users complain about it, do not open it, or don't click.
  • Thus fraudsters can create tons of fake Gmail accounts to boost the “open” and “click” rates so that their spam goes through, leveraging collaborative filtering to their advantage.
  • Fraudsters can also use tons of fake Gmail accounts to fraudulently and massively flag e-mail received from real companies or competitors as fraud.

Newsletters are delivered too quickly: 100,000 messages are typically delivered in five minutes by newsletter management companies. If Gmail were delivering these newsletters via its own system (say, Gmail Contact), then it could deliver more slowly, and thus do a better job at controlling spam without creating tons of false positives.

In the meantime, a solution for companies regularly sending newsletters to a large number of subscribers is to do the following:

  • Create a special segment for all Gmail accounts, and use that segment more sparingly. In our case, it turns out that our Gmail segment is the best one (among all our segments) in terms of low churn, open rate, and click rate—if we do not use it too frequently and reserve it for our best messages.
  • Ask your newsletter management vendor to use a dedicated IP address to send messages.
  • Every three months, remove all subscribers who never open your messages, or even those who never click (see the sketch after this list). Note that you will lose some good subscribers whose e-mail clients block images, because opens cannot be tracked for them.
  • Create SPF (Sender Policy Framework) records.
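
Here is a small sketch of the pruning step mentioned in the list, assuming a pandas DataFrame with one row per subscriber and hypothetical last_open and last_click columns; the cutoff date and sample rows are made up.

import pandas as pd

subscribers = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "last_open": pd.to_datetime(["2013-01-05", "2012-06-01", None]),
    "last_click": pd.to_datetime([None, "2012-05-20", None]),
})

# Keep anyone who opened or clicked within the last three months.
cutoff = pd.Timestamp("2013-03-01") - pd.DateOffset(months=3)
engaged = (subscribers["last_open"] >= cutoff) | (subscribers["last_click"] >= cutoff)

keep = subscribers[engaged]      # active list
review = subscribers[~engaged]   # candidates for removal
print(keep["email"].tolist(), review["email"].tolist())
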
Pricing Optimization for Medical Procedures

Problem:

The cost of healthcare in the United States is extremely high and volatile. Even standard procedures have high price volatility. If you ask your hospital how much it will charge for what is typically a $5,000-or-more procedure, for instance by asking what it charged the last five patients, it will have no answer (as if it kept no financial records).

Would you buy a car if the dealer had no idea how much it would charge you until two months after the purchase? If you did, and then heard that your neighbor got exactly the same car from a different dealer for half the price, what would you do? Any other type of business using such practices would quickly go bankrupt.

Causes:

  • Despite hiring the brightest people (surgeons and doctors), hospitals lack basic analytic talent for budgeting, pricing, and forecasting. They can't create a statistical financial model that includes revenues (insurance payments and so on) and expenses (clients not paying, labor, drug costs, equipment, lawsuits, and so on).
  • Costs are driven in part by patients and hospitals being too risk-averse (fear of litigation) and asking for unnecessary tests. Yet few patients ever state that their budget is, say, $2,000 for a given procedure, and that they won't pay more and will cancel or find another provider that can meet that budget, if necessary.
  • Some volatility in prices is expected, but it must be controlled. If I provide consulting services to a client, the client (like any patient) has a finite budget, and I (like hospitals) have constraints and issues. (I need to purchase good-quality software; the data might be much messier than expected, and so on.)

Solution:

An Internet start-up offering prices for 20 top medical procedures, across 5,000 hospitals and 20 patient segments. Prices would come from patients filling out a web form or sharing their invoice details, and from partners such as cost-efficient hospitals willing to attract new patients. Statistical inferences would be made to estimate all 20 × 5,000 × 20 = 2,000,000 prices (and their volatility) every month, based on perhaps as few as 8,000 data points. The statistical engine would be similar to Zillow's (estimating the value of all houses based on a limited volume of sales data), but it would also have a patient review component (like Yelp), together with fake review detection.
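
The estimation step could start as simply as an additive-effects model: each price is approximated by an overall mean plus a procedure effect, a hospital effect, and a segment effect, with unseen combinations shrunk toward the mean. This is only a rough sketch with made-up numbers and column names, not a full Zillow-style engine.

import pandas as pd

obs = pd.DataFrame({
    "procedure": ["appendectomy", "appendectomy", "mri", "mri"],
    "hospital":  ["H1", "H2", "H1", "H3"],
    "segment":   ["insured", "uninsured", "insured", "insured"],
    "price":     [9000, 14000, 1200, 2500],
})

mean = obs["price"].mean()
proc_eff = obs.groupby("procedure")["price"].mean() - mean
hosp_eff = obs.groupby("hospital")["price"].mean() - mean
seg_eff  = obs.groupby("segment")["price"].mean() - mean

def estimate(procedure: str, hospital: str, segment: str) -> float:
    # Unseen levels contribute a zero effect (shrinkage toward the mean).
    return (mean
            + proc_eff.get(procedure, 0.0)
            + hosp_eff.get(hospital, 0.0)
            + seg_eff.get(segment, 0.0))

print(estimate("mri", "H2", "insured"))   # a cell never observed directly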

Revenue would come from hospital partners, from selling data to agencies, from advertising to pharmaceutical companies, and from membership fees (premium members having access to more granular data).

Checks Sent by E-mail

Checks sent by e-mail, rather than snail mail—the idea works as follows:

  • You (the payer) want to send a check to, for example, John, the payee.
  • You log on to this new start-up website, add your bank information (with account name, account number, and routing number) and a picture of your signature (the first time you use the service), and then for each check, you provide the dollar amount, payment reason, the payee's name, and his e-mail address.
  • The payee receives an e-mail notification and is requested to visit a URL to process his check.
  • He prints his check from the URL in question (operated by the start-up), just like you would print a boarding pass.
  • The payee goes to his bank to cash his check.

Note that this start-up could charge the payee 20 cents per check (it's less than 50 percent of the cost of a stamp) and have a higher successful delivery rate than snail mail, which is easy to beat.

The analytic part of this business model is in the security component to prevent fraud from happening. To a lesser extent, there are also analytics involved to guarantee that e-mail is delivered to the payee at least 99.99 percent of the time. Information must travel safely through the Internet, and in particular, the e-mail sent to payees must contain encrypted information in the URL used to print the check.
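
For the encrypted-URL part, a standard approach is to sign the check details with a keyed hash and verify the signature before allowing the check to be printed. A minimal sketch, with a hypothetical secret key and URL format:

import hmac, hashlib
from urllib.parse import urlencode, parse_qs

SECRET = b"server-side-secret-key"

def make_check_url(check_id: str, payee: str, amount_cents: int) -> str:
    payload = f"{check_id}|{payee}|{amount_cents}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return "https://example-checks.com/print?" + urlencode(
        {"id": check_id, "payee": payee, "amount": amount_cents, "sig": sig})

def verify_check_url(url: str) -> bool:
    q = parse_qs(url.split("?", 1)[1])
    payload = f"{q['id'][0]}|{q['payee'][0]}|{q['amount'][0]}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, q["sig"][0])

url = make_check_url("C-1001", "John Doe", 12500)
print(verify_check_url(url))   # True; any edit to the URL breaks the signature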

PayPal already offers this type of functionality (online payments), but not the check printing feature. For some people or some businesses, a physical check is still necessary. I believe checks will eventually disappear, so I'm not sure if such a start-up would have a bright future.

Anonymous Digital Currency for Bartering

Several digital currencies already exist and are currently being used (think about PayPal and Bitcoin), but as far as I know, they are not anonymous.

An anonymous digital currency would be available as encrypted numbers that can be used only once; it would not be tied to any paper currency or e-mail address. Initially, it could be used between partners interested in bartering. (For example, I purchase advertising on your website in exchange for you helping me with web design—no U.S. dollars exchanged). The big advantages are:

  • It is not a leveraged currency — no risk of “bank run,” inflation, and so on.
  • Transactions paid via this currency would not be subject to any tax.

What are the challenges about designing this type of currency? The biggest ones are:

  • Getting many people to trust and use it.
  • Difficult to trade it in for dollars.
  • The entity managing this currency must make money in some way. How? Can it be done by charging a small fee for all transactions? The transaction fee would be much smaller than the tax penalty if the transaction were made in a non-anonymous currency (digital or not).
  • Security: Being an antifraud expert, I know how to make it very secure. I also know how to make it totally anonymous. (A toy sketch of single-use tokens follows this list.)
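
As a toy sketch of "encrypted numbers that can be used only once," the issuer could sign random serial numbers and keep a set of spent serials, so each token redeems exactly once. Everything here is illustrative; true anonymity would require something like blind signatures so that the issuer cannot link issuance to redemption.

import hmac, hashlib, secrets

ISSUER_KEY = b"issuer-secret"
spent = set()

def issue_token() -> str:
    serial = secrets.token_hex(16)
    sig = hmac.new(ISSUER_KEY, serial.encode(), hashlib.sha256).hexdigest()
    return f"{serial}.{sig}"

def redeem(token: str) -> bool:
    serial, sig = token.split(".")
    valid = hmac.compare_digest(
        sig, hmac.new(ISSUER_KEY, serial.encode(), hashlib.sha256).hexdigest())
    if not valid or serial in spent:
        return False        # forged token or double spend
    spent.add(serial)
    return True

t = issue_token()
print(redeem(t))   # True: first use
print(redeem(t))   # False: token already spent
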
Detect Scams Before They Go Live

Based on tweets and blog posts, identify new scams before they become widespread.

  • Search for keywords such as scam, scammed, theft, rip-off, fraud, and so on.
  • Assign a date stamp, location, and category to each posting of interest.
  • Identify and discard bogus postings that report fake fraud stories.
  • Score each post for scam severity and trustworthiness (see the sketch after this list).
  • Create a taxonomy to categorize each posting.
  • Create a world map, updated hourly, with scam alerts. For each scam, indicate a category, an intensity, and a recency measurement.
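
The keyword-search and severity-scoring steps could start as simply as the following sketch; the keyword list and weights are made up, and a real system would add the taxonomy, trustworthiness, and bogus-posting filters described above.

import re

KEYWORDS = {"scam": 2, "scammed": 2, "fraud": 3, "rip-off": 2, "theft": 3}

def scam_score(post: str) -> int:
    # Sum the weights of the distinct scam-related keywords in the post.
    text = post.lower()
    return sum(weight for kw, weight in KEYWORDS.items()
               if re.search(r"\b" + re.escape(kw) + r"\b", text))

posts = [
    "Just got scammed by a fake online store, total rip-off",
    "Great weather in Seattle today",
]
for p in posts:
    print(scam_score(p), p)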

Is such a public system already in place? The FBI surely has one, but it's not shared with the public. Also, such a scam alert system is quite similar to the crowdsourcing-based systems used to detect diseases spreading around the world. The spreading mechanism in both cases is similar: scam and disease agents use viruses to spread and contaminate. In the case of scams, computer viruses infect machines, recruit them into botnets, and turn them into spam zombies that send scam e-mail to millions of recipients.

Amusement Park Mobile App

Will analytics help amusement parks survive the 21st century? My recent experience at Disneyland in California makes me think that a lot of simple things can be done to improve revenue and user experience. Here are a few starting points:

  • Optimize lines. Maybe charge an extra $1 per minute for customers who need more than five minutes to purchase their tickets.
  • Offer online tickets and other solutions (not requiring human interactions) to avoid the multiple checkpoint bottlenecks.
  • Make tickets more difficult to counterfeit; assess and fight fraud due to fake tickets. (These tickets seem relatively easy to forge.)
  • Increase prices and find the optimum price. (The optimum is clearly higher than current prices, given the extremely large and dense crowds visiting these parks, which create huge waiting lines and other hazards everywhere—from the few restaurants and bathrooms to the attractions.)
  • Make the waiting time for each attraction available on cell phones, letting users choose attractions based on waiting times and improving the user experience; a small sketch follows this list. (I've seen attractions where the fun lasts 5 minutes but the wait is 2 hours, which creates additional problems, such as people needing a bathroom after a 90-minute wait.)
  • More restaurants, and at least one upscale restaurant.
  • Create more attractions that riders can join on demand, such as a gondola or Ferris wheel that people can board while it is in motion. (In short, attractions that don't need to stop every five minutes to unload and reload riders, but where people continuously enter and exit. This, of course, reduces waiting times.)
  • Recommend to users the best days to visit to avoid huge crowds based on historical trends (and change ticket price accordingly—cheaper on “low days”).
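
For the waiting-time item, a first approximation comes from Little's law: the expected wait is roughly the number of people in line divided by the ride's throughput. A toy sketch with made-up numbers:

def estimated_wait_minutes(people_in_line: int,
                           riders_per_cycle: int,
                           cycle_minutes: float) -> float:
    # Little's law: wait = queue length / throughput.
    throughput_per_minute = riders_per_cycle / cycle_minutes
    return people_in_line / throughput_per_minute

# 480 people waiting, 40 riders every 5 minutes -> about 60 minutes.
print(estimated_wait_minutes(480, 40, 5.0))
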
Software to Optimize Hotel Room Prices in Real Time

When you book five nights in a hotel, usually each night (for the same room) will have a different price. The price is based on a number of factors, including these:

  • Season
  • Day of the week
  • Competitor prices
  • Inventory available

I am wondering which other metrics are included in the pricing models. For instance, do they include weather forecasts, special events, price elasticity? Is the price user-specific (for example, based on IP addresses) or based on how the purchase is made (online, over the phone, using an American Express card versus a Visa card), how long in advance the purchase is made, and so on?

For instance, a purchase made long in advance results in a lower price, but that is largely because inventory is still high. So you might do fine using available inventory alone.

Finally, how often are these pricing models updated, and how is model performance assessed? Is the price changing in real time?
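
A toy sketch of how factors like those listed above might combine into a nightly rate follows; the base rates, multipliers, and blend with competitor prices are made-up assumptions, not how any real revenue-management system works.

SEASON_FACTOR = {"low": 0.85, "shoulder": 1.0, "peak": 1.30}
WEEKEND_FACTOR = 1.15

def nightly_price(base_rate: float, season: str, is_weekend: bool,
                  occupancy: float, competitor_rate: float) -> float:
    price = base_rate * SEASON_FACTOR[season]
    if is_weekend:
        price *= WEEKEND_FACTOR
    # Scarcity premium: raise the price as remaining inventory shrinks.
    price *= 1.0 + 0.5 * max(0.0, occupancy - 0.7)
    # Stay loosely anchored to the competition.
    return round(0.8 * price + 0.2 * competitor_rate, 2)

print(nightly_price(120.0, "peak", True, occupancy=0.9, competitor_rate=160.0))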

Web App to Predict Risk of a Tax Audit

What are your chances of being audited? And how can data science help answer this question, for each of us individually?

Some factors increase the chance of an audit, including

  • High income
  • Being self-employed versus being a partner in an LLC
  • Filing early
  • Having earnings both from W2s and self-employment
  • Business losses four years in a row
  • More than 40 percent of your total deductions falling in the travel and restaurant categories
  • Large donations
  • High home office deductions

Companies such as Deloitte and KPMG probably compute tax audit risks quite accurately (including the penalties in case of an audit) for their large clients because they have access to large internal databases of audited and non-audited clients.

But for you and me, how could data science help predict our risk of an audit? Is there data publicly available to build a predictive model? A good start-up idea would be to create a website to help you compute your risk of tax audit. Its risk scoring engine would be based on the following three pieces:

  • Users can anonymously answer questions about their tax return and previous audits (if any), or submit an anonymized version of their tax returns
  • User data is gathered and analyzed to give a real-time answer to users checking their audit risk
  • As more people use the system, the predictions improve and the confidence intervals for the estimated audit probabilities shrink

Indeed, this would amount to reverse-engineering the IRS algorithms.
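
A minimal sketch of such a scoring engine, assuming a handful of hypothetical features and a tiny made-up training set of self-reported audit outcomes, could be a plain logistic regression:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: income ($1,000s), self-employed (0/1), consecutive loss years,
# deduction-to-income ratio. One row per (anonymized) user.
X = np.array([
    [ 60, 0, 0, 0.05],
    [250, 1, 3, 0.30],
    [ 90, 1, 0, 0.10],
    [400, 1, 4, 0.45],
    [ 45, 0, 0, 0.02],
    [180, 0, 1, 0.20],
])
y = np.array([0, 1, 0, 1, 0, 0])   # 1 = was audited

model = LogisticRegression(max_iter=1000).fit(X, y)
new_filer = np.array([[300, 1, 4, 0.40]])
print(model.predict_proba(new_filer)[0, 1])   # estimated audit probability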

Which metrics should be used to assess tax audit risk? And how accurate can the prediction be? Could such a system attain 99.5 percent accuracy — that is, wrong predictions for only 0.5 percent of taxpayers? Right now, if you tell someone “You won't be audited this year,” you are correct about 98 percent of the time, so raw accuracy alone is a weak yardstick. More interestingly, what level of accuracy can be achieved for higher-risk taxpayers?

Finally, if your model predicts both the risk of audit and the penalty if audited, then you can make smarter decisions and decide which risks you can take and what to avoid. This is pure decision science to recoup dollars from the IRS, not via exploiting tax law loopholes (dangerous), but via honest, fair, and smart analytics—outsmarting the IRS data science algorithms rather than outsmarting tax laws.

Summary

This chapter explored how to become a data scientist, covering the university degrees currently available, certificate training programs, and the online apprenticeship provided by Data Science Central.

Different types of data science career paths were also discussed, including entrepreneur, consultant, individual contributor, and leader. A few ideas for data science start-ups were also provided, in case you want to become an entrepreneur.

The next chapter describes selected data science techniques in detail without introducing unnecessary mathematics. It includes information on visualization, statistical techniques, and metrics relevant to big data and new types of data.
