© Thomas W. Dinsmore 2016

Thomas W. Dinsmore, Disruptive Analytics, 10.1007/978-1-4842-1311-7_8

8. Machine Learning

Software That Learns

Thomas W. Dinsmore

(1)Newton, Massachusetts, USA

In Chapter Two, we surveyed the history of business analytics as a whole, noting that statistics and machine learning developed separately from data warehousing and business intelligence. In this chapter, we pick up where Chapter Two left off with a review of recent trends in machine learning.

Most of the key innovations in machine learning are distributed as open source software, which we discussed in Chapter Three. The discussion of scale-out architecture for machine learning extends the treatment of analytics in Hadoop covered in Chapter Four.

In Chapter Five, we covered Apache Spark , a distributed in-memory platform that is central to a discussion of distributed machine learning. We covered streaming machine learning briefly in Chapter Six and cloud-based machine learning in Chapter Seven.

Due to the significance of deep learning, we include a section covering this technology. We close the chapter with a survey of leading tools for modern machine learning.

Recent Trends in Machine Learning

The most important trends affecting machine learning today are:

  • Convergence of statistics and machine learning

  • Growth of formal machine learning competitions

  • Increased adoption of ensemble learning

  • Development of scalable techniques for machine learning with Big Data

  • Emergence of deep learning

Predicting the future impact of these trends requires some speculation; without question, though, they are affecting the machine learning discipline today.

Convergence

In 2001, Leo Breiman, professor emeritus of statistics at the University of California, Berkeley, wrote1 of “two cultures” in predictive analytics. One culture, which he labeled as “data modelers,” approached the predictive modeling problem by testing the hypothesis that the data conformed to one of several established functional forms.

The second culture, which he dubbed “algorithmic,” approached the problem without assumptions and used machine learning tools to discover the model with the highest predictive power for the data at hand. Breiman used his own terminology, but it is clear that the “data modeling” label applied to the statistics community, and the “algorithmic” label to the machine learning community.

The “cultural divide” was even worse than Breiman described. Within the machine learning community there were numerous subcultures that developed around different core technologies, such as decision trees, neural networks, support vector machines, memory-based reasoning, and so forth.

Machine learning technologies developed separately from one another, with roots in different disciplines. Each developed its own language and tools. Practitioners developed skills and expertise in a single method, then vigorously argued that “their” method was better than all other methods. Each method had its own software implementation, which made comparison difficult.

Today, the debates are largely over and the cultural divide Breiman described is gone. For the most part, the “algorithmic” camp won; credentialed statisticians and actuaries freely use machine learning tools together with statistical techniques. Popular techniques, such as regularization, can’t be easily assigned to one camp or another.

Regularization is a technique in machine learning to control overfitting, the tendency of an algorithm to “learn” the idiosyncrasies of the training data. Overfitting produces a model that predicts well on the training data but not on new data. Regularization controls for this problem by adding a penalty to the loss function that grows with the number and size of the model’s coefficients.
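As a minimal sketch of the idea (assuming the open source scikit-learn library and a synthetic, purely illustrative data set), the following Python code fits ridge (L2) and lasso (L1) regressions; the alpha parameter sets the strength of the penalty.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 200 cases, 50 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# alpha sets the strength of the penalty on coefficient size; larger values
# shrink coefficients harder and reduce overfitting
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# the L1 (lasso) penalty drives many coefficients to exactly zero
print(sum(coef != 0 for coef in lasso.coef_))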

There are three reasons for this convergence. First, the machine learning approach aligns with business needs better than the statistical approach. Breiman’s “data modeling” culture defined success by methodological “correctness” and measured success with statistical “goodness of fit” measures. But most business leaders are not trained in statistics and have no interest in measures such as F-tests, T-tests, and R-squared; on the other hand, they immediately grasp measures such as accuracy and precision and understand testing predictions on historical data.

Second, the machine learning community has developed methods and procedures that control for concerns about bias or overfitting. Methods like out-of-sample and out-of-time testing, cross-validation, and partial dependency analysis are so powerful that they are used today with statistical techniques as well as with machine learning techniques.
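As an illustration (a sketch assuming a recent version of scikit-learn and a synthetic, illustrative data set), k-fold cross-validation takes only a few lines of Python:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# five-fold cross-validation: train on four folds, score the held-out fold,
# and repeat so that every case is evaluated out of sample exactly once
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())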

Finally, data mining workbenches, introduced in the 1990s, combined different machine learning techniques with statistical techniques. These consolidated platforms made it easy for practitioners to test many different techniques and to choose the one best suited for the problem at hand.

Competition

Competitive machine learning, where teams and individuals compete to build the best model for prize money, has contributed greatly to the discipline. Competitions serve as laboratories for best practices in machine learning, and increase visibility of new techniques.

Since 1997, the Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) has sponsored an annual competition called the KDD Cup. Each annual competition invites participants to complete a specific challenge, such as categorizing Internet search queries (2005); detecting breast cancer (2008); or predicting ratings on educational funding proposals (2014).

The competitions have grown increasingly complex over time, from a straightforward classification problem in 1997 to the 2016 competition, in which teams compete to measure the relative influence of research institutions in a social graph. Increasingly, the challenges require entrants to blend multiple tools and techniques into an integrated solution.

The Netflix Prize was a highly visible contest that ran from 2006 to 2009. Netflix, the online DVD rental and video streaming service, offered $1,000,000 to the team that could beat Netflix’s existing collaborative filtering algorithm by at least 10%. Netflix offered annual progress awards to the best performing team for the duration of the contest. For the contest, Netflix provided data sets for model training and for model evaluation, and specified the root mean squared error (RMSE) as the measure of model accuracy.

Netflix launched the competition on October 2, 2006. Within six days, a team beat Netflix’s baseline.2 At the end of the first and second competition years, Netflix awarded progress prizes, as no team had yet exceeded the 10% threshold.

Finally, in June 2009, two teams beat the 10% threshold. Over the course of the contest, 5,169 teams submitted 44,014 entries; the top two teams were closely matched, scoring RMSEs of 0.8554 and 0.8553, respectively. Netflix awarded the $1,000,000 to a team of seven researchers from Austria and the United States.
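RMSE itself is simple to compute; a minimal sketch in Python with NumPy, using purely illustrative values:

import numpy as np

def rmse(actual, predicted):
    # square the errors, average them, and take the square root
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

print(rmse([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0]))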

Inspired by the impact of the Netflix Prize, Anthony Goldbloom and Ben Hamner founded Kaggle in 2010 as a platform for predictive modeling and analytics competitions. Under Kaggle’s model, a host organization sponsors a competition, defines the rules, and offers a prize. Kaggle provides a platform that hosts the data, accepts submissions, maintains a leaderboard, and enforces the rules.

To date, Kaggle has hosted more than 200 public competitions for diverse sponsors, including Allstate, Caterpillar, GE, Heritage Health, Home Depot, Liberty Mutual, Merck, Prudential, Santander, and State Farm. First prizes range from knowledge, kudos, swag, and job opportunities to $500,000.

With more than a half-million registered users, Kaggle claims to have the world’s largest community of data scientists. Kaggle tracks the performance of registered users on a leaderboard. In the absence of well-defined credentials for data scientists, the Kaggle leaderboard defines an elite community of experts.

Many other competitions support advances in specialized areas, such as handwriting recognition, traffic sign recognition, brain image classification, breast cancer diagnosis, and so forth. Successful efforts in these competitions contributed greatly to renewed interest in deep learning, which we discuss later in this chapter.

It is difficult to overstate the impact of machine learning competitions. Competitions draw a great deal of interest from the machine learning community, and successful techniques are quickly disseminated. Moreover, the competitive environment demonstrates the value of collaboration and teamwork in advanced analytics and validates the crowdsourcing approach.

Ensemble Learning

As some researchers developed fundamentally new ways to train models, others found ways to improve models by combining techniques in various ways. Ensemble learning techniques use multiple models to produce an aggregate model whose predictive power is better than individual models used alone. These techniques are computationally intensive; growth in available computing power made ensemble learning accessible for mainstream users.

The many ways to combine models boil down to three: boosting, bagging, and blending. Boosting operates iteratively, successively building models on the errors of each previous model. ADABoost (Adaptive Boosting), introduced in 1995, is one of the most popular methods for ensemble learning. The ADABoost meta-algorithm operates iteratively, leveraging information about incorrectly classified cases to develop a strong aggregate model. With each pass, ADABoost reweights the incorrectly classified cases so that subsequent weak learners concentrate on them, then weights each weak learner according to its ability to add to the overall predictive power of the model.
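As a minimal sketch (assuming scikit-learn and a synthetic, illustrative data set), the following Python code boosts 200 weak learners; by default, each weak learner is a one-level decision tree, or “stump”:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each round reweights the misclassified cases so that the next weak learner
# concentrates on the hard cases
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))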

Leo Breiman developed a bagging algorithm in 1996. Bagging selects multiple subsamples from an original training data set, builds a model for each subsample, then builds a solution through averaging (for regression) or through a voting procedure (for classification). The principal advantage of bagging is its ability to build more stable models; its main disadvantage is its computational complexity and requirement for larger data sets. The growth of high-performance computing mitigates these disadvantages.
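A comparable sketch of bagging (again assuming scikit-learn and illustrative data); by default, each model in the ensemble is a decision tree:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# each of the 100 trees is trained on a bootstrap subsample of the data;
# the ensemble classifies new cases by majority vote
bag = BaggingClassifier(n_estimators=100, max_samples=0.8,
                        bootstrap=True, random_state=0)
bag.fit(X, y)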

Stanford statistician Jerome H. Friedman introduced Gradient Boosting and a variant, Stochastic Gradient Boosting, in 1999. Gradient Boosting works in a manner similar to ADABoost, but uses a different measure to determine the cost of errors. Stochastic Gradient Boosting combines Gradient Boosting with random subsampling. In addition to improving model accuracy, this enhancement enables the analyst to predict model performance outside of the training sample.
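A sketch of gradient boosting in scikit-learn (synthetic, illustrative data); setting the subsample parameter below 1.0 yields the stochastic variant:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# subsample < 1.0 trains each tree on a random subsample of the cases,
# which is Friedman's stochastic variant of gradient boosting
gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, subsample=0.7, random_state=0)
gbm.fit(X, y)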

In 2001, Breiman and Adele Cutler3 proposed a technique they trademarked as “Random Forests”. The Random Forests algorithm combines bagging (random selection of subsets from the training data) with the random selection of features, or predictors. The algorithm trains a large number of decision trees from randomly selected sub-samples of the training data set, then outputs the class that is the mode of the classes output by individual trees. The principal advantage of Random Forests compared to other ensemble techniques is that its models generalize well outside of the training sample. Moreover, Random Forests produces variable importance measures that are useful for feature selection.
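A minimal Random Forests sketch (assuming scikit-learn and illustrative data), including the variable importance measures mentioned above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

# each tree sees a bootstrap sample of the cases and, at each split,
# a random subset of the features
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)

# variable importance measures, useful for feature selection
print(sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])[:5])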

Blended or stacked models are relatively new compared to the other techniques, but they have been used with great success in some highly visible competitions. A blended model leverages predictions from other models to develop an averaged prediction; the blended model outperforms any of the individual models. Blended models are more complex to train, since the analyst must train a number of base models first before building the blended model; they also take more time to produce predictions and may not be suitable for real-time applications.
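A simple blend can be sketched by averaging the predicted probabilities of two base models (assuming scikit-learn; the data and the 50/50 split are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_base, X_blend, y_base, y_blend = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# train two base models on one half of the data ...
models = [LogisticRegression(), RandomForestClassifier(n_estimators=200)]
probs = []
for m in models:
    m.fit(X_base, y_base)
    probs.append(m.predict_proba(X_blend)[:, 1])

# ... then blend their predictions; a full stack would train a second-level
# model on these predictions instead of simply averaging them
blended = np.mean(probs, axis=0)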

Scaling to Big Data

A few software vendors developed software for statistics in the 1970s, 1980s, and 1990s. SAS Institute, through its strong partnership with IBM, established a reputation as the “enterprise” vendor for statistics through its commitment to the IBM mainframe. SPSS, spun off from the National Opinion Research Center at the University of Chicago in 1975, took a different approach, embracing the PC when it was introduced in 1984. SPSS delivered the first Windows-based statistical software in 1992 and grew rapidly by targeting the business user.

In the 1990s, SAS developed software that ran single-threaded on single machines. As analytic data sets grew larger in the 1990s and early 2000s, SAS hardware partners recommended larger and larger servers with more computing power to handle the expanded workload. Computing professionals call this approach “scaling up ”—for more computing power, implement the software on a bigger computer.

Scaling up poses a number of issues as data sets grow larger. First, even the largest servers are too small for some projects. The limits of a single server force analysts working on larger jobs to break the data into pieces and process it serially; as a result, large jobs can run for days—or even weeks.

The cost of the “big boxes” promoted by hardware vendors to enable scaling up is another issue. Large machines can run into the millions of dollars. Moreover, a computing architecture based on large machines is difficult to size and manage, because each increment to computing power is expensive. There is a tendency for “big box” architectures to behave like freeways: fast and expansive when new, but crowded and congested shortly thereafter.

Accordingly, most organizations have shifted toward a “scale-out” computing model, where applications run on many low-cost commodity servers. The scale-out model is easier to align with demand, because the computing infrastructure expands in small increments. Scale-out architecture is one of the primary reasons organizations adopt4 Hadoop.

Some analytic tasks are easy to implement in a scale-out environment; we call these tasks embarrassingly parallel (see following note). Most model training algorithms are not embarrassingly parallel. Some are iterative, requiring multiple passes through the data; for others, item-level computations depend, in part, on other item-level computations and require interaction among distributed computing nodes.

An operation is embarrassingly parallel if computations on each data item are independent of computations on all other data items, and the product is a linear combination of distributed computations. Examples include SQL SELECT; scoring a linear model; and computing a statistical mean.
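A sketch of the idea in Python (the chunking and the three-worker pool are purely illustrative): computing a mean is embarrassingly parallel because each worker can return an independent partial sum and count, and the final answer is a simple combination of those partial results.

from multiprocessing import Pool

def partial_sum_and_count(chunk):
    # each worker computes its result independently of all other chunks
    return sum(chunk), len(chunk)

chunks = [list(range(0, 1000)), list(range(1000, 2000)), list(range(2000, 3000))]

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        partials = pool.map(partial_sum_and_count, chunks)
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    print(total / count)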

Tasks that are not embarrassingly parallel must be rewritten to run in a scale-out environment. This is expensive to do, and as we will show later in this chapter, there are just a few distributed engines on the market today.

Scaling to Big Data means working with larger data sets, but it also means working with diverse types (variety) and data in motion (velocity). We address machine learning with images, audio, video, speech, and other types of data under deep learning later, and we discussed streaming analytics in Chapter Six.

We tend to think of data volume in terms of items or rows in a table—a billion rows is a very big data set. However, the width of the data set—the number of columns, variables, or features—has a much greater impact on machine learning. Scientists have long recognized the Curse of Dimensionality, the computational problems associated with analyzing data with a large number of dimensions.

Columns, variables, features, and dimensions are closely related concepts that many people use interchangeably. A column is a set of values in a relational database table; a variable in computer programming is a symbolic name for a value that can change; a feature is a measurable property of an observed phenomenon; a dimension is a mathematical property. A thing has features; in a relational database, features map to columns; in a computer program, columns map to variables; in a mathematical discussion of the problem, variables map to dimensions.

High-dimension data poses several problems for the analyst. The computational complexity of a problem increases rapidly with the number of dimensions. Additional dimensions also increase the number of possible ways a model can be specified, mandating more experiments to train and tune the model. Also, in the case of linear regression, a large number of dimensions increases the odds that some of them are correlated, leading to biased parameter estimates.

Machine learning researchers have developed several different approaches to feature selection, a pre-processing step implemented prior to model training. Stepwise regression, a method that iteratively adds or drops variables and re-trains the model, was a popular technique in the 1990s. However, it has fallen out of fashion5 in favor of embedded methods, such as regularization, which progressively penalize additional variables.
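A sketch of an embedded approach (assuming scikit-learn and a synthetic, illustrative data set): the lasso’s L1 penalty drives uninformative coefficients to zero, and only the surviving features are retained.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=500, n_features=200, n_informative=10,
                       noise=5.0, random_state=0)

# LassoCV chooses the regularization strength by cross-validation;
# SelectFromModel keeps only the features with surviving coefficients
selector = SelectFromModel(LassoCV(cv=5)).fit(X, y)
print(X.shape, selector.transform(X).shape)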

Deep Learning

Three factors contributed to the growth of modern deep learning. The first of these is the introduction of general purpose computing on graphics processing units in the early 2000s. Graphics Processing Units (GPUs) are special chips originally developed to support computer gaming and image processing. GPUs are much more powerful than standard CPUs for certain types of tasks and have a highly parallel architecture. Support for floating point arithmetic and the development of APIs such as CUDA for general purpose computing make it practical to offload computing from CPUs to GPUs.

The Compute Unified Device Architecture, CUDA, created by GPU chip vendor Nvidia, is a software layer enabling programmers to use high-performance GPU chips for general-purpose computing.

A second factor contributing to the growth of deep learning was the development of knowledge and heuristics enabling practitioners to train the models effectively. Machine learning disciplines do not suddenly emerge by magic; successful application is the end result of a long process of experimentation and learning. Researchers struggled for years to solve the “exclusive-or” problem; due to the sheer complexity of deep learning models, it took years for the machine learning community to develop the skills and knowledge necessary to put the method to work.

The third factor is the huge expansion of digitized content—text, documents, images, audio, and video documented in the first chapter of this book. This huge expansion of digital content—what we now call Big Data—created entirely new applications for machine learning in areas such as sentiment analysis, natural language processing, topic modeling, image recognition, image search, and speech recognition. Existing machine learning methods were less suited to these problems, which entail searching for hidden or latent patterns in massively “wide” sets of unlabeled data.

Deep learning reached an early milestone in 2007, when Geoff Hinton of the University of Toronto published6 a seminal paper that outlined how a Deep Neural Network with multiple hidden layers could be trained layer by layer, thus breaking down the computational challenge into smaller and more tractable problems. Prior to that, research in speech and handwriting recognition had turned to so-called generative models .

Generative models are a class of statistical models that represent relationships by learning the joint probability distribution of the data, in contrast to discriminative models, which learn the conditional probability distribution. Examples of generative models include Gaussian Mixture Models, Hidden Markov Models, Latent Dirichlet Allocation, and Restricted Boltzmann Machines.

With expanded computing power and better methods, researchers working with deep learning started to show results. Beginning in 2009, Microsoft Research invested heavily in the application of deep learning to speech recognition and was able to significantly reduce7 error rates compared to other methods.8

Similar efforts at Google in the field of image recognition also paid off. In 2012, The New York Times reported9 that a Google Brain team used deep learning deployed over 16,000 computers to recognize unlabeled images among the millions of images in YouTube.

Thus, while interest10 in neural networks has declined over the past decade, interest in deep learning has increased—markedly so since 2012 (see Figure 8-1). In that year, mainstream publications like The New York Times 11 and The New Yorker 12 wrote stories about how companies like Apple, Microsoft, and Google use deep learning to solve problems in speech recognition, image recognition, 3D object recognition, and natural language processing.

Figure 8-1. Search interest for neural networks and deep learning (Source: Google Trends)

In 2015, Google, Facebook, and Microsoft released their deep learning frameworks to open source. We cover these frameworks later in this chapter.

Deep Learning Basics

In this section, we introduce the reader to some of the most important concepts in neural networks and deep learning.

Neural Networks

Since deep learning rests on the technology of neural networks, we begin with an introduction to key concepts in that field. This overview is necessarily simplified; there are volumes written on narrow subtopics in the field, and development is ongoing.

Animal brains are neural networks: networks of smaller cells, or neurons, linked together with synapses. As biologists studied animal brains, they built analog models of neural networks: physical devices that simulated brain function as well as possible using wires and light bulbs. These contraptions were artificial neural networks.

Neural networks as we know them today are symbolic representations of brain function coded in computer languages. A neural network represents a problem as a network of nodes (“neurons”) connected by directed graphs (“synapses”). Like animal brains, they are able to “learn” and “remember”. Figure 8-2 shows an example of a neural network.

Figure 8-2. Neural network

Neuroscientists developed neural networks as a way to simulate animal learning. However, the methods they developed are broadly applicable in other fields.

In a neural network, each neuron accepts mathematical input, processes the inputs with a transfer function, and produces mathematical output with an activation function. Neurons operate independently on their local data and on input from other neurons. Figure 8-3 shows a neuron and its functions.

Figure 8-3. Neuron

Neurons in neural networks use a variety of mathematical functions as activation functions. These functions are mathematical expressions of how the neuron transforms input data received from other neurons into output data that it passes to other neurons. In principle, the activation function can be any mathematical function; it is limited only by software capabilities and available computing power.

While a neural network may use linear functions, analysts rarely do so in practice; a neural network with linear activation functions and no hidden layer produces the same results as a linear regression model. Analysts are much more likely to use nonlinear activation functions, such as the logistic function; if a linear function is sufficient to model the target, there is no reason to use a neural network.
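A minimal sketch of a single neuron, like the one in Figure 8-3, in Python with NumPy (the weights, bias, and inputs are purely illustrative), using the logistic function as the activation:

import numpy as np

def logistic(z):
    # a common nonlinear activation function
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias):
    # transfer function: a weighted sum of the inputs plus a bias term;
    # activation function: the logistic squashes the result into (0, 1)
    return logistic(np.dot(weights, inputs) + bias)

print(neuron_output(np.array([0.5, -1.2, 3.0]),
                    np.array([0.4, 0.1, -0.6]),
                    bias=0.2))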

The neurons or nodes of a neural network form layers. The input layer accepts mathematical input from outside the network, while the output layer accepts mathematical input from other neurons and transfers the results outside the network. A neural network may also have one or more hidden layers that process intermediate computations between the input layer and output layer. Deep neural networks are neural networks with at least two hidden layers. Figure 8-4 shows a deep neural network.

Figure 8-4. Deep neural network

The input and output layers of a neural network usually represent real-world facts: the input layer represents a vector of data we want to use as predictors, and the output layer represents a target variable.13 Hidden layers, on the other hand, represent abstract concepts similar to factors in statistical factor analysis, except that they are not directly interpretable and simply serve to improve the accuracy of the model. Hidden layers enable neural networks to learn arbitrarily complex functions.

Practitioners classify neural network architectures according to the network topology, information flows within the network, mathematical functions, and training methods. The two most widely used architectures are:

Multilayer Perceptron. The Multilayer Perceptron (MLP) is a feedforward network; this means that neurons in one layer accept input from neurons in previous layers, but do not accept input from neurons in the same layer or subsequent layers. In an MLP, the parameters of the model include the weights assigned to each connection and to the activation functions in each neuron. Practitioners use a technique called backpropagation to train the network.

Radial Basis Function Network. A Radial Basis Function (RBF) network uses radial basis functions, a particular type of mathematical function, as activation functions in the neurons. This type of neural network is well suited to function approximation, classification, and modeling dynamic systems.

Analysts train a neural network by using one of many optimization algorithms. The backpropagation technique uses a data set in which values of the target (output layer) are known, and infers parameter values that minimize prediction errors. The method proceeds iteratively, first computing the target value with training data, then using information about prediction errors to adjust the weights in the network.

There are several backpropagation algorithms; gradient descent and stochastic gradient descent are the most widely used. Gradient descent uses arbitrary starting values for the model parameters and computes an error surface; it then seeks out a point on the error surface that minimizes prediction errors. Gradient descent evaluates all cases in the training data set each time it iterates. Stochastic gradient descent works with a random sample of cases from the training data set. Consequently, stochastic gradient descent converges more quickly than gradient descent, but may produce a less accurate model.
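A sketch of both methods in Python with NumPy, fitting a simple linear model with squared-error loss (the data, learning rate, and iteration counts are purely illustrative):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
true_w = np.array([2.0, -1.0, 0.5])
y = X.dot(true_w) + 0.1 * rng.randn(1000)

learning_rate = 0.05

# full-batch gradient descent: every case contributes to every update
w = np.zeros(3)                      # arbitrary starting values
for step in range(200):
    gradient = -2.0 * X.T.dot(y - X.dot(w)) / len(y)
    w -= learning_rate * gradient

# stochastic gradient descent: update on one randomly ordered case at a time
w_sgd = np.zeros(3)
for epoch in range(5):
    for i in rng.permutation(len(y)):
        error = y[i] - X[i].dot(w_sgd)
        w_sgd -= learning_rate * (-2.0 * error * X[i])

print(w, w_sgd)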

Neural networks are complex techniques that require many choices by the practitioner. Relative to other machine learning techniques, however, artificial neural networks have four key advantages: an ability to automatically detect and model complex interactions among features; the ability to learn low-level features from minimally processed raw data; the ability to work with a large number of classes; and the ability to work with unlabeled data.

Taken together, these four strengths mean that artificial neural networks can produce useful results where other methods fail, and they have the potential to build more accurate models than other methods.

Unlabeled data lacks information about what it represents. A bit-mapped untagged photo, for example, is a stream of data characterizing the value of points in two-dimensional space, but does not include data about the subject of the picture.

Deep Learning Architectures

Building on neural networks, deep learning practitioners use complex new architectures that are well-suited to the key problems in content analytics posed by the tsunami of Big Data. Some of the most popular architectures include:

A Restricted Boltzmann Machine (RBM) is a shallow network with two layers: an input, or visible layer, and a hidden layer. Neurons or nodes in the input layer map to the features in the input data; for example, if a group of texts have 5,000 unique words, there will be one node in the input layer for each word. Nodes in the hidden layer represent relationships among the entities represented in the input layer; they are conceptually similar to factors in statistical factor analysis.

Deep-Belief Networks (DBN) are stacks of Restricted Boltzmann Machines. The hidden layer of each RBM serves as the input layer to the RBM above it in the stack. Analysts use DBNs to mine word vectors in text analytics, for image and video recognition and for voice recognition.

A Deep Autoencoder includes two symmetrical Deep-Belief Networks, each with four to five Restricted Boltzmann Machines arranged in layers. Deep autoencoders are useful for topic modeling, where the goal is to model abstract topics distributed across many documents. They are also used for data compression, and for image search applications, where images are first compressed into fixed-length numerical vectors.

Recursive Neural Tensor Networks have a tree structure with a neural network at each node of the tree. They are useful in text analytics, where they operate with word vectors.

Stacked Denoising Autoencoders (SDA) are stacks of another type of neural network called an autoencoder. The purpose of an autoencoder is to learn a representation of a set of data that reduces its dimensionality; an SDA has multiple hidden layers, each of which is an autoencoder. SDAs are useful for supervised document classification; for example, if we want to classify each document in a batch of documents into one of several groups for subsequent routing.

Convolutional Neural Networks (CNN) are a type of deep neural network inspired by the structure of the visual cortex in animals; they perform object recognition with images. Unlike multilayer perceptrons, whose neurons are fully connected, neurons in a CNN are locally connected to neurons in the immediate region. In image recognition, for example, one neuron represents one pixel in an image; in a CNN, that pixel may be connected to surrounding pixels, but not to a pixel in the far corner of an image. This approach is efficient when working with images.

Recurrent Networks (RNN) recognize patterns in sequences of data: time series data, handwriting, text, speech, or genomes. Feedforward networks learn from data one case at a time, adjusting weights to minimize errors as they proceed through the data. RNNs, on the other hand, learn from both the current case and from the state of their own output as of the previous case, which serves as a kind of memory. Unlike a feedforward network, in an RNN neurons may be connected to any other neuron in any layer, not just to neurons in previous layers.

Machine Learning in Action

This section presents 12 examples of machine learning in action.

Baidu.14 In 2014, Baidu , a Chinese search engine company, announced development of a speech recognition system it calls Deep Speech.15 Baidu claims that in noisy environments like restaurants, the system achieves an accuracy rate of 81%, a significant improvement over commercially available speech recognition software. The speech recognition system uses a Recurrent Neural Network (RNN). Baidu reported16 using a computer cluster that is able to support deep learning models with about 100 billion neural connections (or synapses, not neurons).

Carolinas Healthcare System.17 For hospitals, patient readmission is a serious matter, and not simply out of concern for the patient’s health and welfare; Medicare and private insurers penalize hospitals with a high readmission rate, so hospitals have a financial stake in making sure that they only discharge patients who are well enough to stay healthy. The Carolinas Healthcare System (CHS) uses machine learning to construct risk scores for patients, which case managers use to make discharge decisions. This system enables better utilization of nurses and case managers, prioritizing patients according to risk and complexity of the case. As a result, CHS has lowered its readmission rate from 21% to 14%.

Cisco.18 Marketers use “propensity to buy” models as a tool to determine the best sales and marketing prospects and the best products to offer. With a vast array of products to offer, from routers to cable TV boxes, Cisco’s marketing analytics team trains 60,000 models and scores 160 million prospects in a matter of hours. By experimenting with a range of techniques from decision trees to gradient boosted machines, the team has greatly improved the accuracy of the models—that translates to more sales, fewer wasted sales calls, and satisfied sales reps.

Comcast.19 For customers of its X1 interactive TV service, Comcast provides personalized real-time recommendations for content based on each customer’s prior viewing habits. Working with billions of history records, Comcast uses machine learning techniques to develop a unique taste profile for each customer, then groups customers with common tastes into clusters. For each cluster of customers, Comcast tracks and displays the most popular content in real time, so customers can see what content is trending now. The net result: better recommendations, higher utilization, and more satisfied customers.

Dstillery. 20 Ad tech company Dstillery uses machine learning to help companies like Verizon and Williams-Sonoma target digital display advertising on real-time bidding (RTB) platforms. Using data collected about an individual’s browsing history, visits, clicks, and purchases, Dstillery runs predictions thousands of times per second, handling hundreds of campaigns at a time; this enables the company to significantly outperform human marketers targeting ads for optimal impact per dollar spent.

GenomeDx Biosciences.21 GenomeDx Biosciences is a startup in the business of genomic testing. To evaluate the efficacy of a genomic test in improving the diagnosis of prostate cancer, GenomeDx worked with major hospitals and medical schools to develop a clinical trial with 1,537 patients. The genetic test produced a vector of 46,000 features, far too many to analyze with conventional methods. Using a Deep Neural Network, GenomeDx built a classifier that predicted post-surgery outcomes for cancer patients more effectively than any other available method.

Jaguar Land Rover.22 New cars built by Jaguar Land Rover have 60 onboard computers that produce 1.5 gigabytes of data every day across more than 20,000 metrics. Engineers at the company use machine learning to distill the data and to understand how customers actually use the vehicle. By working with actual usage data, designers can predict part failure and potential safety issues; this helps them to engineer vehicles appropriately for expected conditions.

Microsoft.23 In March 2015, a Microsoft team published24 a paper documenting results from their computer vision system, which is based on deep convolutional networks (CNNs). The team tested the system on the ImageNet 2012 classification data set, which contains 1.2 million training images, 50,000 validation images, and 100,000 test images. The task assigned to the system is to assign each image into one of 1,000 classes. The Microsoft system achieved a 4.94% error rate, which actually outperformed humans, who classified the images with a 5.1% error rate.

NBC Universal.25 NBC Universal stores hundreds of terabytes of media files for international cable TV distribution; efficient management of this online resource is necessary to support distribution to international clients. The company uses machine learning to predict future demand for each item based on a combination of measures. Based on these predictions, the company moves media with low predicted demand to low-cost offline storage. The predictions from machine learning are far more effective than arbitrary rules based on single measures, such as file age. As a result, NBC Universal reduces its overall storage costs while maintaining client satisfaction.

PayPal.26 Online payments company PayPal handles more than $10 billion in money transactions every month. At that volume, small improvements in fraud detection and prevention translate to significant bottom-line impact: each 1% improvement in prediction accuracy adds $1 million in operating contribution. Working with a data set of 160 million records and 1,500 features, the company’s machine learning team continuously updates its fraud detection models, seeking small improvements. The company reports a “major leap forward” in its abilities since it started using nonlinear methods several years ago, and additional improvements since it started using deep learning three years ago. PayPal’s deep learning algorithms can analyze thousands of latent features, such as time signals, actors, and geographic location, and have produced a 10% improvement over the previous champion fraud detection model.

Spotify.27 A team at Spotify used a hybrid deep convolutional network to learn similarities and differences among songs based on spectrograms of the audio signal. Trained on 30-second tracks extracted from the million most popular songs on Spotify, the network learned to predict the latent representations of the songs obtained from a collaborative filtering model. (By doing so, Spotify can recommend playlists with little or no prior usage data.)

U.S. Department of Energy.28 Working with the Berkeley Lab, the National Energy Research Scientific Computing Center (NERSC) uses deep learning to analyze petabytes of data produced by climate simulation models. Using deep learning for pattern recognition, NERSC reports 95% accuracy detecting extreme weather events, such as tropical cyclones, atmospheric rivers, and weather fronts.

The New Machine Learning Software

In this section, we briefly describe scalable machine learning and deep learning software platforms, in four groups:

  • Open source distributed engines

  • Commercial distributed engines

  • In-database libraries

  • Deep learning frameworks

We do not include open source R and Python libraries, which we discussed in Chapter Four. While these languages are valuable developmental tools, they are not inherently scalable; R and Python users working with large data sets are best served by using one of the scalable engines listed here.

Many popular end user tools push processing down to one of the engines listed in this chapter. We cover those tools in Chapter Nine.

The distributed machine learning engines described in this section were originally developed to run either on clustered servers or special-purpose appliances; they were not originally designed to run in Hadoop. Under Hadoop 2.0 (after the release of YARN), all were quickly adapted to run in Hadoop under YARN.

Open Source Distributed Engines

There are just two open source general-purpose distributed engines for machine learning: Apache Spark and H2O. There are many other open source machine learning packages, such as Weka or Vowpal Wabbit, that do not support distributed model training. There are also machine learning tools, such as XGBoost, that support single algorithms. We limit the discussion to software that supports multiple algorithms.

Apache Spark Machine Learning

Apache Spark MLlib is a machine learning library that runs on top of Spark. MLlib has two primary APIs: the original API, which works with Spark RDDs (see Chapter Five), and a newer API, which works with Spark DataFrames. (The PySpark and SparkR APIs also support machine learning functions.) While the Spark team continues to support the RDD-based API, all new development takes place in the DataFrames API.

The RDD-based API includes basic statistical tools, including summary statistics, correlations, stratified sampling, hypothesis testing, streaming significance testing, and random data generation. For machine learning, the library includes:

  • Feature extraction and transformation, including tools for feature vectorization, text mining, standardization, normalization, feature selection, and vector transformation.

  • Binary classification with linear support vector machines, logistic regression, decision trees, Random Forests, Gradient-Boosted Trees, and naïve Bayes classifier.

  • Regression with linear least squares, Lasso, ridge regression, decision trees, Random Forests, Gradient-Boosted Trees, and isotonic regression.

  • Dimensionality reduction, with Singular Value Decomposition (SVD) and Principal Component Analysis (PCA).

  • Clustering with k-means, Gaussian Mixture, Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA), bisecting k-means, and streaming k-means.

  • Frequent pattern mining with FP-Growth, Association Rules, and PrefixSpan for sequence analysis.

  • Collaborative filtering with Alternating Least Squares.

The API also includes a full set of statistics for model evaluation, including common metrics such as precision, recall, F-measure, ROC, and AUC. Users can also export PMML models for selected algorithms.

For developers who want to create their own algorithms, Spark exposes optimization primitives, including gradient descent, stochastic gradient descent, and the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm.

The DataFrames-based API models a complete machine learning pipeline. Users work with DataFrames defined in Spark SQL rather than directly with RDDs. The basic elements in the API are transformers and estimators. Transformers are algorithms that perform operations on one DataFrame to produce another DataFrame; for example, an operation that standardizes all variables in a data set. Estimators are algorithms that operate on a DataFrame to create a transformer; for example, a linear regression algorithm produces a linear model, which is a transformer that a user can apply to another DataFrame.
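A minimal sketch of the pattern in Python (assuming Spark 2.0 or later with PySpark available; the tiny inline data set is purely illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

training = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
     (0.0, Vectors.dense([2.0, 1.0, -1.0])),
     (1.0, Vectors.dense([0.0, 1.3, 1.0]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)   # an estimator
model = lr.fit(training)                             # fitting yields a transformer
model.transform(training).select("label", "prediction").show()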

The library includes three types of prebuilt transformers:

  • Feature extractors, which create features from raw data using algorithms like TF-IDF and Word2Vec.

  • Feature transformers, which scale, convert, or modify features.

  • Feature selectors, which select a subset from a larger set of features.

Machine learning functionality is rapidly expanding:

  • Binary classification with logistic regression, decision trees, Random Forests, Gradient-Boosted trees, and Multilayer Perceptron.

  • Multiclass classification through a “one versus all” algorithm used with binary classifiers.

  • Regression with linear regression, decision trees, Random Forests, Gradient-Boosted trees, and survival regression.

  • Clustering algorithms with k-means and Latent Dirichlet Allocation (LDA).

Spark Packages further extend Spark’s machine learning with unique and innovative capabilities contributed by third parties. As of early 2016, there are more than 50 packages for machine learning.

For R users, the SparkR interface offers a selection of machine learning algorithms, including a Gaussian GLM model and Binomial GLM model. The Spark team has designed SparkR to operate in a manner similar to other R packages.

In addition to Spark’s native machine learning libraries and Spark Packages, we note Apache SystemML, a module that runs on top of MapReduce and Spark.

SystemML is a declarative machine learning system developed by IBM and donated to the Apache Foundation; it is now an Apache Incubator project. Interacting with the software through Python and R APIs, users specify machine learning algorithms to run; SystemML generates optimized runtime plans for execution locally or in MapReduce or Spark.

As of early 2016, SystemML supports:

  • Descriptive statistics, including univariate, bivariate, and stratified bivariate statistics.

  • Classification techniques, including multinomial logistic regression, support vector machines, naïve Bayes, decision trees, and Random Forests.

  • k-means clustering.

  • Regression techniques, including linear regression, stepwise linear regression, generalized linear models, and stepwise generalized linear models.

  • Matrix factorization techniques, including principal components analysis.

  • Survival analysis, including Kaplan-Meier and Cox Proportional Hazards methods.

Users interact with SystemML through a high-level language (DML) with syntax similar to R or Python. DML includes linear algebra primitives, statistical functions, and ML-specific concepts. The algorithms, which are fully customizable, are dynamically compiled and optimized based on data and cluster characteristics.

Apache SystemML has a steadily growing code base and active contributor community.29

H2O

H2O is an open source distributed in-memory computing platform designed for deployment in Hadoop, in free-standing clusters, or in the cloud. H2O has its own distributed computing engine; it works with data in HDFS, S3, SQL, and NoSQL datastores, and with Apache Spark through the Sparkling Water interface.

Current functionality includes deep learning, generalized linear models, gradient boosted classification and regression, k-means clustering, naive Bayes classifier, principal components analysis, and Random Forests.30 The software also includes tooling for data transformation, model assessment, and scoring. H2O exports scoring objects as Plain Old Java Objects (POJOs).

Users interact with the software through Java, Scala, Python, and R APIs, or through an easy-to-use web interface.
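A minimal sketch of the Python API (assuming the h2o package is installed and a local cluster can be started; the file name and the "label" column are hypothetical):

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()                                     # start or connect to a local H2O cluster
frame = h2o.import_file("training_data.csv")   # hypothetical file with a "label" column
frame["label"] = frame["label"].asfactor()     # treat the target as categorical
train, valid = frame.split_frame(ratios=[0.8])

gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=5)
gbm.train(y="label", training_frame=train, validation_frame=valid)
print(gbm.auc(valid=True))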

H2O.ai provides commercial support for the open source software. In July, 2014, H2O.ai received $8.9 million in Series A funding from a group of investors; subsequently, in November 2015, the company announced31 a $20 million Series B round of funding. The company claims a number of public reference customers, including AT&T, Comcast, Kaiser Permanente, Progressive Insurance, Transamerica, Walgreens, and Zurich Insurance. There is a rapidly growing user community for H2O; H2O.ai claims more than 40,000 users in more than 5,000 organizations.

H2O has a large and steadily growing code base.32

Commercial Distributed Engines

Three software vendors offer commercially licensed distributed machine learning engines; SAS offers three different engines.

SAS High Performance Analytics

Industry leader SAS introduced SAS High Performance Analytics (HPA) in late 2012. HPA is a distributed in-memory analytics platform designed to run on specially built appliances from Oracle, Pivotal, or Teradata, and subsequently repurposed for clusters of commodity hardware. HPA serves as a back-end component for SAS Enterprise Miner and other SAS clients, enabling selected SAS procedures to run in distributed mode on clustered hardware.

While the product supports multiple databases, it lacks an open API and can only be called from SAS. HPA reads data into memory quickly through a parallel load, but does not keep data in memory and does not support high concurrency.

SAS introduced LASR Analytics Server in 2013 to serve as the back-end for a new visualization product (SAS Visual Analytics). LASR Analytics Server, unlike HPA, keeps data in memory and supports high concurrency. Neither architecture offers capabilities equivalent to a true in-memory database, such as durability guarantees or the ability to update data without reloading the entire data set.

In April 2016, SAS announced a third modern architecture, branded as SAS Viya, which the company positions as “open, elastic and scalable.” As of August 2016, the software is in limited preview for existing SAS customers only, with planned general availability later in the third quarter of 2016.

Microsoft R Server

Microsoft R Server is a commercially licensed software bundle that includes Microsoft R Open, an enhanced R distribution; integration and connectivity tools; and ScaleR, a library of distributed algorithms for predictive analytics with an R interface. The software runs on Linux; it can be deployed in Cloudera, Hortonworks, or MapR Hadoop distributions, or in Teradata Database. Microsoft offers the software on Windows through R Services for SQL Server 2016.

Microsoft R Server works with data in text files, HDFS, relational databases, SAS data sets, and other common formats. Capabilities supported in ScaleR include tools for data transformation, descriptive statistics, linear and logistic regression, generalized linear models, decision trees, ensemble models, and k-means clustering.33 The software supports native model scoring and model export through PMML. The deployment interface supports integration with Tableau, Qlik, and custom web applications.

Skytree

Skytree is a Silicon Valley-based startup that develops and markets commercial software for machine learning. Skytree’s core software began as an academic machine learning project (FastLab at Georgia Tech); the developers launched the company as a commercial software vendor in January 2013. The software runs under YARN on Cloudera, Hortonworks, MapR, and Amazon EMR, and integrates with Apache Spark to create what the company calls the Unified Machine Learning Platform.

The software supports data visualization, feature engineering, and machine learning algorithms for classification, regression, clustering, inference, and dimension reduction. Skytree claims an automated model selection capability, trademarked as AutoModel, which it is attempting to patent.

Users interact with the software through the Skytree Command Line Interface (CLI), Java and Python APIs, or a browser-based GUI.

In-Database Libraries

In-database machine learning libraries work inside relational databases, generally through table functions. Users interact with the machine learning library through SQL, or through applications that can pass SQL to the database through an open interface.

We include here only those machine learning libraries that can support multiple database platforms. This excludes from consideration the native machine learning tools built into IBM DB2, IBM Netezza, Microsoft SQL Server, Oracle Database, and Teradata Database. While the machine learning capabilities of these databases may be useful (especially in organizations that are fully committed to the database platforms), most organizations are better off investing in capabilities that are not tied to single vendors.

Apache MADlib

Apache MADlib is an open source library of machine learning algorithms designed to operate in massively parallel databases, without data movement. Development started in 2010 as a collaboration between researchers at UC Berkeley and data scientists at EMC Greenplum (now Pivotal Software).

Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status. While the project seeks to broaden its contributor base, most recent commits come from two Pivotal employees.

The project explicitly supports PostgreSQL, Pivotal Greenplum Database, and Pivotal Hawq; in principle, users can implement the library in any database that supports user-defined aggregates (UDAs), such as Impala.34

The MADlib algorithms operate as table functions in databases; users invoke them through SQL. MADlib also supports feature extraction from text and low-rank matrix factorization together with a number of utilities for discovery, validation, and model implementation. Machine learning capabilities include 10 different regression methods, linear systems, matrix factorization, tree-based methods, association rules, clustering, topic modeling, text analysis, time series analysis, and dimension reduction techniques.

Commercial support for MADlib is unclear at this time. Most MADlib users are customers of Pivotal Software, and that company provided consulting and technical support. Dell recently acquired EMC, Pivotal Software’s parent company; meanwhile, shifting project governance from Pivotal to Apache will likely expand the user and contributor base.

Fuzzy Logix DB Lytix

DB Lytix, a commercial software offering from Fuzzy Logix, is a library of more than 800 functions for machine learning and advanced analytics. Functions run as database table functions in relational databases (Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster, and Teradata Database) and in Hadoop through Hive. DB Lytix also runs in GPU devices; Fuzzy Logix offers the Tanay Zx appliance for GPU-based analytics.

Users invoke DB Lytix functions from SQL, R, through BI tools, or from custom web interfaces. Functions support a broad range of machine learning capabilities, including feature engineering, model training with a rich mix of supported algorithms, plus simulation and Monte Carlo analysis. All functions support native in-database scoring. The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

In November, 2015, Fuzzy Logix announced35 that it raised $5.5 million in venture capital.

Deep Learning Frameworks

A recent article in VentureBeat lists36 no fewer than 15 different software frameworks for deep learning. We describe some of the most promising open source projects in the following sections.

CNTK

The Computational Network Toolkit (CNTK) is a product of Microsoft Research. Microsoft developed CNTK to improve computer speech recognition37 and uses38 it in products such as Windows Cortana, Skype Translator, and Project Oxford Speech APIs. Microsoft released39 the software to open source in January 2016.

Like TensorFlow and Theano, CNTK represents networks as a graph that represents mathematical operations as nodes; the edges between nodes represent multidimensional data arrays. This approach allows users to invent new network architectures and layer types. The software runs on standard CPUs as well as graphical processing units (GPUs), machines with multiple GPUs, and distributed on a cluster of multi-GPU machines.

Users interact with CNTK by first creating a configuration file, then running the software from a command-line interface. There is no API.

CNTK is based on C++, so developers can compile trained models and deploy them across platforms.

TensorFlow

TensorFlow is the second generation of a machine learning system developed by Google scientists and engineers. Google uses its first generation system, called DistBelief, in a number of Google applications, including search, voice search, photo recognition, and video matching40. DistBelief learns concepts such as “cat” from unlabeled YouTube images, and improves speech recognition in the Google app. It won41 ImageNet’s Large Scale Visual Recognition Challenge in 2014.

Google engineers simplified and rebuilt the DistBelief code to create TensorFlow; in November, 2015, Google released42 a reference implementation of TensorFlow to open source under an Apache license. Although Google’s internal version of TensorFlow can distribute workload over clustered machines, the open source version runs on a single machine only43. It supports GPUs through CUDA extensions. Supported operating systems include Linux and Mac OS. The package supports Python and C++ APIs.

TensorFlow models machine learning operations in the form of a graph that represents mathematical operations as nodes; the edges between nodes represent multidimensional data arrays (“tensors” in Google terminology). Although Google engineers developed TensorFlow for deep learning, the system can be generalized to other machine learning operations.
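A minimal sketch using the original graph-and-session Python API from the open source release (the toy linear model is purely illustrative):

import tensorflow as tf

# placeholders and variables are nodes; the edges between them carry tensors
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
W = tf.Variable(tf.zeros([3, 1]), name="weights")
b = tf.Variable(tf.zeros([1]), name="bias")
y = tf.matmul(x, W) + b            # matmul and add are operation nodes

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))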

Developers can compile trained models and deploy them on a variety of devices. However, as of Spring 2016, they cannot be deployed on Windows.

In February, 2016, Google released44 TensorFlow Serving, a software package designed to simplify the deployment of trained models. The software works natively with TensorFlow, and it can also support other tools.

Theano

Theano is a Python library for numerical computation developed by scientists at the Université de Montréal. It allows users to efficiently define, optimize, and evaluate mathematical expressions with multi-dimensional arrays. While Theano’s capabilities are not limited to deep learning, its transparent support for GPU processing makes it a popular platform for deep learning practitioners.

Like CNTK and TensorFlow, Theano represents neural networks as a symbolic graph. Theano was the first to do so, and due to its maturity most state of the art network architectures are available on the platform.

Since Theano lacks a low-level interface, it is less suitable for production applications due to Python overhead. In performance benchmarks, Theano lags45 other deep learning frameworks, such as TensorFlow, Caffe, and CNTK. It supports single GPU machines only.

DL4J

Deeplearning4j (DL4J) is an open source computing framework written in Java that supports deep learning algorithms. Skymind, a small San Francisco-based startup, leads software development for the project and provides commercial support. Skymind distributes DL4J under an Apache 2.0 license.

DL4J is a distributed and multi-threaded framework; it is integrated with Hadoop and Spark and trains models within the cluster. In a distributed environment, DL4J shards, or splits, a large data set and passes the shards to worker nodes for execution. Each node trains a model on its local data; DL4J then iteratively averages the parameters to produce a single model.

Written in Java, DL4J also offers APIs to the related Scala and Clojure languages. It supports standard CPUs and GPUs.

Caffe

Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and released under an open source BSD license. Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN).

Caffe is written in C++. Users interact with Caffe through pycaffe, a Python package, or through a command-line interface. Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

In February 2016, Yahoo released46 CaffeOnSpark, a package that enables Spark users to embed Caffe deep learning into Spark processes. Yahoo has successfully applied CaffeOnSpark to image recognition problems, significantly improving image recognition accuracy by training with millions of photos from the Yahoo Webscope Flickr Creative Commons data set.

Apache SINGA

Currently in Apache Incubator status, Apache SINGA is an open source distributed deep learning platform for training deep learning models on large data sets. Researchers from the National University of Singapore lead the development team.

The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks. The project includes a stochastic gradient descent algorithm for model training.

SINGA currently supports GPU processing on a single node. Training on a GPU cluster is under development.

Torch

Torch is an open source scientific computing framework developed by a team of engineers from Facebook, Google, and Twitter. First released in 2002, the software is available under a BSD license.

Torch supports packages for multi-dimensional tensors, neural networks, optimization, 2D and 3D plotting, file manipulation, and image processing. Users can build Convolutional Neural Networks (CNN), including temporal convolution, as well as Recurrent Neural Networks (RNN).

The package offers an API in LuaJIT, an easy-to-use scripting language, so defining new network architectures is simple. While LuaJIT is easy to learn and use, it is more difficult to integrate into a production pipeline.

The New Machine Learning

The war of words between devotees of statistical techniques and machine learning techniques is largely over. Except for a few holdouts, the machine learning camp and its pragmatic approach has won the day. Practitioners freely mix techniques from the two camps using methods and procedures perfected by the machine learning community.

Highly visible competitions contribute to this convergence. Results from entries using different techniques are transparent to everyone. Teams using the most powerful techniques win; teams that are unwilling to part with traditional techniques lose.

Ensemble learning techniques, first developed in the 1990s, are increasingly mainstream. Several factors contribute to this development: high quality open source software implementations, increased availability of computing power at reduced cost, and visible successes in machine learning competitions.

As analysts grapple with Big Data, software developers have introduced distributed machine learning engines with a scale-out architecture. The transition to distributed engines is a discontinuity in the market. Single-threaded server-based machine learning software is increasingly seen as obsolete, creating opportunities for startups. Building distributed computing platforms is expensive and difficult, so many developers leverage existing open source frameworks such as Apache Spark.

Deep learning has emerged as a practical technique to address the most challenging problems in machine learning, such as speech and image recognition. Declining costs of computing and the emergence of high-performance low-cost GPU-based platforms have accelerated the adoption and use of deep learning.

There is growing interest in self-service machine learning for business users. We cover this in Chapter Nine, together with automated machine learning.

Footnotes

10 Measured on Google Trends.
