1. Fundamentals

Disruption in the Analytics Value Chain

Thomas W. Dinsmore

Newton, Massachusetts, USA

The analytics business is booming. Technology research firm IDC estimates[1] that total spending for analytic services, software, and hardware exceeded $120 billion in 2015; through 2019, IDC forecasts that spending will increase to $187 billion, an 11% compound annual growth rate[2].

So, if analytics is such a hot field, why are the industry leaders struggling?

  • Oracle’s cloud revenue growth[3] fails to offset declining software and hardware sales[4].

  • SAP’s cloud revenue grows, but total software revenue is flat[5].

  • IBM reports[6] 16 straight quarters of declining revenue. Mass layoffs ensue[7].

  • Microsoft underperforms[8] analysts’ expectations despite 120% growth in Azure cloud revenue.

  • Predictive analytics leader SAS reports[9] five years of low single-digit revenue growth; EVP departs[10].

  • Data warehousing leader Teradata shuffles its leadership team after four years of declining product revenue[11].

Product quality is not the problem. Each company offers products that industry analysts rate highly:

  • Forrester and Gartner both[12] recognize[13] IBM, SAS, SAP, and Oracle as leaders in data quality tools.

  • Gartner rates[14] Oracle, SAP, IBM, Microsoft, and Teradata as leaders in data warehousing.

  • Forrester rates[15] Microsoft, SAP, SAS, and Oracle as leaders in agile business intelligence.

  • Gartner recognizes SAS and IBM as leaders in Advanced Analytics[16].

The answer, in a word, is disruption[17]. Powerful forces are rearranging the industry:

  • Digital transformation of the economy and rapidly declining storage costs produce a data tsunami.

  • The number of data sources is exploding. Data sources are everywhere: on-premises, in the cloud, in consumers’ pockets, in vehicles, in RFID chips, and so forth.

  • Data governance is complicated by decentralized data ownership as functional executives control an increasing share of technology spending.

  • The open source software business model offers an increasingly attractive alternative to commercial software licensing.

  • Increasingly, the Hadoop ecosystem displaces conventional data warehousing; R and Python displace commercial analytic software.

  • The elastic business model made possible by cloud computing undercuts conventional software licensing and provisioning.

  • Widely available and inexpensive computing power makes computationally intensive techniques like Deep Learning practical.

Consider what has happened to Teradata. Late in 2012, the company started missing sales targets; in early 2013, it stunned investors by reporting an absolute decline in sales. Management offered excuses; Wall Street punished the stock, driving it down by half in the face of an overall bull market.

From 2013 through early 2016, Teradata continued to miss sales and earnings targets; Wall Street drove the stock price down to a fraction of its 2012 peak. While it is tempting to blame the problem on poor leadership, Teradata’s persistent failure to forecast its own sales and earnings indicates something amiss. The world changed; the value networks created in Teradata’s rise to leadership no longer exist; the mental models managers used to understand the market no longer work.

Disruptive Innovation

Clayton Christensen of the Harvard Business School outlined[18] the theory of disruptive innovation in 1997. We summarize the theory briefly; for an extended discussion, read Christensen’s book:

  • Industries consist of value networks, collections of suppliers, channels, and buyers linked by relationships.

  • Innovations disrupt industries when they create a new value network.

  • Not all innovations are disruptive. Many innovations are introduced by market leaders to sustain a competitive position.

  • Disruptive innovations tend to be introduced by outsiders.

  • Purely technological innovation is not disruptive; what matters is the business model enabled by the new technology.

Christensen identified two forms of disruption. Low-end disruption occurs when industry leaders enhance products faster than customers can assimilate the enhancements; the disruptor enters the market with a “good enough” product and a better value proposition. The disruptor’s innovation makes it possible to serve customers at a lower cost than the industry leaders can deliver.

New market disruption takes place when the disruptor innovates in ways enabling it to serve customers that are not served by the industry leaders.

In this book, we discuss two kinds of disruption. The first is disruptive innovation within the analytics value chain (a concept we explore later in this chapter). The second is industry disruption by innovations in analytics.

There are many examples of disruption within the analytics value chain:

  • Hadoop disrupts the data warehousing industry from below. Hadoop does not do everything a relational database can do, but it does just enough to offer an attractive value proposition for the right use cases. When first introduced, Hadoop’s capabilities were quite limited relative to data warehouse appliances. But Hadoop’s flexibility and low cost were highly attractive for applications that did not need the performance and features of a data warehouse appliance. While established vendors struggle with flat or declining revenue, Hadoop distributors grow at double-digit rates.

  • Tableau virtually created the market for agile self-service discovery. Tableau has no charting and visualization features not already available in mainstream business intelligence tools. But while business intelligence vendors targeted the IT organization in large enterprises and continuously added features, Tableau targeted the end user with a simple, easy-to-use, and versatile tool. As a result, Tableau has increased its revenue tenfold in five years, leapfrogging many other BI vendors.

Examples of disruption by analytics are less prevalent, but they do exist:

  • General-purpose credit scoring introduced by Fair, Isaac and Co. in 1987 virtually created a national market in credit cards. Previously, banks issued credit cards to their local customers, with whom they had an established relationship. Uniform credit scoring enabled a few large issuers to identify creditworthy customers in the general population, without a prior relationship.

  • When the U.S. Securities and Exchange Commission authorized electronic trading in regulated securities in 1998, market participants quickly moved to develop algorithms that could arbitrage between markets, arbitrage between indexes and the underlying stocks, and exploit other short-term opportunities. Traders that most effectively deployed machine learning for electronic trading grew at the expense of other traders.

The relative importance of the two kinds of disruption depends on the reader’s perspective. Disruption within the analytics value chain is pertinent for readers who plan to invest in analytics technology for their organization. Technologies at risk of disruption are risky investments; they may have abbreviated useful lives, and their suppliers may suffer from business disruption. Taking a “wait-and-see” attitude toward disrupted technologies makes good sense, if only because prices will likely decline in the future.

For startups and analytics practitioners, disruption by analytics is key. To succeed, startups must disrupt their industries. Using analytics to differentiate a product is a way to create a disruptive business model or to create new markets.

To understand disruptive analytics, we must first understand the current state of analytics and its drivers. In the remainder of this chapter, we present a discussion of what drives the demand for analytics, and an overview of the analytics value chain. We close the chapter with an outline of the rest of the book.

The Demand for Data-Driven Insight

The key to survival in a disrupted world is to ruthlessly re-examine business processes, working backward from a problem.

Analytics is the systematic production of useful insight from data. In business, people use insight to solve one of five core problems:

  • Develop a business strategy.

  • Manage a business unit.

  • Optimize a business process.

  • Develop products and services.

  • Differentiate products and services.

Each of these problems needs a different kind of insight, whose delivery requires distinctive people, processes, and tools.

Developing a Business Strategy

We define “strategy” narrowly to mean choices made by the top leadership of an organization: the “C-Suite”. Many people may participate in the development of strategy, but in every organization, the buck stops somewhere. Strategic analytics are any analytics that support strategic decisions.

What makes an issue “strategic”? Strategic questions and issues have four distinct characteristics:

  • The stakes are high; there are major consequences that depend on making the right choice. (Otherwise, the issue will be delegated.)

  • The issue falls outside of existing policy; there is no established rule enabling a decision at a lower level. (There may be a conflict of policies, or the situation may be unprecedented.)

  • Strategic issues are non-repeatable; in most cases, the organization addresses a strategic question once and never again. (Repeatable decisions are handled at lower levels through policy.)

  • There is no clear consensus about the best choice. (If everyone agrees on the best choice from the outset, there is no need for analysis).

Examples of strategic topics include:

  • Technology or product investments

  • Mergers and acquisitions

  • Business portfolio restructuring

  • Business reorganization

  • Branding, rebranding, and product positioning

  • Crisis management

Since the stakes are high for strategic analytics, so is the sense of urgency; some decisions, like merger proposals, may be strictly bounded in time. Crises provoked by product failure, natural disasters, or other issues may have actual life-and-death implications.

Deliverables for strategic analysis include reports, charts, visuals, and presentations. Owing to the high stakes of the decision, executives closely scrutinize the presented analysis. Analysis must be “bullet-proof,” especially if the results do not square with leadership’s prior beliefs. The methods used to produce the analysis must be clear.

Due to the ad hoc and non-repeatable nature of strategic analytics, enterprise data warehouses (EDWs) play at most a supporting role. In most cases, the data in EDWs is internal and supports existing processes with well-defined requirements. The data needed to support strategic decisions often comes from external sources, may be difficult to access, and may be needed only once.

Enterprises frequently engage outside consultants to deliver strategic analysis. While organization insiders may have no experience in a particular type of problem, outside experts have deep experience with similar problems. Firms also prize consultants’ independence and neutrality, since strategic decisions require resolving competing internal interests.

Managing a Business Unit

Managerial analytics support decisions a level down in the organization from top leadership. At this level, needs for analysis link to specific functions, such as Treasury, Product Management, Marketing, Merchandising, Operations, and so forth.

There are three distinct applications for managerial analytics:

  • Performance measurement

  • Performance optimization

  • Business planning

Performance measurement is the sweet spot for enterprise business intelligence (BI) systems. BI is highly effective when the data is timely and credible, reports are easy to use, and metrics align with business objectives. Most organizations want to measure business units in a consistent manner, so they ordinarily implement reporting systems centrally rather than letting business unit managers measure themselves.

Metrics tell the manager which entities (e.g., brands, products, campaigns, stores, and sales reps) performed well and which entities performed poorly. Optimization delivers guidance on how to improve or optimize performance by shifting budget investments. Marketing mix analysis, for example, estimates the revenue impact of spending on different channels and programs, so the organization can shift the marketing budget to its most productive uses.

Finally, business planning is a process of goal setting and goal alignment across functions, where the manager justifies operating and capital spending. In large organizations, the business planning process is highly templated and structured. Forecasting is an important tool for business planning.

Deliverables for managerial analysis are similar to those for strategic analysis. Detailed analysis and forecasts may be delivered as queryable “cubes” or interactive tools.

Optimizing Business Processes

Optimization at this level is much more granular than optimization for functional leadership. In marketing, for example, the CMO needs summary information about the effectiveness of all major programs; the CMO’s optimization problem requires shifting budget among programs. The program manager, on the other hand, seeks to optimally match programs, value propositions, and creative treatments to individual customers and customer segments.

There are many ways that analytics can optimize a business process. Examples include:

  • Automated decision engines

  • Targeting and routing systems

  • Operational forecasting systems

Automated decision engines apply consistent rules designed to balance risks and rewards. Embedded analytics help optimize criteria and ensure that decision rules reflect actual experience. Decision engines are faster than human decision-makers and make better decisions. Examples include payment authorization systems and credit approval systems.
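
To make the pattern concrete, here is a minimal sketch of a decision engine; the rules and thresholds are hypothetical, not drawn from any real issuer’s policy:

```python
# A minimal, hypothetical decision engine for credit approval. It
# combines a model-derived risk score with explicit business rules so
# that every application is evaluated instantly and consistently.

def decide(application, risk_score):
    """Return a decision and a reason; all thresholds are illustrative."""
    if application["requested_limit"] > 50_000:
        return "refer", "limit above auto-approval ceiling"
    if risk_score > 0.20:  # predicted probability of default
        return "decline", "risk score exceeds policy threshold"
    return "approve", "passed all policy rules"

decision, reason = decide({"requested_limit": 10_000}, risk_score=0.07)
print(decision, "-", reason)  # approve - passed all policy rules
```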

Targeting and routing systems evaluate the characteristics of an incoming message or request and direct it to the appropriate agent or subsystem. Analytics extract essential information from the request, eliminating manual evaluation and triage. Examples include e-mail routing systems in customer service operations and suspicious activity report (SAR) investigation routing in bank anti-money-laundering systems.

Operational forecasting systems project key metrics that affect operations, enabling the organization to align resources accordingly. Analytics leverage historical data to detect traffic patterns and shift resources to locations or shifts where they are most needed. Examples include retail staffing systems that plan shifts based on expected floor traffic, and police patrol routing systems that direct officers to projected high-crime areas.

Analytics that optimize business processes are ordinarily embedded in production systems, and usually must operate in real time. This implies a need for streaming analytics, which we cover in Chapter Six. Analytic deliverables are machine-consumable models implemented in software.

Developing Products and Services

The development process in organizations runs the gamut from creative brainstorming to formal scientific research, as in pharmaceutical laboratories, to “skunk works” prototyping. As such, the range of possible analyses is extremely broad. Developmental analytics fall into two broad categories:

  • Analytics for generating hypotheses

  • Analytics for testing hypotheses

Managers perform or commission hypothesis-generating analysis to identify unmet consumer needs or gaps in existing products. This can include activities like analyzing external data, such as consumer surveys and consumption data; analyzing operational data; or evaluating clinical reports of treatment for a certain disease.

At a later stage in the product development process, managers test hypotheses about specific product concepts, prototypes, or small production runs. Analysis at this stage can include analyzing clinical trial data to determine the efficacy of a drug, analyzing test market data to assess the value of a product feature, or similar activities.

For practitioners, specialized domain expertise dominates purely analytical skills in this area. (One would not expect a biomedical researcher who specializes in Parkinson’s disease to easily switch to developing trading algorithms for a hedge fund.) Analytic processes must be highly flexible and agile, adapting to the particular problem at hand at each stage of the product development cycle.

Differentiating Products and Services

We distinguish between analytics that support product development, and analytics that are the product, or embedded analytics.

For the previous four use cases, the “consumer” of insight is inside the organization—a top executive, functional manager, process participant, or product developer. Increasingly, however, analytics provide insight to end consumers outside of the organization. In these cases, analytics differentiate the product and make it stand out in the marketplace.

As the volume and variety of information available to consumers explodes, insight itself becomes a valued commodity. In this world, the most powerful analytic applications often aren’t viewed as analytics at all. Is Google an analytic application? Google uses analytics technology, including content analytics and graph analytics, and it produces a particular kind of insight.

Online retailing’s ability to carry a vastly larger number of unique items than brick-and-mortar retailers creates a shopping problem for consumers; with so many items from which to choose, what should we buy? Recommendation engines, which use machine learning to select optimal products for an individual customer, are widely used. Most readers will be familiar with some salient examples, all of which use machine learning (a minimal sketch of the underlying similarity idea follows the list):

  • Facebook leverages a user’s profile and likes to optimize the news feed.

  • Streaming video sites like Netflix leverage the user’s ratings and other information to personalize recommendations.

  • Tinder pairs users based on profile and “swipes.”

  • Amazon.com uses data-driven similarity ratings to display products that are similar to what a user has selected.

  • Spotify leverages a user’s prior preferences and content analytics to optimize the music stream.
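
The mechanics behind these engines vary by vendor and are mostly proprietary; the following toy sketch illustrates only the core similarity idea, using a made-up user-item matrix:

```python
# Toy item-to-item recommender: two items are "similar" when the same
# users interact with both (cosine similarity on a user-item matrix).
import numpy as np

# Rows are users, columns are items; 1 means the user bought the item.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

norms = np.linalg.norm(interactions, axis=0)
similarity = (interactions.T @ interactions) / np.outer(norms, norms)
np.fill_diagonal(similarity, 0)  # an item should not recommend itself

print("recommend with item 0:", int(similarity[0].argmax()))  # item 1
```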

Success in embedded analytics is a matter of software engineering; the end product must be tightly packaged for reliability and usability; in most cases it must operate in real time.

The Analytics Value Chain

Once we understand the demand for insight, we can define a value chain. The analytics value chain begins with data and ends with insight, progressively transforming data from low value to high value in a sequence of steps, as Figure 1-1 shows.

Figure 1-1. The analytics value chain

Of course, it’s possible to define the value chain at a much finer level of detail than we show here. At a high level, the analytics value chain includes three major components: steps that acquire data, steps that manage data, and steps that deliver insight. Delivering insight to human or machine users is the critical link in the chain; a system that successfully acquires and manages data but does not deliver insight has failed.

Acquiring Data

All data comes from an original source; capturing data from sources is the first step in the value chain. The processes that capture and manage data are variously called Extract/Transform/Load (ETL), Data Integration, or Master Data Management (MDM). ETL refers to the physical movement of data; data integration addresses the challenge of consolidation across sources; and MDM addresses governance and administration of the process. Commercial vendors offer software to manage data flows through the value chain, cleanse the data, and load it into an analytic datastore. According to IDC, Informatica leads the commercial market, followed by SAS and IBM. Talend, Pentaho, and JasperSoft offer open core software, and Apache NiFi is a fully open source project for managing data flows. (We discuss open source software business models in Chapter Three.)

Data Sources

Any system or device that creates data is a potential source for analytics. Most data sources are unsuited to serve as analytic platforms by themselves, for several reasons:

  • Much of the value in analysis comes from integrating data across sources. Few single data sources are sufficiently rich to provide valuable insight by themselves.

  • Production systems and devices rarely retain significant history, truncating data not necessary for immediate transaction processing needs.

  • Production systems and devices are usually designed to support transaction workloads and not analysis workloads.

Data sources are either static or streaming. Static data sources accumulate new data until a user requests data through a query or extract operation. Streaming data sources continuously publish data, “pushing” data to subscribers.
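
The contrast between the two patterns can be sketched in a few lines of Python; the names here are invented for illustration:

```python
# Static ("pull"): data accumulates until a consumer asks for it.
def extract(table, since):
    """Query a static source for rows added after a given timestamp."""
    return [row for row in table if row["ts"] > since]

# Streaming ("push"): the source notifies subscribers as data arrives.
class Stream:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        for callback in self.subscribers:
            callback(event)

stream = Stream()
stream.subscribe(lambda event: print("received:", event))
stream.publish({"ts": 1, "reading": 42.0})  # delivered immediately
```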

Data Extraction

The first step in the value chain is to “extract” data from one or more static source systems. While conventionally called “extraction,” in most cases data is copied rather than extracted.

For streaming data sources, this step is not necessary.

The organization that manages the production system (e.g., the IT organization) rarely permits free access to production systems, for two reasons:

  • The extract operation cannot interfere with transaction processing.

  • Production systems often contain sensitive information that must remain under data security protocols.

Hence, the organization that owns the production system generally controls the extract process and implements the procedure under a service level agreement.

At the beginning of the data warehouse era in the 1980s, the IT organization “owned” virtually all of the prospective data sources for analysis. As we discuss in Chapter Two, the digital transformation of business processes leads to an increasing share of technology spending controlled by functional executives. This, in turn, means that functional executives control the data sources as well.

Another radical change from the early days of data warehousing is the increased use of cloud computing and Software-as-a-Service platforms for production systems. This means that data sources are less and less likely to be physically located on-premises.

Data Cleansing

Data from source systems may be “dirty”: it may be inaccurate, incomplete, or erroneous. Data cleansing software scans incoming data and checks whether items satisfy validity tests and are internally consistent. When the software finds an exception, it either force-cleans the item or queues it to an exception file for human analysis.
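
A simplified sketch of that inspect-and-route logic follows; the validity rules are hypothetical stand-ins for whatever business logic applies:

```python
# Minimal data cleansing pass: validate each record, force-clean what
# can be fixed mechanically, and queue the rest for human review.
exceptions = []

def cleanse(record):
    rec = dict(record)
    rec["state"] = rec.get("state", "").strip().upper()  # force-clean
    if len(rec["state"]) != 2:
        exceptions.append(rec)  # fails validity test: route to exceptions
        return None
    if not 0 < rec.get("age", -1) < 120:
        exceptions.append(rec)  # implausible age: route to exceptions
        return None
    return rec  # passed all validity tests

records = [{"state": " ma ", "age": 44}, {"state": "", "age": 200}]
clean = [r for r in map(cleanse, records) if r]
print(len(clean), "clean,", len(exceptions), "queued for review")
```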

Data cleansing ensures that data conforms to business logic, but it does not ensure accuracy. Verifying accuracy requires comparison to a reference value, which rarely exists outside of laboratory conditions.

Few organizations have the resources to consistently research data cleaning exceptions. In practice, most issues in data are discovered by actual users with subject matter expertise.

Cleaning data in the analytics value chain violates the third of quality guru W. Edwards Deming’s 14 principles[19] of business transformation:

Cease dependence on inspection to achieve quality. Eliminate the need for massive inspection by building quality into the product in the first place.

Rather than inspecting cars at the end of an assembly line and scrapping the ones that fail, it makes much better sense to design quality into the process and build high-quality cars. Similarly, it is much smarter to build data quality directly into the source systems that generate data than it is to trap and correct errors farther down the chain.

Data Structuring

We avoid use of the term “unstructured data” in this book. All data has structure. Some data has structure that is not yet known, and some data is difficult or impossible to map into the entity-relationship (ER) framework that is the foundation of relational databases. Examples of such data include text, audio, video, images, and log files.

In conventional data warehousing practice, the data consolidation process is also a standardization process. This resolves differences in data structure, so that all data conforms to a unified data model—otherwise, it can’t be consolidated.

Some data is structured from inception, because the source system that produces it uses a relational database for storage. If the data model of the source system aligns with the data model that governs structured data in the analytics value chain, the data can be used directly.

However, even if the source data is structured, it may conform to a different data model than the analytics value chain. In this event, the data must be re-structured or mapped into the desired data model.

Some data is semi-structured: the data itself includes information about its structure. In this case, the organization must decide whether to structure the data prior to storing it, or to simply catalogue it and defer structuring until it is used.

Log files can be parsed and structured with special tools. Text, audio, video, and images generally cannot be structured into an ER framework; however, machine learning tools (discussed later in the chapter) can scan content to identify duplicates or classify it into categories.
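
For instance, a web server access log line can be mapped into named fields with a regular expression; this minimal sketch assumes the common log format:

```python
# Parse one line of a web server access log into a structured record.
import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)')

line = ('203.0.113.9 - - [12/Jun/2016:10:15:32 +0000] '
        '"GET /index.html HTTP/1.1" 200 5120')
record = LOG_PATTERN.match(line).groupdict()
print(record["host"], record["status"], record["path"])
```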

Data Consolidation

For insight, organizations consolidate information from many data sources. In most cases, data sources lack a common data structure, for several reasons:

  • Data sources may include systems and devices from different manufacturers that produce data to different standards.

  • Large organizations may have many systems implemented at different times or acquired in mergers and acquisitions.

  • Source systems and devices may produce data that is difficult to map into a relational data model, such as log files.

More recently, with the growth of text, images, audio, and video data, standardization is difficult or impossible. In this environment, “consolidation” simply means the aggregation of files, with structuring postponed to the query phase of analysis.

Managing Data

An analytic datastore is any repository that holds data collected from original sources in a format that facilitates analysis. Every analytics value chain has one or more intermediate datastores variously called data warehouses, data marts, and data lakes. In large, mature organizations, there may be many analytic datastores.

Every analytic datastore should serve three primary purposes:

  • Accumulating history

  • Collecting data across sources

  • Cataloguing and organizing the data

Accumulating history is a key function of the analytic datastore, since primary data sources generally do not perform this function. Data not retained in an analytic datastore is simply lost.

As noted previously, the production of insight generally requires combining data from multiple sources. Consolidating data in a single repository adds value by saving time for the analyst, just as a supermarket saves time for the food shopper, who otherwise would have to make separate trips to the butcher, produce store, bakery, and so forth.

Data that is not catalogued is lost. Imagine an enormous library without an index, where books are simply stacked on shelves at random. Who would use such a library? In an analytic datastore, data is indexed and searchable, and its lineage is documented.

Analytic datastores must also support the organization’s data security policies. Since data preservation is a key priority, they must have backup, restore, and disaster recovery capabilities.

Theorists engage in extended debates about the definition of terms like data warehouse, data mart, and data lake; they also debate the relative merits of each architecture. The debates are academic and a waste of time; no organization can choose an optimal architecture for an analytic datastore in the abstract, without reference to an actual end user.

Of course, when data is created we don’t necessarily know how end users will want to produce insight. In the absence of firm requirements, organizations should simply catalogue and archive data in atomic form at the lowest possible cost, deferring more complex data integration until clear business cases emerge.

Oracle leads the commercial market for software to build analytic datastores, followed by IBM, Microsoft, Teradata, and SAP. The top five vendors control 80% of the market, according to IDC. In Chapter Four, we discuss the Hadoop ecosystem, an open source alternative to the leading commercial platforms.

Delivering Insight

Acquiring and managing data is an essential part of the analytics value chain, but delivering insight produces the most value. Figure 1-2 shows IDC’s forecast of worldwide business analytics software spending by high-level category; about two-thirds of all projected spending is for software that delivers insight to end users. This includes spending on query, reporting, and analysis tools; advanced and predictive analytics tools; spatial analytics tools; content analytics tools; and performance management and analytic applications.

Figure 1-2. Software spending in the analytics value chain

In the sections that follow, we survey three major categories of tools and processes that produce insight: business intelligence, self-service discovery, and machine learning.

Business Intelligence

We can resolve many business issues with simple quantification:

  • How many cases of Product X did we sell in Region Y?

  • How much did each of our sales representatives sell in the first quarter?

  • What was our sales volume by category in each of the past four quarters?

In each case, the question can be addressed by aggregating facts into measures by dimensions. In the first question, for example, the facts are sales transactions; the measure is “number of cases”; and the dimensions are Product and Region.

For questions in this form, queries against relational databases with Structured Query Language (SQL) deliver the needed answer. In Chapter Two, we discuss SQL in its historical context.
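
To make the pattern concrete, the first question above maps directly to a GROUP BY query. The sketch below uses Python’s built-in SQLite driver; the table and column names are invented for illustration:

```python
# Aggregate facts (sales transactions) into a measure (cases sold)
# by dimensions (product, region) with a standard SQL GROUP BY.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, region TEXT, cases INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("X", "East", 10), ("X", "East", 5), ("X", "West", 7)])

query = """SELECT product, region, SUM(cases) AS total_cases
           FROM sales GROUP BY product, region"""
for row in con.execute(query):
    print(row)  # ('X', 'East', 15) then ('X', 'West', 7)
```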

Most business users prefer to interact with data through business intelligence (BI) tools rather than directly through SQL. Business intelligence tools offer a graphical user interface and “business-friendly” views of the data. Behind the scenes, however, BI tools generate SQL or MDX, a competing standard for queries.

Reports are formatted views of data, typically containing many individual items. Usually presented as tables or cross-tabulations, reports combine primary measures with calculated statistics. For example, a report showing the number of sales transactions and their dollar value by region can also show statistics calculated from those measures, such as the average value of a transaction by region, or the percentage distribution of sales across regions.
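
Those calculated statistics are simple transformations of the primary measures. A sketch with toy data, assuming the pandas library:

```python
# From primary measures (count and sum of transactions), a report adds
# calculated statistics: average value and share of total sales.
import pandas as pd

facts = pd.DataFrame({"region": ["East", "East", "West"],
                      "amount": [120.0, 80.0, 50.0]})

report = facts.groupby("region")["amount"].agg(transactions="count",
                                               total="sum")
report["avg_value"] = report["total"] / report["transactions"]
report["pct_of_sales"] = 100 * report["total"] / report["total"].sum()
print(report)
```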

Dashboards are collections of individual measures, reports, and graphical displays that summarize key metrics. Organizations build predefined dashboards to support ongoing initiatives; for example, a customer service operation might develop a dashboard that summarizes many key service quality metrics.

There are three principal applications for quantification in an enterprise. The first of these is performance measurement. After taking an action, managers want to measure its success—or lack thereof. Moreover, managers have an ongoing interest in the performance of their domain, under the premise that “you can’t manage what you don’t measure”[20].

Managers place a premium on accurate, consistent, and timely performance reporting based on well-defined metrics. They also value metrics with a clear tie to the organization’s goals and objectives. Business intelligence tools perform very well for performance measurement, as they excel at delivering consistent and repeatable metrics to a large audience.

The second application is interactive discovery to support program and product development. For this application, questions are less well defined than for performance measurement; the answer to one question raises many other questions, analogous to peeling an onion.

Conventional business intelligence tools perform less well for this application than they do for performance measurement; they tend to be relatively inflexible, better suited to production reporting than agile discovery. OLAP tools designed for dimensional analysis are a little more flexible than reporting tools, but business users with high needs for interactivity may work directly with SQL.

The third application—business planning—requires forecasting as well as historical analysis. Most business intelligence tools support simple time series analysis, which is sufficient for many managers. In other cases, managers may integrate historical data from a business intelligence tool with forecasts developed by specialists, combining the two sets of values in a spreadsheet or presentation tool.

While queries, reports, and dashboards are powerful tools, they are limited to low-dimensional problems where the question can be addressed within the framework of facts, measures, and dimensions. Dimensionality is a key issue for these tools. An analyst can easily work with reports showing data in one, two, or three dimensions; with graphics, four and even five dimensions are feasible. With more than five dimensions to consider, the analyst must break the problem into separate low-dimensional analyses; the number of possible combinations rises exponentially as the number of dimensions increases. With ten candidate dimensions, for example, there are already 120 distinct three-way views to examine.
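
A quick calculation illustrates the problem; the counts below are a sketch, since the exact workload depends on how views are defined:

```python
# How many distinct views must an analyst examine as dimensions grow?
from math import comb

for d in (5, 10, 20, 40):
    three_way = comb(d, 3)  # distinct three-dimensional views
    subsets = 2 ** d        # all possible combinations of dimensions
    print(f"{d} dimensions: {three_way:>6} three-way views, "
          f"{subsets:>16,} subsets overall")
```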

Self-Service Discovery

Conventional business intelligence tools are too inflexible to support interactive discovery. Self-service discovery tools, on the other hand, are ideally suited to this application.

While sometimes called “visualization” tools, the charting and graphics capabilities of tools in this category are no better than those of many other analytic software packages on the market. These tools have three outstanding features:

  • Simplified user interface that is easy to learn and use

  • Basic charting and graphics functionality that aligns well with what most managers need

  • Flexible “back end” that simplifies connection to many different data sources

Among commercial vendors, Tableau Software and Qlik are the market leaders. Microsoft Power BI and SAP Lumira also score very well in analyst evaluations[21].

We cover self-service analytics in more detail in Chapter Nine.

Machine Learning

Machine learning is a set of algorithms and a discipline that governs how to use them. Machine learning identifies patterns in data that are inaccessible to a human user and produces output in human- or machine-consumable form.

There are many techniques for machine learning: hundreds of algorithms, and thousands of software implementations of those algorithms. We discuss machine learning at a managerial level here and in Chapter Eight. For technical treatment of the subject, there are many excellent books[22] on the machine learning discipline as a whole, and on individual techniques.

Data scientists distinguish between techniques for supervised and unsupervised learning. Supervised learning techniques require training data where the outcome we wish to predict is known. For example, if we want to predict which prospects will respond to a campaign, we need data for prospects targeted by the campaign showing whether or not they responded.

Supervised techniques provide powerful tools for prediction and classification problems. In classification problems, the outcome we wish to predict is categorical, such as response or no response. In prediction problems, the outcome we wish to predict is an amount, such as a customer’s future spending.
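
A minimal supervised example follows; it assumes the scikit-learn library, and synthetic data stands in for a real campaign file:

```python
# Supervised learning: fit a classifier on prospects whose response to
# a past campaign is known, then score new prospects.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", round(model.score(X_test, y_test), 3))
print("response odds, first prospect:",
      round(model.predict_proba(X_test[:1])[0, 1], 3))
```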

Frequently, however, we do not know the “ultimate” outcome of an event. For example, in some cases of fraud, we may not know that a transaction is fraudulent until long after the event. In this case, rather than attempting to predict which transactions are frauds, we might want to use machine learning to identify transactions that are unusual and flag these for further investigation. We use unsupervised learning when we do not have prior knowledge about a specific outcome, but still want to extract useful insights from the data[23].
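
A sketch of that unsupervised approach, again with scikit-learn and synthetic data; an isolation forest is one of several algorithms that could be used here:

```python
# Unsupervised learning: flag unusual transactions for investigation,
# with no labeled examples of fraud required.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
typical = rng.normal(loc=50, scale=10, size=(500, 2))  # ordinary amounts
odd = np.array([[500.0, 3.0], [490.0, 2.0]])           # two outliers
transactions = np.vstack([typical, odd])

detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(transactions)  # -1 marks unusual rows
print("flagged for review:", np.where(flags == -1)[0])
```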

While some machine learning techniques tend to consistently outperform others, it is rarely possible to say in advance which one will work best for a particular problem. Hence, most data scientists prefer to try many techniques and choose the best model. For this reason, high performance is essential, because it enables the data scientist to try more options and build the best possible model[24].
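
In code, that try-many-and-compare workflow looks roughly like the following sketch; the three candidate techniques are arbitrary choices, and any scorable model could be substituted:

```python
# Try several techniques on the same data; keep the best performer.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=1)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(random_state=1),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
print(scores)
print("selected:", max(scores, key=scores.get))
```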

The potential applications for machine learning in organizations are highly diverse. For supervised learning, the three most common use cases are:

Prediction. Estimating the incidence or value of a measure that is unknown because it takes place in the future. For example, a bank seeks to predict the odds that a borrower will repay a loan during its term when evaluating an application; a retailer seeks to predict store traffic next week when scheduling staff. The temporal dimension, the element of time, plays a key role.

Organizations use prediction to support operational decisions on a large scale. Modern credit card operations, for example, are only possible because issuers can make rapid decisions to approve credit lines and authorize transactions. Such operations depend on predictive models developed with machine learning.

Inference. Estimating the odds or amount of an unknown measure that is not a future event. For example, a retailer seeks to determine the ethnicity of its customers through analysis of surnames, street addresses, and purchase behavior.

Attribution. Disaggregating the contribution of many factors to a desired outcome. For example, an ecommerce vendor seeks to determine how ad exposures impact sales; a sports team seeks to measure the contribution of each player to winning games. Executives rely on attribution for managerial and strategic decisions to allocate budgets, continue or discontinue programs, and similar decisions.
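
A toy attribution sketch follows, using a linear regression on synthetic spend data; real marketing-mix models are considerably more elaborate:

```python
# Attribution: disaggregate each channel's contribution to sales by
# regressing outcomes on the factors believed to drive them.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
spend = rng.uniform(0, 100, size=(200, 3))   # TV, search, e-mail
true_effect = np.array([3.0, 1.5, 0.5])      # unknown in practice
sales = spend @ true_effect + rng.normal(0, 10, size=200)

model = LinearRegression().fit(spend, sales)
for channel, coef in zip(["tv", "search", "e-mail"], model.coef_):
    print(f"{channel}: roughly {coef:.2f} sales per dollar of spend")
```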

There are numerous applications for machine learning in content analytics (a minimal sketch follows the list):

  • Text processing applications extract features from text for visualization or inclusion in predictive models.

  • Machine learning can match documents to detect duplicates or identify plagiarism.

  • Image processing can classify images into categories, detect malignant tumors in cancer screenings, and so forth.
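
As a minimal sketch of the first bullet, TF-IDF is one common way to turn raw text into model-ready features; the documents here are invented:

```python
# Turn raw text into numeric features that can feed a predictive model
# or a duplicate-detection comparison.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the quarterly report shows rising sales",
        "quarterly sales are rising, the report shows",
        "patient responded well to the new treatment"]

features = TfidfVectorizer().fit_transform(docs)
print(features.shape)  # documents x vocabulary terms
print(cosine_similarity(features[0], features[1]))  # near-duplicates score high
print(cosine_similarity(features[0], features[2]))  # unrelated texts score low
```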

While SAS and IBM combined control[25] a little less than 50% of the commercial software market in machine learning, the market as a whole is less concentrated than elsewhere in business analytics. This is largely due to rapid innovation in machine learning, and overall rapid expansion in the number of potential applications.

Cloud-based services from Amazon Web Services, Microsoft, and Google have the potential to disrupt the established leaders; we discuss them in Chapter Seven. Open source offerings like R, Python, Spark, and H2O are gaining users at the expense of commercial vendors; we discuss them in Chapters Three, Five, and Eight.

Overview of the Book

Chapter Two is a short history of business analytics. It covers the last 50 years of innovation in analytics, to provide context for innovations currently impacting the analytics value chain.

In Chapter Three we cover the open source business model, including licensing and distribution. Today, there are open source options everywhere in the value chain, and enterprise adoption is on the rise.

Chapter Four covers the Hadoop ecosystem. Due to the importance of SQL processing in analytics, we also cover open source SQL engines in this chapter.

In Chapter Five, we document the rapidly declining cost of computer memory and the corresponding rise of large-scale in-memory computing, including in-memory databases and Apache Spark.

Chapter Six is a survey of streaming analytics. We include a brief history of streaming analytics for context, and introduce the reader to open source streaming platforms.

Cloud computing and the elastic business model are disrupting the software industry. The elastic business model is especially appropriate for analytics. We survey analytics in the cloud in Chapter Seven.

We briefly summarized machine learning in this chapter. In Chapter Eight, we discuss recent innovations in machine learning, with special emphasis on Deep Learning.

In Chapter Nine, we cover self-service analytics. Tableau’s self-service model is one of the best examples of disruption in the analytics value chain.

Finally, in Chapter Ten, we offer a manager’s handbook for disruptive analytics. We survey the key requirements—people, process, and tools—needed to build a platform for disruption.

Footnotes

[18] Christensen, Clayton M. (1997). The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail. Boston: Harvard Business School Press. ISBN 978-0-87584-585-2.

[19] Deming, W. Edwards (1986). Out of the Crisis. MIT Press.

[22] For example, Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, Springer (2011); Provost and Fawcett, Data Science for Business, O’Reilly Media (2013).
