9. Self-Service Analytics

Hype and Reality

Thomas W. Dinsmore

Newton, Massachusetts, USA

About 15 years ago, the author attended a sales presentation by a vendor touting an automated predictive analytics tool. The sales rep’s value proposition: “Buy our software and you can fire your SAS programmers.”

Unfortunately for that sales rep, every customer in the room was a SAS programmer.

The story underscores a basic problem for those who believe that software can democratize analytics: the people who care the most about analytics, and are most passionate about it, are not afraid to learn analytic programming languages like SAS, Python, and R.

Moreover, because they are accountable for the veracity and validity of what they deliver, they demand tools that give them control over the entire process. That is why experts on the analytics job market insist that coding skills are absolutely necessary to do the job.1

So much for the Citizen Data Scientist idea.

Commercial vendors have touted their analytics software as “easy to use” or “self-service” for three decades, and yet adoption of tools other than Microsoft Excel remains low. There are several reasons for this:

  • Many managers feel little need for, or interest in, analytics.

  • For managers who do value analytics, it’s relatively easy to delegate the hands-on work.

  • Making software easy to use is one thing; making data easy to access and navigate is an entirely different matter. After two decades of data warehousing, enterprise data remains messy, incomplete, or irrelevant to managers’ questions.

Meanwhile interest in analytic programming languages is booming, and the job market for data scientists is robust. That is because serious problems in analysis require serious people to perform them—people who are motivated to dig deeply into data.

In this chapter, we review the logic of self-service analytics—where it makes sense, and where it doesn’t. We describe the distinctly different user personas in organizations and discuss the role of experts in advanced analytics. We close the chapter with a survey of six innovations in self-service analytics.

The Logic of Self-Service

Vendor-driven discussions of self-service analytics often inhabit a magical world where everyone has the same skills and interests. Managers are best served by a realistic view of the actual user personas in their organization, and the use cases where self-service analytics make sense.

The User Pyramid

In most organizations, people vary widely in their analytic skills—from those with little or no skill on one end of the spectrum, to world-class experts on the other. Many things cause these differences: background, education, training, organization role, and intrinsic motivation.

In large organizations, analytics users tend to form a “pyramid,” as shown in Figure 9-1. Those with the least analytic skill, whom we label “consumers,” are by far the largest group; there can be thousands of them. Note that the pyramid reflects purely analytic skill; people can have highly advanced skills in other areas but limited training or interest in the analysis of data.

Figure 9-1. The user pyramid

Analytics experts, on the other hand, tend to be few in number. However, they drive disproportionate value through analytics.

Commercial software vendors tend to target casual users with “easy to use” applications. Casual users are the largest audience, offer the potential to sell more “seats,” and are less loyal to existing tools. Power and expert users, on the other hand, tend to be very loyal to their existing tools. They have invested years to develop skills, have mastered their chosen tool, and are not attracted to “easy to use” tools.

Roles in the Value Chain

Serious discussions about analytics in the organization should begin with the recognition that users have diverse needs and are not all the same. This may seem obvious, but one frequently hears vendors speak of their tools as complete solutions for the enterprise, as if all users have the same needs.

A user persona is a model that describes how a class of user interacts with a system. Of course, every user is different and so is every organization. There are startups in Silicon Valley where “business users” work actively with SQL and Python; there are also companies where “business users” struggle with Microsoft Excel. We present this model not to stereotype people, but as a framework for managers to understand the needs of their own organizations.

Experts

Expert users are highly skilled in analytic software and programming languages. They spend 100% of their time working on analytics. They see analytics as a career choice and career path, and have invested in the necessary education and training. Their titles and backgrounds vary across and within industries.

Developers. These users have in-depth training in an organization’s data and software, and their primary role is to develop analytic applications. In addition to technical training, developers have a thorough understanding of an organization’s data sources, are able to write complex SQL expressions, and are trained in one or more programming languages. While traditionally domiciled in the IT organization, business units may employ their own developers when they want to control prioritization and queuing.

Data Scientists. Data scientists are individuals whose primary role is to produce insight from complex data and to develop predictive models for deployment in applications. Their background and training encompasses programming languages, statistics, and machine learning. Data scientists tend to come from an engineering or computer science background, and they prefer to work with programming languages such as Scala, Java, or Python. They tend to have a strong preference for open source analytics; the best data scientists actively contribute to one or more open source projects and may participate in data science competitions.

Analytic Specialists. The analytic specialist holds a position such as statistician, biostatistician, actuary, or risk analyst and often holds a degree in an academic discipline with historical roots in advanced analytics. They understand statistics and machine learning, and they have considerable working experience in applied analytics. Analytic specialists prefer to work in a high-level analytic programming language such as SAS or R rather than in software packages with GUI interfaces. Their work product may be a management report, charts and tables, or a predictive model specification.

Analysts

These users are highly skilled in the use of end user software for analytics, such as Excel, SPSS, or Tableau. They use analytics actively in their work, but they do not create production applications. They tend to identify with a business function, such as marketing or finance, and see analytics as a means toward that end.

Strategic Analysts. Strategic analysts’ primary role is to perform ad hoc analysis for senior executives; they may be domiciled within a business unit or within a team dedicated to C-team support. Strategic analysts know their organization and industry well, and they are familiar with data sources. They are able to perform simple SQL queries and to use SQL together with other tools. They prefer tools with a graphical user interface. Strategic analysts’ work product leans toward charts, visuals, and storytelling.

Functional Analysts. The primary role of functional analysts is an analytic job function, such as credit analyst or marketing analyst. These roles require some analytic skill as well as domain knowledge. Functional analysts prefer tools that are relatively easy to use, with a graphical interface. They may have some training in statistics and machine learning. Like strategic analysts, they may be able to perform simple SQL queries. They are skilled with Microsoft Office and prefer to work with tools that integrate directly with Office. Functional analysts’ work product may be a spreadsheet, a report, or presentation.

Consumers

Information consumers have minimal tool-related skills and prefer information presented in a form that is easily retrieved and digested.

Business Leaders. Business leaders are keenly interested in the organization’s performance metrics, which they require to be timely, accurate, and delivered to a mobile device or browser. They may be interested in some limited drill-through capabilities, but rarely want to spend a great deal of time searching for information.

Information Users. Information users are employees who need information to perform a specific job role, such as handling customer service calls, reviewing insurance claims, performing paralegal tasks, and so forth. Their role in a business process defines their needs for information. While the information user may not engage with mathematical computation, they are concerned with the overall utility, performance, and reliability of the systems they use.

The Role of Experts

Software vendors tout their products as “easy to use”. This is not new. In the 1980s and 1990s, analytics vendor SPSS positioned its Windows-based interface as the easy alternative to SAS, which did not offer a comparable UI until 2004, when it introduced SAS Enterprise Guide. In the 1990s and early 2000s, Cognos claimed to target the business user in contrast to more complex products like Business Objects and MicroStrategy.

Vendor claims to the contrary, self-service analytics is an elusive goal. Large enterprises have thousands of users for their BI tools, but the vast majority are “consumers” who use the information contained in reports, views, or dashboards developed by specialists. Most organizations still maintain specialist teams whose sole responsibility is developing reports or OLAP cubes for others to use.

There are two main reasons for the persistence of BI specialists:

Consistency. Measuring performance remains the leading use case for business intelligence. Most organizations want consistent measurement across functions and do not want teams to measure themselves, or “game” the metrics.

Data Integration. Few organizations have achieved the data warehousing “nirvana” envisioned by theorists. While BI tools have well-designed user-facing “interfaces,” the “back-end” that integrates with data sources remains as messy and complicated as the sources themselves.

Expanding use of Hadoop has exacerbated the data integration problem for BI, at least temporarily. Most conventional BI platforms did not work with Hadoop 1.0. Startups like Datameer and Pentaho tried to fill this vacuum, with limited success; but specialized tools just for Hadoop are unsatisfactory for enterprises seeking to standardize on a single BI platform.

Hadoop poses a problem even for relatively skilled users. An analyst accustomed to working interactively in SQL on a data warehouse appliance will struggle to perform the same analysis in Hive or Pig on Hadoop. Hadoop’s tooling for interactive queries has greatly improved in the last several years, as documented in Chapters Four and Five; even so, companies that invested early in Hadoop added new layers of specialists with the necessary skills.

Predictive analytics also remains the domain of specialists even though easy-to-use tools have been available for years. This is especially so in strategic and “hard-money” applications, such as fraud detection and risk management, where the quality of a predictive model can mean the difference between business success and business failure.

Analytic experts provide executives with what auditors call the attest function2—an independent certification that the analysis is correct and complete. For predictive models, the expert attests that the model predicts well, minimizes false positives and false negatives appropriately, and does not encourage adverse selection. Few executives have the necessary training and knowledge to verify the quality of complex analysis by themselves. The need for independent attestation is a primary reason that organizations outsource strategic analysis projects to consultants.

In short, organizations do not employ experts and specialists for their skill with analytic programming languages. They employ them primarily for their domain expertise, for their ability to take ownership for the analysis they provide, and for their willingness to be held accountable for its validity.

Drawing an analogy to medicine, robots are now able to perform the most complex heart surgery. This does not eliminate the need for cardiologists, although it may change the nature of the job. Patients are not likely to entirely entrust a diagnosis to a machine, nor do machines have the bedside manner that patients value.

A Balanced View of Self-Service

While performance measurement and strategic analytics will remain in the hands of experts and specialists, self-service analysis makes the most sense for two use cases: discovery and business planning.

Discovery and Insight. Before initiating action, managers want to understand the basic shape of a problem or opportunity. This naturally leads to questions such as:

  • How many customers do we have in New Jersey?

  • How many shoppers bought our brand of dog food last week?

  • What is the sales trend for our stores in the Great Lakes region?

Discovery is ideally suited to self-service analysis for three reasons. First, handing the work to a specialist slows the process down. Specialists are often backlogged; a task may only take the specialist an hour to complete, but due to previous requests in the queue, the requestor waits a week for results.

Second, handing the analysis to a specialist creates potential misunderstandings. The requestor must prepare a specification; requests grow increasingly detailed, and specialists fall into the habit of doing exactly what was requested rather than developing insight into the problem the requestor is trying to solve.

Third, discovery is iterative by nature; the answer to one question prompts additional questions. The specialist model discourages iteration, since every cycle requires another request, another wait in the queue, and more potential for misunderstandings.

From a functional perspective, interactive queries and visualization are the key requirements for discovery tools. A flexible back-end, with easy integration to many different data sources, is absolutely necessary. Additional capabilities managers may require for discovery include simple time series analysis, simple predictive modeling, basic content analytics, and a mapping capability.

Business Planning. After a manager identifies an opportunity, the next step is action planning; this, in turn, raises many tactical questions. For example, once a decision is made to conduct a marketing campaign among current customers to stimulate purchase of a particular product, managers may want to know:

  • How many customers purchase this product?

  • What is the average purchase volume among these customers?

  • How many customers purchased the product in the past twelve months but not the past three months?

As with discovery, managers assign a premium to speed and self-service as they develop business plans. In planning, however, managers need quantification and numerical analysis more than simple visualization. Interesting trends and hypotheses are less interesting than hard numbers at this point.

Predictive analytics play a greater role in business planning. The manager is concerned with forecasting the impact of a particular decision:

  • What is the expected response and conversion rate?

  • What credit losses can we expect?

  • If we do nothing, what attrition rate do we expect?

Hence, self-service tools for predictive analytics can play a role in business planning, provided that they are fully transparent and “idiot-proofed”. An “idiot-proof” tool has built-in constraints and guidelines that prevent a naive user from developing spurious insights.

Innovations in Self-Service Analytics

In the section that follows, we discuss six self-service innovations:

  • Data visualization: Tableau combines a simple interface for basic visualization with a powerful data access engine.

  • Data blending: Alteryx and ClearStory Data take very different approaches to the problem of blending data from diverse data sources.

  • BI on Hadoop: AtScale and Platfora demonstrate two distinct ways to deliver BI on Hadoop.

  • Insight-as-a-Service: Domo helps executives avoid the IT bottleneck.

  • Business user analytics: KNIME, RapidMiner, and Alpine offer distinct approaches to scalable analytics for business users.

  • Automated machine learning: DataRobot delivers an automated machine learning platform.

In previous chapters, we highlighted open source tools. All of the software we profile in this chapter is commercially licensed. In each of the six categories, we profile one or more vendors that exemplify the innovation.

Data Visualization

A picture is worth a thousand words.

Data visualization seeks to discover, communicate information, and persuade through statistical graphics, plots, and infographics. As a discovery tool, visualization helps the analyst quickly identify meaningful patterns in data that would be difficult to find through numerical analysis alone. As a communications tool, visualization makes complex relationships clear and comprehensible. Visualization is a powerful tool for persuasion.

Visualization is also a great way to lie and mislead people. Data visualization is no more “scientific” than its user intends it to be. Every manager should understand the rhetoric of visualization if only to identify deception.

Like most things in the world of analytics, visualization is hardly new. Statisticians have long understood the value of data visualization. In 1977, John Tukey, founding chair of the Statistics faculty at Princeton University, introduced the idea of the box plot. Box plots, shown in Figure 9-2, are a convenient way to compare multiple data distributions.

Figure 9-2. Box plot

Edward R. Tufte, professor of Political Science, Statistics, and Computer Science at Yale University, published The Visual Display of Quantitative Information in 1982. Tufte self-published the book due to lack of interest from publishers; today, it remains a best seller on Amazon.com. Tufte drew on examples of visualization dating back to 1686, noting that many of the best visuals pre-date the computer era, when artists drew graphs by hand.

Stephen Few, a data visualization consultant and author of Information Dashboard Design, argues3 that there are just eight types of visual messages:

  • Time series charts, with values showing change through time

  • Ranking of values ordered by quantity

  • The relationship of parts to the whole

  • The difference between two sets of values, such as forecast revenue and actual revenue

  • Counts or frequencies of values by interval

  • Comparison of two paired sets of values to show correlation (or the lack thereof)

  • Nominal comparison of values for a set of unordered items

  • Geospatial depiction of data, where values or measures are displayed on a map

In preliminary data analysis, statisticians and predictive modelers use scatterplots and frequency distributions to understand relationships in the data. Scatterplots show linear and non-linear relationships between two variables, which can expedite model development. Simple graphics are also an excellent data quality check, as an analyst can instantly identify problems in the data from a few visuals.
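To make this concrete, the sketch below shows how an analyst might produce these preliminary views in Python with pandas and matplotlib. The file name and column names are hypothetical; any comparable tool would serve equally well.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical claims extract with a numeric outcome and one candidate predictor.
df = pd.read_csv("claims.csv")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Frequency distribution: reveals skew, spikes, and impossible values.
df["claim_amount"].hist(bins=30, ax=ax1)
ax1.set_title("Distribution of claim amounts")

# Scatterplot: shows the shape of the relationship and flags outliers.
df.plot.scatter(x="policy_age", y="claim_amount", ax=ax2)
ax2.set_title("Claim amount vs. policy age")

plt.tight_layout()
plt.show()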

Statisticians and market researchers use visualization to convey complex findings. Correlations among many variables can be hard to interpret when presented as a table of numbers; presented as a heat map, as shown in Figure 9-3, patterns are easier to grasp.

Figure 9-3. Correlation heat map created in RapidMiner
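The heat map in Figure 9-3 was produced in RapidMiner, but the idea is tool-agnostic. A minimal Python sketch follows, assuming a hypothetical file that contains only numeric customer metrics.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical extract containing only numeric customer metrics.
df = pd.read_csv("customer_metrics.csv")
corr = df.corr()

# Render the correlation matrix as a heat map instead of a table of numbers.
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="correlation")
plt.tight_layout()
plt.show()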

Techniques like decision trees are popular because they are easy to visualize. Figure 9-4, for example, shows the characteristics of people who survived the Titanic sinking: women in the top two classes, younger women in third class, and boys under 13 survived at a much higher rate than other passengers.

Figure 9-4. Titanic survivor decision tree (Source: RapidMiner)
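The tree in Figure 9-4 comes from RapidMiner, but a similar tree can be grown in a few lines of Python. The sketch below uses the Titanic sample data bundled with the seaborn package and scikit-learn's decision tree, purely for illustration.

import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, export_text

# Titanic passenger data shipped with seaborn; drop rows with missing ages.
titanic = sns.load_dataset("titanic").dropna(subset=["age"])

X = titanic[["pclass", "age"]].assign(is_female=(titanic["sex"] == "female").astype(int))
y = titanic["survived"]

# A shallow tree keeps the rules readable, much like the figure.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))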

Demonstrating the importance of visualization, SAS added SAS/GRAPH to its statistical software in 1980. SAS/GRAPH was the first new functional extension added by SAS. While batch oriented, it was extremely powerful; a SAS programmer could create hundreds of graphs with a few lines of code. As a first step in a project after importing data, an analyst could use charts to comprehensively examine potential predictors and create a plan for subsequent analysis.

The business intelligence vendors that emerged in the 1990s, including Business Objects, Cognos, and MicroStrategy, all included graphics and visualization capabilities in their products together with reporting and dashboarding tools.

Graphics also figure strongly in the appeal of open source R and its ecosystem of packages. R users can choose from a wide range of standard and specialized plots, build graphical applications, and publish interactive graphics. The ggplot2 package, introduced in 2007, has attracted many new users to R because it is relatively easy to use and versatile.

It is impossible to discuss the recent emergence of self-service visualization without mentioning Tableau. Researchers at Stanford University founded Tableau Software in 2003 to commercialize an innovative visualization tool they had developed at the university. Tableau grew steadily, reaching revenues of $62 million in 2011; then it took off, growing revenues more than tenfold to $654 million in 2015.

In so doing, Tableau passed Actuate, Hexagon, Panorama Software, Autodesk, TIBCO, Information Builders, ESRI, Qlik Tech, and MicroStrategy to take the sixth position on IDC’s Business Intelligence and Analytics Tools Software ranking. Tableau also passed industry stalwarts FICO, Infor, Adobe, and Informatica to assume eighth place in IDC’s overall business analytics software industry ranking.

Tableau went public in May 2013 at a valuation of $3 billion. From its founding to its IPO, Tableau created4 more value for its investors than all but three other startups from 2009 through 2014.

Tableau’s success is surprising when you consider that its graphical capabilities are no greater than most competing business intelligence tools, and considerably less than what a user can do in SAS and R. The key differences between Tableau and other tools are simplicity and data source connectivity. Conventional BI tools are designed to integrate with an organization’s data warehouse. While their graphics capabilities are not difficult to use, they require complex configuration for each data source. This makes them inflexible, requiring a high level of skill for ad hoc analysis.

For tools like SAS and R, visualization is a two-step process: one step to retrieve data, the second to create visuals. While these tools are powerful and highly flexible, they require expertise to use successfully. Tableau’s core innovation5 is a query language called VizQL, or Visual Query Language. VizQL combines SQL and a language for rendering graphics, so that ad hoc visualization requires only a single step. Tableau combines its query language with connectivity to an extraordinarily large collection of data sources, including Microsoft Excel and Access; text files; statistical files from SAS, SPSS, and R; relational databases; NoSQL datastores; Hadoop; Apache Spark; enterprise applications, including SAP and Salesforce; Google Analytics; and many others.

Arguably Tableau is successful not because it does so much, but because it does a few things very well, and the things it does well are exactly what users need. Tableau’s simplicity makes it easy to use. Combined with its powerful data source connections, Tableau works very well as an ad hoc discovery tool in diverse and complex data. Under conventional data warehousing theory, this use case should not exist, since data warehousing theory calls for the consolidation of data into a single datastore. Tableau’s extraordinary business success demonstrates the degree to which conventional data warehousing theory no longer applies.

Data Blending

Data blending tools enable a business user to blend and cleanse data from multiple sources. They come with rich facilities to access disparate data sources, select data, transform the data, and combine it into a single dataset for analysis. Most have some capability to analyze the blended data as well.

According to data warehousing theory, there should be no need for end user data blending tools; in principle, data warehousing processes should perform all of the necessary processing steps, presenting the end user with data that is already cleansed, standardized, and in the form needed for analysis. In practice, that ideal is rarely achieved, for several reasons:

  • In many organizations, budget constraints prevent the data warehousing team from keeping up with the explosion of data.

  • Even well-funded data warehousing teams have substantial backlogs, leading to extended delays in bringing new sources into the warehouse.

  • Many analyses are ad hoc and do not warrant investment in permanent data warehousing feeds.

As an example of the last point, many marketing programs use external vendors, and the campaign may only run once or twice. A marketing analyst seeking to prepare an analysis of the campaign must merge data provided by the external vendor with data from the organization’s data warehouse to prepare a complete report.
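A hedged Python sketch of that blending task appears below; the file names, join key, and fields are hypothetical, and a tool like Alteryx would accomplish the same thing through a visual workflow rather than code.

import pandas as pd

# Campaign responses supplied by the external vendor (hypothetical file).
responses = pd.read_csv("vendor_campaign_responses.csv")

# Customer attributes extracted from the warehouse (hypothetical file).
customers = pd.read_csv("warehouse_customer_extract.csv")

# Blend the two sources on a shared key and summarize by segment.
report = (
    responses.merge(customers, on="customer_id", how="left")
             .groupby("segment")
             .agg(responders=("customer_id", "nunique"),
                  revenue=("order_value", "sum"))
)
print(report)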

A number of startups offer data blending tools, including Alteryx and ClearStory Data.

In 1997, three entrepreneurs founded a consultancy branded as SRC LLC; the company offered custom solutions for mapping and demographic analysis. Two years later, SRC won a bid to be the technology provider for the U.S. Bureau of the Census; over the next several years, the company developed several new software products for geospatial analysis.

SRC launched Alteryx in 2006. Alteryx, a software package offering a unified environment for the analysis of spatial and non-spatial data, simplified the task of blending data from multiple databases, streamlined spatial analysis, and enabled users to publish integrated reports with maps, charts, tables, and graphs.

In 2010, the SRC founders rebranded the company as Alteryx Inc. to focus exclusively on this product6. Alteryx has raised a total of $163 million in three rounds of venture capital; the most recent round of funding, for $85 million, closed in October 2015.

As of June 2016, Alteryx Analytics is in Release 10. The Alteryx Designer environment enables a business user to build workflows to prepare, blend, and analyze data from a wide range of sources and data types, including:

  • Data warehouses and relational databases

  • Cloud and enterprise applications

  • Hadoop and NoSQL datastores

  • Social media platforms

  • Packaged data from suppliers like Experian, Dun & Bradstreet, and the U.S. Bureau of the Census

  • Microsoft Office and statistical software packages

For deeper analysis, Alteryx offers basic descriptive and predictive analytics built in open source R, as well as geospatial analytics. Users can export analysis in Microsoft Office formats, Adobe PDFs, HTML, and other common formats. Alteryx interfaces with leading visualization tools, such as Tableau, Qlik, Microsoft Power BI, and Salesforce Wave.

Figure 9-5 shows a view of the Alteryx Designer desktop.

Figure 9-5. Alteryx Designer

The Alteryx Server edition runs on Microsoft Windows Server. Alteryx Server supports scalable analytics through push-down SQL, which transfers user requests to the datastore for native execution without data movement. Release 10 supports push-down SQL for Amazon Redshift, Apache Hive, Apache Impala, Microsoft SQL Server, Azure SQL Data Warehouse, Oracle Database, Apache Spark SQL, and Teradata Database.
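Push-down SQL is simply the practice of sending the work to the data rather than the data to the work. The generic Python sketch below illustrates the idea with SQLAlchemy and pandas; the connection string and table are hypothetical, and a tool with push-down support generates the equivalent SQL behind the scenes.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse connection; any SQLAlchemy-supported database works the same way.
engine = create_engine("postgresql://analyst@warehouse.example.com/sales")

# The GROUP BY runs inside the database; only the small summary crosses the network.
summary = pd.read_sql(
    "SELECT region, SUM(revenue) AS revenue FROM orders GROUP BY region",
    engine,
)
print(summary)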

ClearStory Data takes a very different approach to data blending. Founded in 2011, ClearStory Data released its ClearStory product to the market in 2013. ClearStory is an in-memory visualization and collaboration application combined with an inference engine and data blending capability. The platform runs exclusively on Apache Spark (discussed previously in this chapter and in Chapter Two).

Building on Spark’s data-ingestion capabilities, ClearStory provides organizations with the ability to integrate disparate internal and external data sources. Supported internal sources include:

  • Relational databases, including Oracle, SQL Server, Amazon Redshift, MySQL, and PostgreSQL

  • Hadoop

  • Files in a variety of formats

  • APIs for enterprise applications, such as Salesforce

Through partnerships with data providers, ClearStory provides a number of predefined external data sources:

  • Demographic data, including location-specific U.S. Census data

  • Firmographic data about businesses from a variety of providers

  • Market and sales intelligence data, including media spending and sales by product category

  • Macroeconomic data for measures such as GDP, inflation, unemployment, and commodity prices

  • Social media data from Twitter and other platforms

  • Weather data

Once a data source is registered with ClearStory, the application’s data inference engine gathers key statistics profiling the shape of the data, as well as information about its structure and semantics. When a user requests analysis, ClearStory uses this information to recommend additional data based on the problem the user is trying to solve. ClearStory’s data blending engine matches data with common dimensions, enabling the user to combine data from disparate sources.

As of late 2015, investors had provided ClearStory Data with $30 million in venture capital. The most recent7 round, in March 2014, was a $21 million Series B funding led by DAG Ventures, with Andreessen Horowitz, Google Ventures, Khosla Ventures, and Kleiner Perkins Caufield & Byers participating.

BI on Hadoop

As noted in Chapter Four, organizations are investing heavily in Hadoop. Hadoop’s significantly lower costs compared to traditional data warehouses make it an attractive alternative, especially for data whose value is not yet established.

However, Hadoop is much harder to use than traditional data warehouses. For end users accustomed to using business intelligence tools with a data warehouse, Hadoop is almost impossibly difficult to use. Even tools like Hive and Pig, which are easier to use than MapReduce, are only suitable for an advanced user.

As the volume of data residing in Hadoop expands, there is a growing need for business user tools that can work with the data. AtScale and Platfora are two startups with very different approaches to this problem. AtScale delivers a middle layer that enables existing business intelligence tools to work with Hadoop data. Platfora, on the other hand, creates a dedicated data mart to support its own end user tools. We discuss these two startups next.

Founded in 2013 by Yahoo veterans, AtScale emerged from stealth in April 2015; simultaneously, it announced a $7 million “A” round of funding. Unlike the other BI startups profiled in this chapter, AtScale does not offer its own BI end user client. Instead, AtScale operates on the principle that most organizations already have BI tools in place, so it works in the background to make these tools work with a Hadoop datastore.

In theory, most BI tools can connect directly to Hive tables or Spark DataFrames through the JDBC API. In practice, unless the data is already structured and aggregated with all of the needed measures, dimensions, and relationships, the user will have to switch back and forth between the BI tool and Hive, Pig, or a programming API. Few business users have the skills needed to do this, so the organization must assign a developer, move the data elsewhere, or implement and maintain a special-purpose “BI-on-Hadoop” tool.
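For illustration, the sketch below shows the kind of Hive work a developer might do to pre-aggregate data for a BI tool, here using the PyHive client; the host, credentials, and table names are hypothetical. This detour is exactly what middleware such as AtScale aims to eliminate.

from pyhive import hive

# Hypothetical connection to a HiveServer2 endpoint on the cluster.
conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Build an aggregate table the BI tool can query directly over ODBC/JDBC.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales_summary AS
    SELECT region, product_line, SUM(revenue) AS revenue
    FROM raw_sales
    GROUP BY region, product_line
""")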

An edge node is a server on the periphery of a Hadoop cluster, which is typically used to broker interactions with other applications. An edge cluster is similar to an edge node, but consists of a cluster of servers on the periphery of the Hadoop cluster rather than a single server.

The AtScale engine resides on an edge node in a Hadoop cluster. With user input through the web-based AtScale Cube Designer, AtScale interacts with the Hive metastore to create and maintain a virtual dimensional data model, or “cube”. Users can specify hierarchies for drill-down, calculated fields and other dimensions as needed to represent the business problem at hand. The foundation data does not change or move, so users can specify different cubes based on the same data for different purposes. BI tools such as Microsoft Excel or Tableau submit SQL or MDX requests to AtScale through ODBC, JDBC, or OLE DB. End users work directly from tools like Excel, as shown in Figure 9-6.

Figure 9-6. Microsoft Excel with AtScale sidecar

AtScale develops and submits an optimized query through an available SQL engine (such as Hive on Tez, Spark SQL, or Cloudera Impala) and returns the results to the BI tool for further processing and end user display. Users can elect whether to retain or drop any aggregate tables created. AtScale provides a facility for managing aggregate tables and a capability to schedule cube generation and updates.

AtScale supports popular BI tools: Tableau, Microsoft Excel, Qlik, Spotfire, MicroStrategy, PowerBI, JasperSoft, SAP Business Objects, and IBM Cognos. It works with the Cloudera, Hortonworks, MapR, and HDInsight Hadoop distributions; Hive on Tez, Spark SQL, and Cloudera Impala SQL engines; and a broad selection of data storage formats, including Parquet, RC, ORC, Sequence, text files, and Hive SerDe. For security, the software offers role-based access control to selectively grant access to data to users across departments and organizations. The application maintains an audit trail of queries executed, so the organization can track request volume, data requested, and run times.

Ben Werther, a veteran of Siebel, Microsoft, and Greenplum, founded Platfora in 2011 with $5.7 million in funding8 from venture capitalists led by Andreessen Horowitz. The company emerged from stealth mode in October 2012 and closed9 a $20 million “B” round shortly thereafter. The company raised10 an additional $38 million in March 2014. (Author’s note: on July 21, 2016, cloud-based software vendor Workday announced plans to acquire Platfora for an undisclosed amount.)

Platfora offers an end-to-end data warehousing and BI platform that runs on an edge cluster next to Hadoop. Platfora Server is a distributed in-memory engine; it operates on copies of the data extracted from Hadoop.

Before end users work with the data, an administrator defines structured Platfora datasets from source data. In addition to defining data structure, the administrator defines access permissions for the dataset.

End users, who work from a browser interface, work with the defined datasets to specify a view of the data they need. Platfora translates these requests into MapReduce or Spark jobs, submits them for execution, and writes the results back to HDFS in Platfora’s proprietary file format. It also retains a copy of the view locally on disk and registers metadata about the view in a catalog.

Platfora presents a user-friendly interface that is accessible to a business user. As the user explores and analyzes the data, Platfora generates in-memory queries against the local copy of the view. If the data to be queried exceeds available memory in the Platfora cluster, the query spills to disk or fails.

For the most part, Platfora works only with data already loaded into Hadoop; it has a limited capability to pull small datasets from other sources. Platfora works with most Hadoop distributions; it can use data stored in HDFS, Hive, MapR FS, and Amazon Web Services’ S3, as well as uploaded files.

Insight-as-a-Service

Many executives are frustrated by what they perceive as a lack of responsiveness and poor service quality from their IT organization. Motivated by a need for speed, they seek out services that can immediately provide them with the performance metrics and insight they desire.

In the past, bypassing the IT organization was difficult because IT physically controlled all of the data. Today, with many business processes delivered through hosted services and Software-as-a-Service, a considerable amount of data already resides in the cloud. Moreover, as functional leaders increasingly control technology spend, they effectively “own” the data.

Vendors with pre-packaged cloud-based solutions address the needs of these executives. One such vendor, Domo, demonstrates the power of the insight-as-a-service concept.

Domo, a startup located in Salt Lake City, Utah, takes a radically different approach to BI. Instead of promoting another set of tools, Domo positions itself as a management solution for busy executives frustrated by the delays and limitations of conventional BI tools. Domo provides these executives with the means to completely bypass IT bottlenecks with a packaged cloud-based Software-as-a-Service delivery model.

Josh James, a successful entrepreneur, founded the company in 2010. James co-founded Omniture, a web analytics startup, in 1996 and led it through a $1.8 billion sale to Adobe Systems in 2009. In late 2010, James acquired a small company offering visualization software and renamed the combined entity Domo. With ample capital—$484 million in eight rounds from 45 investors, including Andreessen Horowitz, Fidelity Investments, Jeff Bezos, Morgan Stanley, and T. Rowe Price—Domo remained in stealth mode for almost five years, developing and improving its offering.

A company operating in stealth mode does not disclose information about its product or service to the public; it does this so potential competitors cannot anticipate its offering and to allow sufficient time to develop a marketable product. The company may disclose information to investors or consultants, but only under strictly enforced nondisclosure agreements. Since they do no marketing, companies operating in stealth mode rarely have revenue or customers, so a company may need substantial funding to remain in stealth mode for an extended period.

Domo emerged11 from stealth mode in April 2015 with a highly developed product. Around a core of standard BI functions (including queries, reports, dashboards, and alerts), Domo offers pre-built role-based and industry-based solutions and apps designed for decision support. The user-facing capabilities operate on a modified MPP columnar database running on Amazon Web Services.12

Domo has also pre-built more than 350 connectors to data sources to expedite data integration. This library of connectors includes the most widely used databases and applications; for marketing alone, there are 51 connectors, including Adobe Analytics, Facebook, Google AdWords, HubSpot, IBM Digital Analytics, Klout, Marketo, Salesforce, SurveyMonkey, Twitter, Webtrends, YouTube, and many others.

Combined with a facility for secure data transfer from on-premises systems, these pre-built connectors and solutions enable Domo to promise rapid time to value. Moreover, Domo’s focus on offering an integrated and customizable role-based decision-making solution differentiates it from conventional BI tools.

When it emerged from stealth in April 2015, Domo claimed to have more than 1,000 paying customers and annual sales of $50 million in 2014.

Business User Analytics

Commercial vendors compete actively to deliver software for predictive analytics that is both easy to use and powerful. This is not a new phenomenon; the following list shows five such products and the year each was introduced.

  • Angoss KnowledgeSeeker (1984)

  • SAS JMP (1989)

  • Dell Statistica (1986)

  • IBM SPSS Modeler (1994)

  • SAP InfiniteInsight (1998)

Three relatively new products deserve more detailed discussion: KNIME and RapidMiner, both introduced in 2006, and Alpine, introduced in 2011. KNIME and RapidMiner operate under an open core model; each offers an open source edition together with commercially licensed extensions. All three are suitable for Big Data, offering push-down integration with Hadoop; Alpine also offers push-down integration with selected data warehouse appliances.

KNIME (rhymes with “lime”) is an open source platform for data integration, business intelligence, and advanced analytics. The platform, based on Eclipse and written in Java, features a graphical user interface with a workflow metaphor. Users build pipelines of tasks with drag-and-drop tools and run them interactively or in batch execution mode. Figure 9-7 shows a view of the KNIME Analytics Platform desktop.

Figure 9-7. KNIME Analytics Platform desktop

KNIME.com AG, a commercial enterprise based in Zurich, Switzerland, distributes the KNIME Analytics Platform under a free and open source GPL license with an exception permitting third parties to use the API for proprietary extensions. The company is privately held and does not disclose details of its ownership. There is no record of venture capital investment in the company.

The free and open source KNIME Analytics Platform includes the following capabilities, all implemented through the graphical user interface:

  • Data integration from text files, databases, and web services

  • Data transformation

  • Reporting through the bundled open source Business Intelligence and Reporting Tool (BIRT)

  • Univariate and multivariate statistics

  • Visualization using interactive linked graphs

  • Machine learning and data mining

  • Time series analysis

  • Web analytics

  • Content analytics, including text and image mining

  • Graph analytics, including network and social network analysis

  • Native scoring, as well as PMML export and import

  • Open API for integration with other open source projects and with commercial tools

KNIME.com AG also distributes a number of commercially licensed extensions offering additional capabilities not included in the open source platform. They include:

  • Enhanced tools for building workflows

  • Authoring tools to create custom extensions

  • Collaboration tools for file and workflow sharing

  • Server-based tools for enhanced security, collaboration, scheduling, and web access

  • Connectors enabling push-down execution in Apache Hive, Apache Impala, and Apache Spark

  • Tools to manage job execution on clustered servers

The KNIME Analytics Platform operates in-memory on single machines running Linux, Windows, or Mac OS. The software is multi-threaded to use multiple cores on a single machine. Server and cloud extensions run on the same operating systems. KNIME.com AG supports the Hive and Impala extensions on Cloudera, Hortonworks, and MapR Hadoop distributions; the company supports the Spark extension on Cloudera and Hortonworks.

Since KNIME buffers data to disk, it can in theory handle arbitrarily large datasets that exceed memory. Disk buffering, however, affects performance and can lead to longer runtimes.

The KNIME Big Data Extensions enable KNIME users to push SQL workloads into Hadoop through Apache Hive or Apache Impala, and to run Apache Spark applications. The Spark Executor serves as an interface to the Spark MLlib package, enabling users to run classification, regression, clustering, collaborative filtering, and dimension reduction tasks in Spark. The software includes a PMML 4.2 interface for prediction in Spark, and also enables the user to perform data preprocessing and manipulation with Spark.
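KNIME users assemble such jobs graphically, but the work itself lands in Spark MLlib. A minimal PySpark sketch of a comparable classification task follows; the file path and column names are hypothetical, and this is not code that KNIME emits verbatim.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-example").getOrCreate()

# Hypothetical customer table already stored in the cluster.
df = spark.read.parquet("hdfs:///data/customers.parquet")

# Assemble predictor columns into the single vector column MLlib expects.
features = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_calls"],
    outputCol="features",
).transform(df)

# Train the model inside the cluster; no data leaves Hadoop.
model = LogisticRegression(featuresCol="features", labelCol="churned").fit(features)
print(model.coefficients)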

KNIME.com AG offers commercial technical support for the extension software. For the open source KNIME Analytics Platform, it offers extensive product documentation and a community forum for troubleshooting. The company also certifies partners and resellers who offer consulting and support services.

RapidMiner is a mixed commercial and open source software platform for advanced analytics developed and distributed by RapidMiner, Inc. of Cambridge, Massachusetts. Started as a predictive analytics project at the Technical University of Dortmund, RapidMiner has expanded its capabilities to span the entire advanced analytics process, from data integration to deployment.

RapidMiner, Inc. launched in 2006 (under the corporate name of Rapid-I) to drive software development, support, and distribution. The company moved its headquarters to the United States in 2013 and rebranded as RapidMiner. Since then, it has secured $36 million in venture capital in three rounds. The most recent, a $16 million “C” round, closed13 in January 2016.

Under a model it calls “business source,” RapidMiner distributes three software editions:

  • Basic edition: Available under a free and open source license.

  • Community edition: Available under a free commercial license with registration.

  • Professional edition: Commercially licensed under a paid subscription.

The core RapidMiner platform (Basic edition) includes:

  • Data ingestion from Excel, CSV, and open source databases, data blending, and data cleansing functions.

  • Diagnostic, predictive, and prescriptive modeling functions.

  • R and Python script execution.

The free Community edition also includes:

  • A small cloud instance.

  • Community technical support.

  • Access to the RapidMiner marketplace.

  • The “Wisdom of Crowds” feature. RapidMiner collects detailed usage information from its user community and leverages this information to provide recommended actions.14

The Professional edition also includes:

  • Reusable building blocks and processes for the Design Studio.

  • Access to commercial databases, cloud data sources, NoSQL datastores, and other file types.

  • A larger cloud instance.

RapidMiner offers a workflow interface that enables the user to construct complex analytic “pipelines,” as shown in Figure 9-8.

Figure 9-8. RapidMiner

In addition to the desktop version, RapidMiner commercially licenses software for servers and for Hadoop (branded as “Radoop”). The server version supports collaboration, performance, and deployment features; Radoop supports push-down integration with Hive, MapReduce, Mahout, Pig, and Spark. RapidMiner supports Radoop with Cloudera CDH, Hortonworks HDP, Apache Hadoop, MapR, Amazon EMR, and Datastax Enterprise.

RapidMiner has implemented about 1,500 functions in Spark, and it permits the user to embed SparkR, PySpark, Pig, and HiveQL scripts. RapidMiner supports the software with the open source Apache Hadoop distribution, plus distributions from Cloudera, Hortonworks, Amazon Web Services, and MapR; DataStax Enterprise NoSQL database; Apache Hive and Apache Impala; and Apache Spark.

RapidMiner’s key strengths are its easy-to-use interface, broad functionality, and strong integration with Hadoop. While RapidMiner’s predictive analytics and optimization features are strong, its visualization and reporting capabilities are limited, which makes it unsuitable for some users.

Alpine Data Labs, founded in 2011, offers Alpine ML, software with a visual workflow-oriented interface and push-down integration to relational databases, Hadoop, and Spark. Alpine claims support for all major Hadoop distributions and several MPP databases, though in practice most customers use Alpine with Pivotal Greenplum database15. (Alpine and Greenplum have common roots in the EMC ecosystem). Alpine ML supports data ingestion, feature engineering, machine learning, and scoring functions, all of which execute in the datastore.

Alpine Enterprise, previously branded as Chorus, facilitates collaboration among members of a data science team and offers data cataloging and search features. Alpine Touchpoints, a new product, offers tools to embed predictions in interactive applications.

In November 2013, Alpine closed a $16 million Series B round of venture capital financing.

Automated Machine Learning

Analysts skilled in machine learning are in short supply. VentureBeat16, The Wall Street Journal17, the Chicago Tribune18, and many others all note the scarcity; a McKinsey report19 projects a shortage of people with analytical skills through 2018. The scarcity is so pressing that Harvard Business Review suggests20 that you stop looking, or lower your standards.

Can we automate the work data scientists do? In IT Business Edge, Loraine Lawson wonders21 if artificial intelligence will replace the data scientist. In Forbes, technology thought leader Gil Press confidently asserts22 that the data scientist will be replaced by tools; Scott Hendrickson, Chief Data Scientist at social media integrator Gnip, agrees.23

Data mining web site KDnuggets, which caters to data scientists, recently published24 a poll asking its members when most expert-level data scientist tasks will be automated. Only 19% of respondents believe such tasks will never be automated; 51% said they would be automated within the next 10 years.

Automated modeling techniques are not new. In 1995, Unica Software introduced Pattern Recognition Workbench (PRW) , a software package that used automated test and learn to optimize model tuning for neural networks. Three years later, Unica partnered with Group 1 Software (now owned by Pitney Bowes) to market Model 1, a tool that automated model selection over four types of predictive models. Rebranded several times, the original PRW product remains as IBM PredictiveInsight, a set of wizards sold as part of IBM’s Enterprise Marketing Management suite25.

KXEN, a company founded in France in 1998, built its analytics engine around an automated model selection technique called structural risk minimization.26 The original product had a rudimentary user interface, depending instead on API calls from partner applications; more recently, KXEN repositioned itself as an easy-to-use solution for marketing analytics, which it attempted to sell directly to C-level executives. This effort was modestly successful, leading to sale of the company in 2013 to SAP for an estimated27 $40 million.

Early efforts at automation from Unica, MarketSwitch, and KXEN fell short in two ways. First, they “solved” the problem by defining it narrowly; by limiting the scope of the solution search to a few algorithms, they minimized the engineering effort at the expense of model quality and robustness. Second, by positioning their tools as a means to eliminate the need for expert analysts, they alienated the few people in customer organizations who understood the product well enough to serve as champions28.

Data scientists say29 they spend 50-80% of their time on data wrangling. In theory, this means organizations can mitigate the shortage of data scientists by improving data warehousing and management practices; in practice, this is not easy to do. Data warehousing is expensive, and data scientists often support forward-looking projects that move too fast for the typical data warehousing organization. Most data scientists see data wrangling as necessary and unavoidable, and to a considerable degree they are right.

Automation can, however, reduce the time, cost, and pain of data wrangling. Built-in integration with widely used data sources, for example, minimizes the time and cost to extract and move data. Interfaces to data warehousing and business intelligence platforms enable data scientists to directly leverage data that is already cleansed, minimizing duplicate effort. Features that automatically detect and handle missing data, outliers, complex categorical fields, or other “problematic” types of data enable data scientists to work with data “as is,” and eliminate the need for manual processing.
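A toy Python sketch of two such automated treatments, median imputation and outlier winsorization, appears below; it uses pandas and scikit-learn on a hypothetical file, and commercial tools wrap far more sophisticated logic around the same idea.

import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw extract; keep only the numeric columns for this illustration.
df = pd.read_csv("leads.csv")
numeric = df.select_dtypes(include="number")

# Fill missing values with each column's median.
imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(numeric),
    columns=numeric.columns,
)

# Winsorize extreme values to the 1st and 99th percentiles.
cleaned = imputed.clip(
    lower=imputed.quantile(0.01),
    upper=imputed.quantile(0.99),
    axis=1,
)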

Beyond basic data cleansing and consolidation, the requirements for data transformation (“feature engineering”) depend entirely on the algorithm to be used for model training. Some algorithms, for example, will only work with categorical predictors, so any continuous variables in the input data set must be binned; other algorithms have the opposite requirement. Automated feature engineering must be linked to automated model specification and selection, since the two are intrinsically linked.

The best way to determine the right algorithm for a given problem and data set is a test-and-learn approach, where the data scientist tests a large number of techniques and chooses the one that works best on fresh data. (The No Free Lunch30 theorem formalizes this concept.) There are hundreds of potential algorithms; a recent benchmark study tested31 179 for classification alone.
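The sketch below illustrates the test-and-learn idea in Python with scikit-learn: three candidate algorithms are scored with cross-validation on a bundled sample dataset, and the best performer wins. Real projects test far more candidates, but the logic is the same.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Score each candidate on held-out folds; keep whichever generalizes best.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")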

When computing power was scarce and expensive, modelers dealt with this constraint by limiting the search to a single algorithm—or a few, at most. They justified this practice by minimizing the importance of predictive accuracy or by championing one technique above all others. This led to endless unempirical flame wars between advocates of one algorithm or another.

Cheap and pervasive computing power ends these arguments once and for all; it is now possible to test the power of many algorithms, selecting the one that works best for a given problem. In high-stakes hard-money analytics—such as trading algorithms, catastrophic risk analysis, and fraud detection—small improvements in model accuracy have a substantial bottom-line impact32, and data scientists owe their clients the best possible predictions.

SAS and IBM recently introduced automated modeling features to their data mining workbenches. In 2010, SAS introduced SAS Rapid Predictive Modeler33, an add-in to SAS Enterprise Miner. Rapid Predictive Modeler is a set of SAS Macros supporting tasks such as outlier identification, missing value treatment, variable selection, and model selection. The user specifies a data set and response measure; Rapid Predictive Modeler develops and executes a test plan, measuring results from each experiment. The user controls execution time by selecting basic, intermediate, or advanced methods. In 2015, SAS introduced SAS Factory Miner, a more advanced product that runs on top of SAS Enterprise Miner.

IBM SPSS Modeler includes a set of automated data preparation features as well as Auto Classifier, Auto Cluster, and Auto Numeric nodes. The automated data preparation features perform such tasks as missing value imputation, outlier handling, date and time preparation, basic value screening, binning, and variable recasting. The three modeling nodes enable the user to specify techniques to be included in the test plan, specify model selection rules, and set limits on model training.

The caret34 package in open source R is a suite of productivity tools designed to accelerate model specification and tuning. The package includes pre-processing tools for dummy coding, detecting zero variance predictors, and identifying correlated predictors; the package also includes tools to support model training and tuning. The training function in caret currently supports 217 modeling techniques; it also supports parameter optimization within a selected technique, but does not optimize across techniques. Users write R scripts to call the package, run the required training tasks, and capture the results.

Auto-WEKA35 is another open source project for automated machine learning. First released in 2013, Auto-WEKA is a collaborative project driven by four researchers at the University of British Columbia and Freiburg University. Auto-WEKA currently supports classification problems only. The software selects a learning algorithm from 39 available algorithms, including 2 ensemble methods, 10 meta-methods, and 27 base classifiers.36 Since each classifier has many possible parameter settings, the search space is very large; the developers use Bayesian optimization to solve this problem.37

Challenges in Machine Learning38 (CHALEARN) is a tax-exempt organization supported by the National Science Foundation and commercial sponsors. CHALEARN organizes the annual AutoML39 challenge, which seeks to build software that automates machine learning for regression and classification. The most recent conference40, held in Lille, France in July 2015, included presentations41 featuring recent developments in automated machine learning, plus a hack-a-thon.

DataRobot, a Boston-based startup founded by insurance industry veterans, offers a machine learning platform that combines built-in expertise with a test-and-learn approach. By expediting the machine learning process, DataRobot enables organizations to markedly improve data scientist productivity and expand the pool of analysts without compromising quality. DataRobot has assembled42 a team of Kaggle-winning data scientists, whose expertise it leverages to identify new machine learning algorithms, feature engineering techniques, and optimization methods.

The DataRobot platform uses parallel processing to train and evaluate thousands of candidate models in R, Python, H2O, Spark, and XGBoost. It searches through millions of possible combinations of algorithms, pre-processing steps, features, transformations, and tuning parameters to identify the best model for a dataset and prediction problem.
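DataRobot’s search is proprietary, but the underlying idea, treating preprocessing steps and algorithms as one combined search space, can be sketched in a few lines of scikit-learn. The example below is a deliberately small toy under that assumption, not a description of DataRobot’s implementation.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(max_iter=5000))])

# The grid swaps whole preprocessing steps and algorithms, not just their parameters.
grid = {
    "scale": [StandardScaler(), "passthrough"],
    "model": [LogisticRegression(max_iter=5000),
              RandomForestClassifier(n_estimators=200, random_state=0)],
}

search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc").fit(X, y)
print(search.best_params_, round(search.best_score_, 3))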

DataRobot leverages43 the cloud (Amazon Web Services44) to provision servers on demand as needed for large-scale experiments; the software is also available for on-premises deployment and in Hadoop. Users interact with the software through a browser-based interface, or through an R API.

In August 2014, DataRobot raised45 $21 million in Series A venture capital financing. Recruit Holdings, a Tokyo-based company, announced46 an investment in DataRobot in November 2015. DataRobot announced an additional $33 million in a Series B round on February 11, 2016.

The New Self-Service Analytics

In this chapter, we surveyed six key innovations in self-service analytics:

  • Self-service visualization via Tableau and its imitators.

  • Self-service data blending via Alteryx and similar products.

  • BI in Hadoop, especially middleware like AtScale that enables organizations to leverage existing BI assets.

  • Cloud-based prebuilt services like Domo that enable functional managers to bypass the IT bottleneck.

  • Business-oriented open core analytics platforms like RapidMiner and KNIME that enable collaboration between experts and business users.

  • Open and transparent expert systems for machine learning like DataRobot that make machine learning accessible for a broader pool of users.

Self-service visualization tools like Tableau work most effectively with single tables of clean data, since they lack strong data blending capabilities. Consequently, they belong at the end of an analytics value chain, where they facilitate collaboration between expert and non-expert users. End users working with Tableau, for example, can visualize data in many different ways; this saves enormous amounts of time for expert analysts, who can deliver a table or dataset rather than hundreds of charts.

The combination of a data blending tool like Alteryx and a data visualization tool like Tableau offers a powerful set of self-service capabilities. Complex data blending with rough data sources requires a relatively high level of skill in Alteryx, so this combination is better suited to the business analyst than the casual information user.

Enterprises should strive for “BI everywhere”—the idea that an end user should be able to use the same self-service tooling regardless of where the data is physically stored. Tableau partially accomplishes this end because it has a flexible and easily configured back-end that can work with a wide range of data storage options. However, pointing Tableau directly at a Hive metastore in Hadoop isn’t for the faint-hearted. AtScale middleware makes a Hadoop cluster as easy to access as a relational database.

Insight-as-a-service offerings like Domo seek to support the entire analytics value chain. Their value proposition is speed and simplicity for the functional manager; with predefined reports and dashboards, they can quickly deliver essential information, largely bypassing the IT organization altogether.

Analytics platforms like RapidMiner and KNIME inhabit a middle ground between analytic programming languages (such as Python and R) and simple desktop analytic tools. With a workflow-oriented drag-and-drop interface, they save time for the business analyst, while offering rich analytic functionality. They are also highly extensible, offering the ability to embed user-defined functions in a workflow.

Automated machine learning tools like DataRobot save enormous amounts of time for expert data scientists, and they also broaden the pool of people in an organization who can build predictive models. With the ability to tightly integrate with production systems, they radically reduce the time to value for machine learning.

People in organizations have diverse needs for analytics; there is no single tool that meets all needs. Enterprises will continue to employ experts even as they invest in simple tools with broad appeal. If anything, the job market for experts is tighter than ever.
