
10. Handbook for Managers

How to Profit from Disruption


Let’s review what we have covered so far in this book.

Chapter One defined disruptive innovation and made the argument for disruption in the business analytics marketplace today. The remainder of the chapter covered basic concepts, such as the demand for data-driven insight and the characteristics of the analytics value chain. We distinguished between disruption within the analytics value chain and disruption of other markets by analytics.

Chapter Two briefly recapped the history of business analytics in the modern era. We showed previous examples of disruption in the value chain, such as the introduction of the enterprise data warehouse. We also provided examples where analytics disrupted other markets, as in the cases of credit scoring and fraud detection. Finally, we introduced you to key trends driving disruption today, including the digital transformation of the economy and declining costs of computing and storage.

Chapter Three detailed the open source business model. We introduced you to open source licensing and distribution, and to commercial business models based on open source software. We also provided a detailed profile of Python and R, the two leading open source software platforms for analytics.

In Chapter Four, we covered Hadoop and its ecosystem. We noted the distinction between Apache Hadoop and its commercial distributions, and documented the components most widely used together with Hadoop. For analytics, we distinguished between the rudimentary capabilities available under Hadoop 1.0 and the increasingly powerful and sophisticated capabilities available today under Hadoop 2.0.

Chapter Five documented the rapidly declining cost of computer memory and the corresponding rise of large-scale in-memory computing. The chapter covered Apache Spark, Apache Arrow, Alluxio (Tachyon), and Apache Ignite.

In Chapter Six, we briefly surveyed the history of streaming analytics, taking note of the longstanding gap between vision and reality in the category. We surveyed streaming data sources, such as Apache Kafka and Amazon Kinesis, and open source streaming analytics platforms: Apache Apex, Apache Flink, Apache Samza, Apache Spark Streaming, and Apache Storm.

Chapter Seven covered fundamentals of cloud computing and the elastic business model. We surveyed the capabilities of the top three cloud platforms: Amazon Web Services, Microsoft Azure, and Google Cloud Platform.

In Chapter Eight, we reviewed key trends in machine learning: convergence, competitions, ensemble learning, scalability, and Deep Learning. We offered a detailed introduction to neural networks and Deep Learning. Finally, we surveyed advanced machine learning platforms, including distributed engines, in-database libraries, and Deep Learning frameworks.

Finally, in Chapter Nine, under self-service analytics, we offered a balanced perspective on the role of casual and expert users in enterprise analytics and proposed a model of user personas. We profiled a number of products that exemplify innovations in this area, including data visualization, data blending, BI on Hadoop, “Insight as a Service,” business user analytics, and automated machine learning.

At this point, if you’re not convinced of the power and richness of emerging innovations in analytics today, stop reading; Chapter Ten is not for you.

In this chapter, we offer the manager a handbook for action to profit from innovation and disruption. We cover three broad areas:

  • People and organization

  • Processes

  • Platforms and tools

Some of the strategies we propose may seem radical. This handbook assumes that you want your organization to profit from innovation and disruption in the analytics marketplace. Either you do or you don’t—only you can decide.

To profit from innovation and disruption, it’s likely that your organization will need to do some things differently. There may be political or other barriers to overcome, and change management is always a concern. We don’t trivialize the difficulty of organizational change. However, this is not a book on organization politics or change management; it’s about how to profit from innovation and disruption.

People and Organization

In analytics, we tend to focus too much attention on purely technical problems. But people and organization matter—especially so in a disrupted world.

  • Organize around clients.

  • Define the Chief Analytics Officer’s role.

  • Make costs visible to clients.

  • Hire the right people.

Motivated, well-organized people with basic tools and the right incentives outperform poorly motivated and poorly organized people with gold-plated tools every time.

Organize Around Clients

Organizing for a disrupted world requires a laser-like focus on client needs. (We use the term client instead of customer because the practice of analytics is a professional service. Clients can be internal or external—the same principles apply.) Every organization is different; we present here a typical model of analytic needs, together with the skills and tools needed to meet those needs.

Enterprises typically organize analytics around technical functions: data integration, data warehousing, business intelligence, and so forth. There is a logic to this, including knowledge sharing and career development. But it is imperative that analytics teams organize around internal and external clients.

In Chapter One, we outlined five distinctly different sources of demand for data-driven insight:

  • Strategic: Insight for C-level executives.

  • Managerial: Insight for functional executives.

  • Operational: Insight for business process optimization.

  • Developmental: Insight for new products and services.

  • Differentiating: Insight for (external) customers.

Each of these groups of internal and external clients has distinct needs for people, skills, and tools:

Strategic. Most of the analysis for C-level executives is ad hoc, unrepeatable, and urgent. Practitioners require deep knowledge of the business or industry, a highly professional approach, and a strong grasp of visual presentation techniques. A strategic analysis team requires broad access to internal and external data, as well as capabilities for ad hoc data integration, queries, and reporting.

Managerial. Rigorous performance measurement and ad hoc analysis for business planning are the principal requirements at this level. Practitioners often have a finance background; they must be familiar with the organization’s performance metrics and business planning process. Conventional data warehousing and business intelligence systems perform well for performance metrics. A managerial analytics team must be able to access the performance measurement system for ad hoc performance reports and needs tools to develop forecasts for business plans.

Operational. Business process stakeholders need low-latency, real-time process metrics, and they need deployable machine learning tools for optimization. Conventional business intelligence and reporting systems work well for operational metrics if they are deployed for real-time analysis. Operational analysts should have a strong background in statistics, machine learning, and content analytics, with the programming skills needed for model deployment.

Developmental. Product and service development executives need insight to support the development lifecycle from concept through product introduction. Analytics practitioners need a background in consumer and marketing research combined with statistical training in experimental design, “test and learn” techniques, and forecasting. Lightweight “desktop” tooling is usually sufficient in this area, since these practitioners rarely work with Big Data.

Differentiating. Customer-facing analytics, such as recommendation engines, are necessarily production-grade applications. Open source software is preferable to commercial software, especially if the organization plans to distribute software components to the end customer. Practitioners in this area should have a software engineering background supplemented with machine learning training. Knowledge of programming languages—such as C, Java, Scala, Python, and R—is required.

While there is a great deal of variation across organizations, the general rule is that junior-level analysts tend to report to the departments they support, such as marketing or credit risk; expert analysts tend to be centrally grouped; and specialists in data integration, data management, software administration, and provisioning tend to report to the IT organization. As a result, no senior executive holds responsibility and accountability for the analytics value chain as a whole.

A better organization model separates technical functions in the analytics value chain from IT and places them under a Chief Analytics Officer (CAO). Working analysts either remain in the functional organizations with a dotted line reporting relationship to the CAO or they report into the CAO organization and are assigned to support functional organizations.

Define the Chief Analytics Officer’s Role

There is an emerging trend towards designating an executive as the Chief Analytics Officer (CAO). While this is not yet a universal practice, it signals that the enterprise believes analytics is a key strategic capability.

In theory, the Chief Analytics Officer (CAO) should be accountable for the entire analytics value chain, from data to insight. In practice, however, responsibility for the analytics value chain tends to be divided and is likely to remain so. The IT organization usually manages the data warehouse, the processes that acquire data, and “enterprise-grade” business intelligence tools. IT also manages hardware and software procurement.

Functional departments manage “shadow” IT operations, which may include data marts and analytics tooling. “Expert” users sometimes reside in IT; more often, they reside in functional departments. In specialized analytic disciplines, such as actuarial analysis or credit risk, analysts generally report to a functional manager; this is unlikely to change.

Divided responsibilities lead to some dysfunctional outcomes. Since IT generally owns the data warehouse but not the delivery of insight, there is a tendency to view the collection and management of data as an end in itself; actual insight is someone else’s concern. While IT organizations are often very capable in the construction and management of data warehouses, they tend to overlook or ignore functional managers’ unmet needs. That is why so many functional managers have their own “shadow” IT operation.

The CAO should directly manage the team responsible for strategic analytics (as defined in the previous section). This team handles ad hoc requests for insight from the organization’s leadership, and it should be staffed and tooled accordingly. A strategic analytics team requires highly skilled and professional people with broad access to internal and external data and the tooling necessary for quick response.

For managerial analytics, the principal asset is the performance measurement system: the enterprise data warehouse and supporting business intelligence platform. IT’s core competence is in the operation of production systems, so it makes sense that the CIO manages the performance measurement system. The CAO, however, should own the process for driving requirements for performance measurement, coordinating across functional stakeholders.

Business process optimization typically requires skills in advanced analytics and operations research. Some functional departments may already have expert teams to support these capabilities; however, it also makes sense to pool these resources centrally to support departments that have not yet developed their own team. The CAO should manage these pooled resources, and also drive standards and best practices across the organization.

Most organizations have dedicated teams for product and service development, typically domiciled in the Marketing organization. The CAO’s role in this process should be to drive training and adoption of best practices, broker requirements with the IT organization, and define common standards.

Software and hardware provisioning require careful balancing of the CAO and CIO responsibilities. The CIO generally manages on-premises provisioning and often manages cloud provisioning as well. Consistent with security standards, however, the CAO should be free to move workloads to the cloud if the organization cannot provide competitive pricing or service levels.

In a similar manner, while the CIO generally manages software licensing, procurement, and support, the CAO should own this responsibility for advanced analytics software, which generally falls outside of IT’s core competence.

Data ownership is another area that requires careful balancing of CIO, CAO, and functional responsibilities. Data ownership and management are two different things. The data owner controls data access (within the framework of an organization’s overall policies) and defines the business rules under which the data is captured; the data manager handles administration and custody on behalf of the owner. Generally, data should “belong” to the organization that funds its production; the CIO should manage production systems and databases; the CAO should manage analytic datastores.

Make Costs Visible to Clients

The question of “chargebacks” may seem overly detailed for a book on analytics, but incentives matter.

The author once met with an IT executive of a large healthcare provider, who expressed frustration that SAS users were unwilling to switch to lower-cost alternatives. Asked whether user departments contributed to the cost of the software, the executive replied no: the organization wanted to encourage people to use analytics.

You can see the problem there: when costs are invisible, users prefer the gold-plated option.

There are only two viable models for software provisioning. Either software selection is a centrally managed process, with costs absorbed as overhead, or individual departments and users can pick and choose their software and pay the costs out of their own budgets.

The same principle applies to people costs, provisioning, and data collection. Nothing clarifies needs so quickly as a requirement to pay for what you use.
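
To make the incentive mechanics concrete, here is a minimal sketch of a usage-based chargeback calculation in Python. The departments, resources, and unit rates are hypothetical placeholders; in practice, the usage records would come from your provisioning platform’s metering logs.

from collections import defaultdict

# Hypothetical metered usage records: (department, resource, units consumed).
# In practice, these would come from the provisioning platform's usage logs.
usage_records = [
    ("marketing", "compute_hours", 120.0),
    ("marketing", "storage_tb_months", 4.0),
    ("credit_risk", "compute_hours", 310.0),
    ("credit_risk", "software_seats", 5.0),
]

# Hypothetical unit rates, in dollars; set these from your actual costs.
unit_rates = {
    "compute_hours": 0.45,       # per compute hour
    "storage_tb_months": 22.00,  # per terabyte-month of storage
    "software_seats": 150.00,    # per licensed seat per month
}

def monthly_chargebacks(records, rates):
    """Aggregate metered usage into a per-department monthly charge."""
    charges = defaultdict(float)
    for department, resource, units in records:
        charges[department] += units * rates[resource]
    return dict(charges)

for department, charge in monthly_chargebacks(usage_records, unit_rates).items():
    print(f"{department}: ${charge:,.2f}")

Once each department sees a line item like this against its own budget, the gold-plated option must justify itself against cheaper alternatives.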

Hire the Right People

In mid-2016, there is a seller’s market for qualified data scientists. As a result, many people who aren’t really qualified call themselves data scientists. In the absence of professional certification, there are many articles in the media1,2 covering “interview questions” for data scientists. Most of these cover data science trivia, and many highly accomplished analysts would fail them.

Testing candidates on theoretical knowledge is misguided. A better approach is to ask the candidate about actual projects they have completed:

  • What business problem did you address?

  • How did you go about solving the problem?

  • Did you work with others? If so, describe your interactions with them.

  • What tools did you use? What technical problems did you solve?

  • In the course of the project, what worked well? What could have worked better?

  • How did the project end?

Top candidates can easily point to dozens of projects to which they have contributed. Many excellent data scientists are active on Kaggle or other competitive platforms and may have contributed to an open source analytics project; these are indicators that the individual has a good command of the discipline.

At a junior level, the key characteristic to assess is motivation: does this person have a burning desire to perform analysis? Even new graduates can point to examples of analysis they have done—research projects, or independent work with public data sets. Learning an open source analytic language like R or Python is another key indicator of motivation.

A candidate for a position in analytics who cannot point to actual projects, or who has never learned an analytic language, is not serious.

Analytics executives sometimes ask if candidates should be required to know specific analytic tools and, if so, which ones. The short answer is that it depends on what platform you have established as a standard. (If you haven’t established a standard, you need to do so; see “Build an Open Source Stack” later in this chapter.) If you have standardized on SAS, look for people who know SAS. While it is theoretically possible to retrain a candidate who is otherwise qualified, some individuals never make the transition.

For any candidate in any organization, cultural fit is essential. It is unwise to generalize about the personalities of data scientists or analysts; it’s fair to say, however, that successful data scientists and analysts are able to engage with clients to understand business problems, explain results, and work collaboratively on a team. Gauging these qualities should be part of your evaluation.

Processes

Principles of agile development, as expressed3 in the Agile Manifesto, apply to business analytics as well as to general software development. For convenience, we restate the 12 principles (slightly paraphrased for business analytics):

  1. Satisfy clients through early delivery.

  2. Welcome changing client requirements.

  3. Deliver work product frequently.

  4. Cooperate closely with business stakeholders.

  5. Build motivated teams, and trust them.

  6. Communicate face-to-face.

  7. Work product is the principal measure of progress.

  8. Work at a sustainable pace.

  9. Focus continuously on technical excellence and good design.

  10. Simplify problems.

  11. Self-organizing teams deliver the best architectures, requirements, and designs.

  12. Reflect regularly on how to be effective and adjust accordingly.

We apply these principles separately to business intelligence and machine learning in this chapter.

Separately, we note that IT-led data warehousing operations often collect too much of the wrong kind of data, and not enough of the data needed to drive critical insight. To correct that, we propose a lean data strategy.

Practice Agile Business Intelligence

Despite the growth of self-service BI, most organizations continue to employ trained specialists to satisfy ad hoc requests for analysis. These specialists are often staffed centrally; functional teams submit written requests with detailed requirements. Specialists get into the habit of delivering exactly what is spelled out in the requirements. Since functional managers lack the specialists’ expertise, they may not know precisely what they want; this communication gap leads to disputes. And because the specialist team always has a work backlog, any request takes days or weeks, no matter how trivial.

While self-service BI mitigates this problem by engaging the requestor directly in production of the analysis, there are limits to what can be accomplished simply through tooling. Self-service BI works best when the data is well-organized, accessible, and limited in scope. Even with the best BI tools, managers tend to delegate the BI task, so self-service BI may simply create a new breed of specialist.

Agile principles suggest that organizations can resolve the BI bottleneck simply by distributing specialists into the functional teams, co-locating them for maximum interaction with business stakeholders. This approach makes it possible for specialists to anticipate business needs for insight, help the requestor frame the requirements, and work interactively with the requestor to develop a solution.

This approach does not rule out investing in self-service BI tools. As a rule, organizations that distribute BI specialists into functional teams discover that the total demand for data-driven insight increases. The BI specialist serves to spearhead broader use of the self-service tool and collaborates with business stakeholders when self-service tools are not sufficient to solve the problem.

Practice Agile Machine Learning

Machine learning differs from business intelligence in two ways: the deliverable is a working predictive model rather than a report, table, or chart; and the process itself requires a higher level of skill and expertise. Another key difference: while functional teams use business intelligence constantly, the demand for machine learning tends to be more selective. In the typology of demand for insight discussed earlier in the chapter, the greatest demand for machine learning stems from operational business process optimization and from differentiating products and services.

Agile principles suggest that, rather than domiciling expert data scientists in specialist teams for short-term project engagements, organizations will achieve better results by placing them directly into process improvement or product development teams. As is the case with BI specialists, placing data scientists into teams ensures a collaborative approach to the design, development, and evaluation of machine learning models. It also enables the data scientist to develop domain knowledge and an understanding of the business context for machine learning.

Agile principles imply some changes to standard data science practices.

First, the work product from machine learning is a production scoring model. This differs from standard practice, where modelers often view their work as complete once they build a satisfactory model in the lab environment; deployment is someone else’s problem.

Second, data scientists should evaluate predictive models solely on how they perform in production. “Sandbox” testing is useful for preliminary model selection, but performance in production is not always the same as sandbox performance. Where they differ, production performance is the correct metric.

These two principles imply a more rapid cycle time into production than many data scientists are accustomed to: quickly deliver an unbiased model that outperforms naïve criteria, then continuously improve it by examining prediction errors, supplementing data sources, testing new training algorithms, and so forth.

The need for rapid deployment implies a third key principle: as much as possible, data scientists should avoid modifications to the data that cannot be reproduced in the production scoring process. Data scientists like to enhance raw data in ways that improve the predictive power of a machine learning model, but these modifications can make the model harder to deploy, since any changes made to the data in the modeling process must be reproduced in production.
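
To illustrate the principle, here is a minimal sketch in Python using scikit-learn; the data and features are synthetic stand-ins. The point is architectural: bundling the data transformations and the model into a single pipeline artifact guarantees that whatever was done to the training data is reproduced exactly at scoring time.

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: rows are customers, columns are raw features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 0] + rng.normal(size=1000) > 0).astype(int)

# The scaling step is fitted together with the model, so every modification
# made to the data during training is replayed exactly at scoring time.
model = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])
model.fit(X_train, y_train)

# Ship a single artifact to production; the scoring process needs no
# separate re-implementation of the feature engineering.
joblib.dump(model, "scoring_model.joblib")

# In production: load the artifact and score raw, untransformed records.
production_model = joblib.load("scoring_model.joblib")
scores = production_model.predict_proba(rng.normal(size=(3, 5)))[:, 1]
print(scores)

The unit of deployment is the fitted pipeline, not the bare model, and that is what makes a rapid production cycle feasible.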

Rapid cycle time also has implications for tooling. Platforms that automate routine data science functions and enable large-scale testing are highly desirable. So are capabilities that tightly integrate model development with model scoring.

Develop a Lean Data Strategy

In 2015, the technology research firm Forrester surveyed4 more than 1,800 technology decision-makers in organizations around the world. Forrester asked respondents to estimate the percentage of their data currently used for business intelligence. Respondents reported separately for structured, semi-structured, and “unstructured” data; the average response by category was:

  • Structured data: 40% used

  • Semi-structured data: 27% used

  • “Unstructured” data: 31% used

Forrester interprets this low utilization as a problem with tools: if organizations simply invest in self-service business intelligence tools, end users will tap the data. There is golden insight hidden inside that unused data; all your organization needs to do is buy another piece of software and all will be revealed. However, there are a number of possible explanations for the low data utilization other than tool availability:

  • Data may be structured and stored in a schema that is difficult for most users to navigate, or in a relational model that does not match the way managers think about the business.

  • Data lineage and metadata may be poorly documented, so that managers do not trust the data.

  • Data security policies may be unduly restrictive, preventing wide use of the data in the organization.

  • Data may not be catalogued, and prospective users simply do not know what data is available.

Note that if any of these conditions are true, the IT organization’s data warehousing initiative has failed. There may be good reasons for the failure, such as lack of budget, resources, or strategic alignment. But failure is failure.

There is one more possibility: the data is not used because it has no useful information value.

That idea may seem heretical in the era of Big Data, but let’s take a moment to explore it. By definition, data that nobody uses has no value. (It may have potential value for some theoretical future user, but until that user materializes the data is just sitting around taking up storage space.) The whole point of collecting and managing data is to produce useful insight; no insight, no value delivered. The only question that matters is whether the unused data has potential value; is there gold inside that pile of junk, or is it just junk?

There are a number of reasons to be skeptical of claims that your unused data has valuable “hidden” insight:

  • Any data—whether it is structured, semi-structured, or “unstructured”—is accessible with the right tools; if your unused data is valuable, why isn’t anyone using it today?

  • Motivated analysts climb mountains to get valuable data; if necessary, they learn new tools. Are your analysts unmotivated? New tools won’t solve that problem.

  • There are few recorded cases (if any) where an analyst produced useful insights by trolling through “found” data.

On that last point, data warehousing vendors have hyped the value of such trolling for years. The best example is the “beer and diapers” story.

In 1992, an analysis team at Teradata analyzed 1.2 million market baskets from 25 Osco drug stores and discovered that between 5:00 p.m. and 7:00 p.m. customers purchased beer and diapers together.5 Osco never did anything with the insight, because there were no clear merchandising implications. Nevertheless, Teradata’s marketing team cited the example as the kind of insight that justifies investing in a data warehouse. The beer-and-diapers story became part of the folklore of the data mining community.

In the same Forrester survey, two-thirds of the respondents reported that the majority of their organization’s business intelligence needs are met by “shadow” IT operations—processes that functional managers assemble by themselves. Managers do not sit passively and wait for IT to deliver the intelligence they need; they actively build their own processes, hiring people and investing in tools if necessary to do so.

To summarize: organizations collect a lot of data that is not used. At the same time, they fail to deliver the information functional managers do want to use. That is dysfunctional.

There are several reasons for the dysfunction.

One is a lack of input from users and prospective users. It seems obvious that user input is essential to good data warehouse design; yet, anyone with working experience in enterprise analytics can cite examples of highly touted projects built without it. Collecting user input is hard, and it takes time; prospective users often do not know what they want or need, and may not have stable information needs.

Another is a sort of inertia—in the absence of clear design, it is easier to simply copy data from a data source into a data warehouse structure and leave it at that. The author once worked with a global consumer marketer that maintained two data warehouses: one fed exclusively by its SAP ERP system, and the other fed exclusively by its Oracle CRM system. Users who needed data from both systems downloaded summary data and performed the consolidation in spreadsheets.

A third reason is a phenomenon best described as data fetishism, a belief in the magical powers of data, where more data is always better than less data or no data. The problem with this sort of thinking is that data is not a commodity, like crude oil or pork bellies, any unit of which is substitutable for any other unit. To the contrary, data is always particular to a specific event or set of events; a piece of data either answers a question or it doesn’t. Petabytes of data are worthless if they do not answer a question.

Cheap storage also encourages organizations to “squirrel” away data whose value is unclear. The cost per terabyte of disk storage has declined precipitously in the past decade, continuing a long-term decline in all computing costs. But while storage is cheap, it is not free; and while the cost of physical storage is declining, the costs of data governance, management, and security are not.

The term data warehouse is a metaphor borrowed from logistics, where the purpose of a warehouse is to store inventory. Imagine a warehouse for a retail chain where half the goods are unwanted, while store managers scramble to avoid stockouts by procuring the goods they need through other channels. The warehouse metaphor is doubly ironic when you consider that for the past 20 years and more, enterprises have gone to great lengths to reduce or eliminate inventories through lean manufacturing and just-in-time logistics.

What is the data warehousing equivalent of lean manufacturing?

First, do not acquire data unless there is a clear business need for the information it carries. In practical terms, “business need” means that a functional manager with a budget is willing to pay for the data. Stop acquiring data when the business need ends.

Second, build metrics into products, processes, and programs from inception. Do not create performance metrics after the fact; design them into every business entity, and include the cost of performance metrics in product and program financials.
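
As a hedged sketch of what designed-in metrics can look like, the Python snippet below instruments a hypothetical business process to emit a structured performance event when it completes. The process name and event fields are illustrative, not a standard; the point is that the measurement is written alongside the process itself, not bolted on later.

import json
import time
import uuid

def emit_metric_event(process, outcome, duration_ms):
    """Write a structured performance event. In production, this record
    would flow to a log pipeline or message queue rather than stdout."""
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "process": process,
        "outcome": outcome,
        "duration_ms": duration_ms,
    }
    print(json.dumps(event))

# The measurement is designed into the process, not added after the fact.
start = time.time()
# ... the business process itself would execute here ...
emit_metric_event(
    process="loan_application_review",   # hypothetical process name
    outcome="approved",
    duration_ms=round((time.time() - start) * 1000, 1),
)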

Third, align data presentation and user personas. Typically, early adopters for a particular set of data are expert users who can work with messy and granular data, developing insights on behalf of business stakeholders. If and when business requirements stabilize, the analytics developed by experts can be productionized and made accessible to users who prefer to work with simpler tools.

Fourth, do not “clean” data; data cleansing tools do not make data more accurate; they simply make it appear more accurate by removing anomalies. Anomalies, however, have information value; Alan Turing and his colleagues at Bletchley Park used them to break the Enigma cipher. If a data source systematically produces erroneous data, fix the data source.

Finally, do not make the data warehouse an end in itself. At all times, the goal should be delivering insight; development initiatives should organize around specific projects to deliver insight to specific individuals, teams, functions, or applications. It may be possible to identify common data consolidation needs across multiple end user applications; when that is the case, a data warehouse can serve as an omnibus platform across these applications. But if it is difficult to define such commonalities, do not let the data warehouse idea get in the way of delivering insight.

Platforms and Tools

Unless your organization is a startup, you have a legacy tools environment—existing investments in tools to support various elements of the analytics value chain. Your organization’s past investments in software and hardware are a sunk cost; in many cases, it is more cost-effective to retain existing tools than it is to replace them. We are not suggesting that you toss out existing tools if they still meet your needs.

On the margin, however, there are things you can do to profit from disruption. First, assess what you are actually using and match this to your licensing; for incremental expansion, define needs rigorously. Second, build a credible open source alternative; even if you don’t use the open source option heavily, simply having it gives you negotiating leverage with commercial software vendors. Finally, leverage elastic provisioning—in the cloud, or through on-premises virtualization.

Assess Software Licensing and Use

Well-defined requirements are essential in a disrupted world because the conventions we use to anchor decisions no longer work. Industry leaders stumble; outsiders bring new capabilities to market; established experts struggle to adapt. Organizations that know what they actually need and what they don’t need thrive in this environment.

“Define your needs” seems like obvious guidance, but it is surprising how often one encounters analytics managers who have only a cursory understanding of how their team uses tools. Formal assessment usually reveals that people use only a fraction of the functionality embedded in commercial software tools. Now, more than ever, you need to take a cold, hard look at your commercially licensed software and how it is used.

Open source software disrupts commercial software by delivering “good enough” functionality under a services-based business model. Commercial software vendors point out that their software products have more features than open source software. This is accurate, but misleading. Features only add value when your organization actually uses them; otherwise, they simply add cost.

If you do not have well-defined requirements, software selection will gravitate to the products with the most features, the best marketing, the best analyst relations, or all three. Rather than selecting software based on which one has the most features, choose the lowest-cost product that satisfies all of your organization’s demonstrated needs.

A side benefit of such an assessment: your organization is almost certainly overlicensed. The software industry focuses on underlicensing and pirated software, but unless your organization has actively managed software licensing, the odds are that a sizeable share of your total software spending goes to shelfware.

How do we know that you are overlicensed? Because commercial software vendors fear that elastic “pay for what you use” pricing will cannibalize their existing software licensing models. That fear is justified; while vendors like Oracle, SAP, IBM, and Microsoft all report double-digit growth in cloud-based revenue, that growth fails to offset the decline in conventional software licensing revenue.

If software vendors get less revenue from elastic “pay for what you use” pricing, it follows that the standard commercial licensing model makes you pay for what you do not use. This makes some sense when you consider commercial licensing terms, which require the buyer to pay in advance for the right to use software whose business value is unknown.

Commercial vendors warrant that software does what they say it does, and that it works under specified conditions. However, commercial vendors sell software based on business value and not on features and functions. They do not warrant these claims of business value; the buyer assumes this risk.

When your organization has a well-defined set of requirements for business analytics software, you are in a much better position to evaluate the claims of commercial vendors against one another and against an open source stack.

Build an Open Source Stack

Across the software economy, the open source business model is undermining the commercial software model. The research firm IDC writes6:

Open source products offer functionality that is competitive with proprietary products and applies downward pricing pressure on these products. Growth in the adoption of open source technologies will force an acceleration toward a services-based business model for many vendors.

There are open source software alternatives for every component in the analytics value chain:

  • Hadoop and its ecosystem offer comprehensive tooling for data acquisition and management.

  • Open source SQL engines—such as Spark SQL, Impala, Drill, and Presto—compete successfully against data warehouse appliances for interactive queries.

  • Machine learning engines like H2O and Spark MLlib provide scalable machine learning options. R and Python are excellent general-purpose platforms for analytics.

  • JasperSoft, Pentaho, and Talend all deliver end-to-end capabilities for business analytics.

To build and deliver an open source stack, follow these steps:

  • Establish an analytics innovations team.

  • Assess your organization’s current open source software usage.

  • Evaluate open source components through live testing and pilot projects.

  • Define a support strategy.

An innovations team is a core group of individuals whose primary role is to evaluate innovative technologies and bring them into the organization. Members of the team may be full-time, or may be temporarily assigned from other roles; however, the team’s impact and time to value depend on its leadership, personnel, and resources.

Once your team is established, take stock of your organization’s current use of open source software. The results of such an assessment may surprise you. Many business leaders simply do not know the extent of open source software use in their organizations, because there is often no central control over acquisition and use of open source software.

The open source software your organization has used successfully forms the foundation of your stack. From there, your team can define additional components to fill functional gaps. Nobody can define the perfect open source stack for your organization; it depends on your needs, your previous experience, and the results of your ongoing evaluation.

Defining a support strategy is essential for your open source stack. There are two aspects to this problem:

  • Support for your end users

  • Support for your help desk

One way to mitigate the need for support is to choose supported open source distributions. Cloudera, Hortonworks, and MapR offer commercially supported bundles built on Apache Hadoop; Microsoft and Oracle offer supported R distributions; JasperSoft, Pentaho, and Talend all offer commercially supported versions of their products.

However, there is no single open source or open core product that comprehensively supports the entire analytics value chain. Consequently, your help desk plays a key role in diagnosing issues and directing them to the appropriate source for support.

Your open source stack serves as a baseline architecture. This does not mean that you will never use commercially licensed software; it means that you will use commercial software only when the open source stack lacks features and functions that are needed to solve a specific business problem.

A credible open source stack also creates negotiating leverage with commercial software vendors, who deeply discount their software when competing with an open source alternative. Consider the full software lifecycle when evaluating these discounts; some vendors simply discount the first-year subscription fee, or discount a perpetual license while increasing maintenance fees.

Leverage Elastic Provisioning

Once you have defined an open source software stack, you need to provide computing infrastructure, such as servers and storage. Choose an elastic solution: public cloud, virtual private cloud, private cloud, hybrid cloud, or on-premises data center virtualization and cluster management tools.

Elastic provisioning means that the computing resources available to users expand and contract based on actual workload. For example, if a user needs to run complex analysis to support a prospective merger, the computing resources expand accordingly; when that project is completed, the user releases the resources for use by other applications. Self-service provisioning means that end users can requisition additional resources without IT support or intervention.

Three key principles should govern your organization’s approach to elastic computing:

  • The computing infrastructure that you own and manage for business analytics should operate at a high level of utilization.

  • Computing resources for end users should be delivered through self-service elastic provisioning.

  • Computing costs (internal or external) should be metered and charged to the consuming application.

The breakeven capacity threshold for your organization depends on the efficiency of your data center team and your skill in procurement. Cloud data centers operate at about 65% of capacity; the average utilization7 of on-premises servers is in the range of 12-18%, so most organizations have a lot of room for improvement.

As noted in Chapter Seven, average infrastructure utilization is low because organizations provision to support peak demand; during periods of slack demand, this computing capacity sits idle. Imagine a company with analytics teams in New York and Singapore, each with a dedicated server. Each team uses its server actively during local business hours, but each server sits idle outside of business hours. This company can double its server utilization and cut computing costs in half if the two teams can share computing infrastructure.

To optimize provisioning, segment your analytic workloads into three categories:

  • Baseline workload is predictable at a certain constant level.

  • Peak workload is predictable at a higher level than the baseline for short periods. Month-end reporting, for example, typically creates a short-term spike in demand.

  • Surge workload is an unpredictable spike in demand above the baseline level, as when an analyst trains a Deep Learning model.

Under this framework, provision your baseline workload with owned and managed infrastructure that you can operate at a high percentage of capacity. For peak workload, use reserved instances; for surge workload, use on-demand or spot instances. Of course, you won’t want to move data back and forth to the cloud, so group together workloads that share common data.
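
A back-of-the-envelope cost model helps reason about the split. The sketch below compares owning enough capacity for the worst case against the blended approach; all rates and workload figures are hypothetical placeholders, so substitute your own.

# Hypothetical hourly rates; substitute your organization's actual figures.
OWNED_RATE = 0.30      # fully loaded cost per server-hour of owned capacity
RESERVED_RATE = 0.40   # cloud reserved-instance rate per server-hour
ON_DEMAND_RATE = 0.90  # cloud on-demand or spot rate per server-hour

HOURS_PER_MONTH = 730

# Hypothetical workload profile, measured in concurrent servers.
baseline = 10                       # constant, predictable load
peak_extra = 15                     # month-end spike above the baseline
surge_extra, surge_hours = 30, 25   # unpredictable spikes above the baseline

# Option 1: own enough capacity to cover the worst case at all times.
worst_case = baseline + peak_extra + surge_extra
own_everything = worst_case * HOURS_PER_MONTH * OWNED_RATE

# Option 2: own the baseline, reserve for the predictable peak, and burst
# to on-demand for surges. (Reserved capacity is committed, and billed,
# for the full month in this simplified model.)
blended = (
    baseline * HOURS_PER_MONTH * OWNED_RATE
    + peak_extra * HOURS_PER_MONTH * RESERVED_RATE
    + surge_extra * surge_hours * ON_DEMAND_RATE
)

print(f"Provision for the worst case: ${own_everything:,.0f} per month")
print(f"Blended elastic approach:     ${blended:,.0f} per month")

Even with generous assumptions for owned capacity, paying on-demand rates only for the hours of surge demand usually beats owning servers that sit idle most of the month.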

It’s entirely possible that your analysis will show a very low baseline workload for analytics. That’s typical; workloads for an analytics platform tend to be inherently variable and difficult to predict, because much of the demand is ad hoc and project oriented. Nevertheless, computing and storage must be sufficient for high performance on large-scale problems.

If your workloads are mostly unpredictable, or if your organization lacks the skills to manage computing infrastructure effectively, put everything into the cloud.

Executives tend to raise three objections to the use of off-premises cloud computing: out-of-pocket costs, concerns about outages, and security concerns.

Cost concerns about the cloud are largely an illusion. Cloud computing costs are measurable and tangible, while internal computing costs are often hidden away in depreciation charges, salaries, floor space, and electric bills. Even if the organization is very good at measuring costs and charging them back to users, it still pays for unused capacity. With their purchasing efficiencies and skilled data center management, cloud data centers achieve economies of scale that most organizations can only dream about.

While anxiety about data center outages is real, there is no evidence that cloud data centers are more vulnerable to outages than on-premises data centers. Any data center is subject to outages for any number of reasons: natural disasters, cyber attack, or human error. As with cost concerns, there may be an illusory sense of security in an on-premises facility: of course, our people won’t mess up and bring the system down. Keep in mind that many organizations run mission-critical applications in the cloud today—and business analytics applications are rarely mission critical.

Security concerns are similar to concerns about outages: the anxiety is real, but there is no evidence that cloud data centers are less secure than on-premises data centers. (If anything, the opposite is true: in the past two years, 19 of the top 20 data breaches hit on-premises data centers; the one breach of cloud data is fully attributable to human error, and not related to the physical location of the data.)

In any case, there are work practices analysts can implement to minimize security risks in the cloud. These include avoiding the use of Personally Identifiable Information (PII), sensitive information about individuals that is rarely needed for analysis; removing identifiers from table names and column headers; and applying one-way hashing to pseudonymize identifiers before data is transferred to the cloud.
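
As a minimal sketch of that last practice, the Python snippet below replaces a direct identifier with a keyed one-way hash before the data leaves the premises. The field names are hypothetical; the essential points are that the key is generated and retained on-premises, and that the hash is deterministic, so records can still be joined on the pseudonymous key in the cloud.

import hashlib
import hmac
import secrets

# Generate once and keep on-premises; never ship the key with the data.
# A keyed hash resists dictionary attacks on low-entropy identifiers.
SECRET_KEY = secrets.token_bytes(32)

def pseudonymize(identifier):
    """Replace a direct identifier with a keyed one-way hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical records carrying a direct identifier (an account number).
records = [
    {"account_id": "1002-4437", "balance": 1250.00},
    {"account_id": "1002-9981", "balance": 87.50},
]

# Pseudonymize before transfer; the cloud never sees the raw identifier,
# but analysts can still join and aggregate on the hashed key.
cloud_safe = [
    {"account_key": pseudonymize(r["account_id"]), "balance": r["balance"]}
    for r in records
]
print(cloud_safe)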

Elastic self-service provisioning with metered costs should be the standard of service provided to users. This is the service standard delivered by cloud providers; if your organization cannot stomach using off-premises cloud platforms, your goal should be to deliver the same level of service through data center virtualization. This is an essential requirement for an analytics platform.

Closing Thoughts

Your perspective on disruption depends on where you stand.

  • If your organization buys and uses analytic software and services, disruptive innovation is an opportunity for you to improve the effectiveness of your investments in analytics and to reduce costs. Avoid getting locked into vendors who are ripe for disruption.

  • If your organization seeks to disrupt others with innovative products and services, the open source projects described in previous chapters offer an excellent foundation.

  • If your organization has an established franchise providing business analytics software and services, watch your back; someone out there wants to eat your lunch.

To profit from disruptive innovation, do the following things:

  • Organize Around Client Needs for Data-Driven Insight. Stop thinking about analytics as a single problem that some big vendor can solve for you. Your clients have diverse needs for data-driven insight; tailor solutions accordingly.

  • Carefully Define the Role of the CAO. In most cases, it is impractical to expect any single executive to “own” the complete analytics value chain. Assign the CAO accountability to drive data-driven insight in the organization, then carefully balance roles and responsibilities of the CAO, CIO, and functional executives.

  • Align Decision-Making Authority Over Analytics Platforms with Responsibility for Costs. Avoid scenarios where users choose platforms without cost accountability, and costs are “someone else’s problem”.

  • Hire the Right People. For analysts, place less emphasis on credentials and theoretical knowledge, and more emphasis on analytic accomplishments and collaboration skills.

  • Practice Agile Business Intelligence. Deploy specialists to functional teams and encourage close collaboration. Invest in self-service tools if there is a demand, but don’t assume that your need for analytic specialists will go away.

  • Practice Agile Machine Learning. Focus on repeatable processes, reduced cycle time, rapid deployment, and continuous improvement of the production model. Invest in platforms that maximize the productivity of your high-value data scientists.

  • Develop a Lean Data Strategy. Stop thinking of your data warehouse as a strategic investment; it’s not. Align data collection with needs for data-driven insight. Do not collect data for which there are no defined users and stakeholders.

  • Assess Your Commercial Software Licensing and Usage. Take a cold, hard look at software licensing in your organization; you are almost certainly overlicensed today. Challenge those who insist they must use high-end commercial software.

  • Build an Open Source Stack. Define, build, deliver, and support an open source stack for business analytics. Make the open source stack your baseline system. Use commercial software only when your organization’s documented requirements can’t be met with your open source stack.

  • Leverage Elastic Provisioning. Analytic workloads tend to be ad hoc and unpredictable, which makes them excellent candidates for elastic provisioning—in the cloud or on-premises.

The payoff for taking action: more effective analysts, more data-driven insight, better decisions, and lower total cost of ownership for your analytics infrastructure.

The penalty for inertia won’t be visible right away. Your business analytics software vendors will continue to send you renewal invoices. The cost of decisions not taken, of data-driven insights not produced, will never be measured. Life will go on.

At some point, however, someone will ask: “why are you here?”

Some people in your organization may object to the measures we outline in this chapter. They may even call them disruptive.

If they do, smile. You’re on the right track.
