Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER EIGHT

THE ARCHITECTURE OF ANALYTICS AND BIG DATA

ALIGNING A ROBUST TECHNICAL ENVIRONMENT WITH BUSINESS STRATEGIES

Over the last decade or so, it has become technically and economically feasible to capture and store huge quantities of data. The numbers are hard to absorb for all but the geekiest, as data volumes have grown from megabytes to gigabytes (billions of bytes) to terabytes (trillions of bytes) to petabytes (quadrillions of bytes). While low-end personal computers and servers lack the power and capacity to handle the volumes of data required for analytical applications, high-end 64-bit processors, specialty “data appliances,” and cloud-based processing options can quickly churn through virtually unfathomable amounts of data.

However, while organizations have more data than ever at their disposal, they rarely know what to do with it. The data in their systems is often like the box of photos you keep in your attic, waiting for the “someday” when you impose meaning on the chaos. IDC estimated that only 0.5 percent of all data is ever analyzed, and we would guess that the amount of data is growing faster than the amount of it that’s analyzed.

Further, the unpalatable truth is that most IT departments strain to meet minimal service demands and invest inordinate resources in the ongoing support and maintenance of basic transactional capabilities. Unlike the analytical vanguard, even companies with sound transaction systems struggle with relatively prosaic issues such as data cleansing when they try to integrate data into analytical applications. In short, while improvements in technology’s ability to store data can be astonishing, most organizations’ ability to manage, analyze, and apply data has not kept pace.

Companies that compete on analytics haven’t solved all these problems entirely, but they are a lot better off than their competition. In this chapter, we identify the technology, data, and governance processes needed for analytical competition. We also lay out the components that make up the core of any organization’s analytical architecture and forecast how these elements are likely to evolve in the future.

The Architecture of Analytical Technology

While business users of analytics often play an important role, companies have historically delegated the management of information technology for analytics and other applications to an information technology (IT) organization. For example, by capturing proprietary data or embedding proprietary analytics into business processes, the IT department helps develop and sustain an organization’s competitive advantage.

But it is important to understand that this work cannot be delegated to IT alone. Most “small data” can be easily analyzed on a personal computer, and even the largest dataset can be sent to Amazon Web Services’ or Microsoft Azure’s clouds and analyzed by anyone with the requisite knowledge and a credit card. This can lead to uncontrolled proliferation of “versions of the truth,” but it can also lead to insightful answers to business problems. Determining how to encourage the latter and prevent the former is a critical task in any analytical architecture.

Even when IT help is required, determining the technical capabilities needed for analytical competition requires a close collaboration between IT organizations and business managers. This is a principle that companies like Progressive Insurance understand fully. Glenn Renwick, formerly both CEO of Progressive Insurance and head of IT there, understands how critical it is to align IT with business strategy: “Here at Progressive we have technology leaders working arm in arm with business leaders who view their job as solving business problems. And we have business leaders who are held accountable for understanding the role of technology in their business. Our business plan and IT are inextricably linked because their job objectives are.”¹

Although Renwick has just retired, Progressive has a long history of IT/business alignment and focus on analytics, and we’re sure they will continue. We found this same collaborative orientation at many analytical competitors.

Analytical competitors also establish a set of guiding principles to ensure that their technology investments reflect corporate priorities. The principles may include statements such as:

We will be an industry leader in adopting new technologies for big data and machine learning.
The risk associated with conflicting information sources must be reduced.
Applications should be integrated, since analytics increasingly draw data that crosses organizational boundaries.
Analytics must be enabled as part of the organization’s strategy and distinctive capability.

Responsibility for getting the data, technology, and processes right for analytics across the enterprise is the job of the IT architect (or the chief data or technology officer, if there is one). This executive (working closely with the chief information officer) must determine how the components of the IT infrastructure (hardware, software, and networks, and external cloud resources) will work together to provide the data, technology, and support needed by the business. This task is easier for digital companies, such as Netflix or eBay, that can create their IT environment with analytical competition in mind from the outset. In large established organizations however, the IT infrastructure can sometimes appear to have been constructed in a series of weekend handyman jobs. It does the job it was designed to do but is apt to create problems whenever it is applied to another purpose.

To make sure the IT environment fully addresses an organization’s needs at each stage of analytical competition, companies must incorporate analytics and big data technologies into their overall IT architecture. (Refer to the box “Data and IT Capability by Stage of Analytical Competition.”)

DATA AND IT CAPABILITY BY STAGE OF ANALYTICAL COMPETITION

Established companies typically follow an evolutionary process to develop their IT analytical capabilities:

Stage 1. The organization is plagued by missing or poor-quality data, multiple definitions of its data, and poorly integrated systems.
Stage 2. The organization collects transaction data efficiently but often lacks the right data for better decision making. Some successful analytical applications or pilot programs exist and they may even use some sophisticated statistics or technologies. But these are independent initiatives sponsored by functional executives.
Stage 3. The organization has a proliferation of business intelligence and analytics tools and data repositories, but some non-transaction data remains unintegrated, nonstandardized, and inaccessible. IT and data architecture are updated to support enterprise-wide analytics.
Stage 4. The organization has high-quality data, an enterprise-wide analytical plan, IT processes and governance principles, and some embedded or automated analytics. It is also working to some degree on big, less structured data.
Stage 5. The organization has a full-fledged analytics architecture that is enterprise-wide, automated and integrated into processes, and highly sophisticated. The company makes effective and integrated use of big and small data from many internal and external sources, including highly unstructured data. The company begins to explore and use cognitive technologies and autonomous analytics.

We’re using the term analytics and big data in this context to encompass not only the analysis itself—the use of large and small data to analyze, forecast, predict, optimize, and so on—but also the processes and technologies used for collecting, structuring, managing, and reporting decision-oriented data. The analytics and big data architecture (a subset of the overall IT architecture) is an umbrella term for an enterprise-wide set of systems, applications, and governance processes that enable sophisticated analytics by allowing data, content, and analyses to flow to those who need it, when they need it. (Refer to the box “Signposts of Effective IT for Analytical Competition.”)

SIGNPOSTS OF EFFECTIVE IT FOR ANALYTICAL COMPETITION

Analysts have direct, nearly instantaneous access to data, some of it real-time.
Information workers spend their time analyzing data and understanding its implications rather than collecting and formatting data.
Managers focus on improving processes and business performance, not culling data from laptops, reports, and transaction systems.
Analytics and data are incorporated into the company’s products and services.
Managers never argue over whose numbers are accurate.
Data is managed from an enterprise-wide perspective throughout its life cycle, from its initial creation to archiving or destruction.
A hypothesis can be quickly analyzed and tested without a lot of manual behind-the-scenes preparation beforehand—and some analytical models are created without human hypotheses at all (i.e., with machine learning).
Data is increasingly analyzed at the “edge” of the organization without needing to send it to a centralized repository.
Both the supply and demand sides of the business rely on forecasts that are aligned and have been developed using a consistent set of data.
High-volume, mission-critical decision-making processes are highly automated and integrated.
Data is routinely and automatically shared between the company and its customers and suppliers.
Reports and analyses seamlessly integrate and synthesize information from many sources, both internal and external.
Rather than have data warehouse or analytics initiatives, companies manage data and analytics as strategic corporate resources in all business initiatives.

“Those who need it” will include data scientists, statisticians of varying skills, analysts, information workers, functional heads, and top management. The analytics architecture must be able to quickly provide users with reliable, accurate information and help them make decisions of widely varying complexity. It also must make information available through a variety of distribution channels, including traditional reports, ad hoc analysis tools, corporate dashboards, spreadsheets, emails, and text message alerts—and even products and services built around data and analytics. This task is often daunting: Amazon, for example, spent more than ten years and over $1 billion building, organizing, and protecting its data warehouses.²

Complying with legal and regulatory reporting requirements is another activity that depends on a robust analytical architecture. The Sarbanes-Oxley Act of 2002, for example, requires executives, auditors, and other users of corporate data to demonstrate that their decisions are based on trustworthy, meaningful, authoritative, and accurate data. It also requires them to attest that the data provides a clear picture of the business, major trends, risks, and opportunities. The Dodd-Frank Act, a regulatory framework for financial services firms enacted in 2010, has equally rigorous requirements for that specific industry (although there are doubts that it will continue in its present form). Health care organizations have their own set of reporting requirements.

Conceptually, it’s useful to break the analytics and big data architecture into its six elements (refer to figure 8-1):

Data management that defines how the right data is acquired and managed
Transformation tools and processes that describe how the data is extracted, cleaned, structured, transmitted, and loaded to “populate” databases and repositories
Repositories that organize data and metadata (information about the data) and store it for use
Analytical tools and applications used for analysis
Data visualization tools and applications that address how information workers and non-IT analysts will access, display, visualize, and manipulate data
Deployment processes that determine how important administrative activities such as security, error handling, “auditability,” archiving, and privacy are addressed

We’ll look at each element in turn, with particular attention to data since it drives all the other architectural decisions.

FIGURE 8-1

Analytics and big data architecture

Data Management

The goal of a well-designed data management strategy is to ensure that the organization has the right information and uses it appropriately. Large companies invest millions of dollars in systems that snatch data from every conceivable source. Systems for enterprise resource planning, customer relationship management, and point-of-sale transactions, among others, ensure that no transaction or exchange occurs without leaving a mark. Many organizations also purchase externally gathered data from syndicated providers such as IRI and ACNielsen in consumer products and Quintiles IMS in pharmaceuticals. Additionally, data management strategies must determine how to handle big data from corporate websites, social media, internet clickstreams, Internet of Things data, and various other types of external data.

In this environment, data overload can be a real problem for time-stressed managers and professionals. But the greatest data challenge facing companies is “dirty” data: information that is inconsistent, fragmented, and out of context. Even the best companies often struggle to address their data issues. We found that companies that compete on analytics devote extraordinary attention to data management processes and governance. Capital One, for example, estimates that 25 percent of its IT organization works on data issues—an unusually high percentage compared with other firms.

There’s a significant payoff for those who invest the effort to master data management. For example, GE addressed the problem of multiple overlapping sources of supplier data within the company. Many business units and functions had their own versions of supplier databases across hundreds of transaction systems, and the same suppliers were represented multiple times, often in slightly different ways. As a result, GE couldn’t perform basic analytics to determine which suppliers sold to multiple business units, which suppliers were also customers, and how much overall business it did with a supplier. So it embarked on an effort to use new machine learning tools to curate and integrate the supplier data. After several months, it had created an integrated supplier database, and it could start pressing the most active suppliers for volume discounts. Overall, GE estimates that the work led to $80 million in benefits to the company in its first year, and it expects substantially higher benefits in the future. GE is also working on customer data and parts data using the same approach.

To achieve the benefits of analytical competition, IT and business experts must tackle their data issues by answering five questions:

Data relevance: What data is needed to compete on analytics?
Data sourcing: Where can this data be obtained?
Data quantity: How much data is needed?
Data quality: How can the data be made more accurate and valuable for analysis?
Data governance: What rules and processes are needed to manage data from its creation through its retirement?

What Data Is Needed to Compete on Analytics?

The question behind this question is, what data is most valuable for competitive differentiation and business performance? To answer, executives must have a clear understanding of the organization’s distinctive capability, the activities that support that capability, and the relationship between an organization’s strategic and operational metrics and business performance. Many of the companies described in this book have demonstrated the creative insight needed to make those connections.

But ensuring that analysts have access to the right data can be difficult. Sometimes a new metric is needed: the advent of credit scores made the mortgage-lending business more efficient by replacing qualitative assessments of consumer creditworthiness with a single, comparative metric. But not everything is readily reducible to a number. An employee’s performance rating doesn’t give as complete a picture of his work over a year as a manager’s written assessment. The situation is complicated when business and IT people blame each other when the wrong data is collected or the right data is not available. Studies repeatedly show that IT executives believe business managers do not understand what data they need.³ And surveys of business managers reflect their belief that IT executives lack the business acumen to make meaningful data available. While there is no easy solution to this problem, the beginning of the solution is for business leaders and IT managers to pledge to work together on this question. This problem has been eased somewhat in companies like Intel and Procter & Gamble, where quantitative analysts work closely alongside business leaders. Without such cooperation, an organization’s ability to gather the data it needs to compete analytically is doomed.

A related issue requiring business and IT collaboration is defining relationships among the data used in analysis. Considerable business expertise is required to help IT understand the potential relationships in the data for optimum organization. The importance of this activity can be seen in an example involving health care customers. From an insurance company’s perspective, they have many different customers—their corporate customers that contract for policies on behalf of their employees, individual subscribers, and members of the subscribers’ families. Each individual has a medical history and may have any number of medical conditions or diseases that require treatment. The insurance company and each person covered by a policy also have relationships with a variety of service providers such as hospitals, HMOs, and doctors. A doctor may be a general practitioner or a specialist. Some doctors will work with some hospitals or insurers but not others. Individuals can have insurance from multiple providers, including the government, that need to be coordinated. Without insight into the nature of these relationships, the data’s usefulness for analytics is extremely limited.

Where Can This Data Be Obtained?

Data for analytics and business intelligence originates from many places, but the crucial point is that it needs to be managed through an enterprise-wide infrastructure. Only by this means will it be streamlined, consistent, and scalable throughout the organization. Having common applications and data across the enterprise is critical because it helps yield a “consistent version of the truth,” an essential goal for everyone concerned with analytics. While it is possible to create such an environment by ex post facto integration and the transformation of data from many systems, companies are well advised to update and integrate their processes and transaction systems before embarking on this task.

For internal information, the organization’s enterprise systems are a logical starting point. For example, an organization wishing to optimize its supply chain might begin with a demand-planning application. However, it can be difficult to analyze data from transaction systems (like inventory control) because it isn’t defined or framed correctly for management decisions. Enterprise systems—integrated software applications that automate, connect, and manage information flows for business processes such as order fulfillment—often help companies move along the path toward analytical competition: they provide consistent, accurate, and timely data for such tasks as financial reporting and supply chain optimization. Vendors increasingly are embedding analytical capabilities into their enterprise systems so that users can develop sales forecasts and model alternative solutions to business problems. However, the data from such systems usually isn’t very distinctive to a particular firm, so it must be combined with other types of data to have competitive differentiation.

In addition to corporate systems, an organization’s personal computers and servers are loaded with data. Databases, spreadsheets, presentations, and reports are all sources of data. Sometimes these sources are stored in a common knowledge management application, but they are often not available across the entire organization.

Internal data also increasingly means data from Internet of Things (IoT) sensors and devices at the “edge” of the organization—in the oilfield drilling equipment, the retail point of sale device, or the aircraft engine, for example. The traditional model was to send all this data to a centralized repository to store and analyze it. But an alternative paradigm of edge analytics is growing in currency. The rapid growth of the IoT and other edge devices that generate data means that it is often unfeasible to send it all to headquarters or even to the cloud for analysis. In an oilfield, for example, operational data from drilling equipment (including drill-bit RPMs, cutting forces, vibration, temperature, and oil and water flows) can be used in real time to change drilling strategies. It’s often not feasible to send all this data to a central repository. Some drilling operations already use microprocessor-based analytics to determine drilling strategies in real time. The IoT will make edge-based analytics approaches much more common in the future.

There has been an explosion of external data over the past decade, much of it coming from the internet, social media, and external data providers. There has also long been the opportunity to purchase data from firms that provide financial and market information, consumer credit data, and market measurement. Governments at all levels are some of the biggest information providers (more so since the “Open Data” movement over the past decade), and company websites to which customers and suppliers contribute are another powerful resource. Less structured data can also come from such sources as email, voice applications, images (maps and photos available through the internet), photographs (of people, products, and of course cats), and biometrics (fingerprints and iris identification). The further the data type is from standard numbers and letters, however, the harder it is to integrate with other data and analyze—although deep learning technologies are making image recognition much faster and more accurate.

It can be difficult and expensive to capture some highly valuable data. (In some cases, it might even be illegal—for example, sensitive customer information or competitor intelligence about new product plans or pricing strategies.) Analytical competitors adopt innovative approaches to gain permission to collect the data they need. As we described in chapter 3, Progressive’s Snapshot program offers discounts to customers who agree to install a device that collects data about their driving behavior. Former CEO Peter Lewis sees this capability as the key to more accurate pricing and capturing the most valuable customers: “It’s about being able to charge them for whatever happens instead of what they [customers] say is happening. So what will happen? We’ll get all the people who hardly ever drive, and our competitors will get stuck with the higher risks.”⁴ Progressive has now gathered over 10 billion miles of customer driving data, and it has become the best source of insight about what insurance will cost the company.

How Much Data Is Needed?

In addition to gathering the right data, companies need to collect a lot of it in order to distill trends and predict customer behavior. What’s “a lot”? In 2007, the largest data warehouse in the world was Walmart’s, with about 600 terabytes. At roughly the same time, the size of the US Library of Congress’s print collection was roughly 20 terabytes.⁵

Fortunately, the technology and techniques for mining and managing large volumes of data are making enormous strides. The largest databases are no longer enterprise warehouses, but Hadoop clusters storing data across multiple commodity servers. The 600 terabytes in Walmart’s warehouse in 2007 grew a hundredfold by 2017 to 60 petabytes. Digital firms manage even bigger data: Yahoo!’s 600 petabytes are spread across forty thousand Hadoop servers. That’s the equivalent of storing about 30 trillion pages. Yahoo! isn’t the perfect example of an analytical competitor anymore, but it’s likely that more successful firms like Google and Facebook have similar volumes in their data centers.

Two pitfalls must be balanced against this need for massive quantities of data. First, unless you are in the data business like the companies we’ve just described, it’s a good idea to resist the temptation to collect all possible data “just in case.” For one thing, if executives have to wade through digital mountains of irrelevant data, they’ll give up and stop using the tools at hand. “Never throwing away data,” which has been advocated by Amazon’s Jeff Bezos, can be done, but the costs outweigh the benefits for most companies. The fundamental issue comes back, again, to knowing what drives value in an organization; this understanding will prevent companies from collecting data indiscriminately.

A related second pitfall: companies should avoid collecting data that is easy to capture but not necessarily important. Many IT executives advocate this low-hanging-fruit approach because it relieves them of responsibility for determining what information is valuable to the business. For example, many companies fall into the trap of providing managers with data that is a by-product of transaction systems, since that is what is most readily available. Others analyze social media data simply because it’s possible, even when they don’t have any actions in mind when sentiment trends down or up a bit. Perhaps emerging technologies will someday eliminate the need to separate the wheat from the chaff. But until they do, applying intelligence to the process is necessary to avoid data overload.

How Can We Make Data More Valuable?

Quantity without quality is a recipe for failure. Executives are aware of the problem: in a survey of the challenges organizations face in developing a business intelligence capability, data quality was second only to budget constraints.⁶ Even analytical competitors struggle with data quality.

Organizations tend to store their data in hard-walled, functional silos. As a result, the data is generally a disorganized mess. For most organizations, differing definitions of key data elements such as customer or product add to the confusion. When Canadian Tire Corporation, for example, set out to create a structure for its data, it found that the company’s data warehouse could yield as many as six different numbers for inventory levels. Other data was not available at all, such as comparison sales figures for certain products sold in its 450-plus stores throughout Canada. Over several years, the company created a plan to collect new data that fit the company’s analytical needs.⁷

Several characteristics increase the value of data:

It is correct. While some analyses can get by with ballpark figures and others need precision to several decimal points, all must be informed by data that passes the credibility tests of the people reviewing it.
It is complete. The definition of complete will vary according to whether a company is selling cement, credit cards, season tickets, and so on, but completeness will always be closely tied to the organization’s distinctive capability.
It is current. Again, the definition of current may vary; for some business problems, such as a major medical emergency, data must be available instantly to deploy ambulances and emergency personnel in real time (also known as zero latency); for most other business decisions, such as a budget forecast, it just needs to be updated periodically—daily, weekly, or monthly.
It is consistent. In order to help decision makers end arguments over whose data is correct, standardization and common definitions must be applied to it. Eliminating redundant data reduces the chances of using inconsistent or out-of-date data.
It is in context. When data is enriched with metadata (usually defined as structured data about data), its meaning and how it should be used become clear.
It is controlled. In order to comply with business, legal, and regulatory requirements for safety, security, privacy, and “auditability,” it must be strictly overseen.
It is analyzed. Analytics are a primary means of adding value to data, and even creating products from and monetizing it.⁸ Insights are always more valuable than raw data, which is a primary theme of this book.

What Rules and Processes Are Needed to Manage the Data from Its Acquisition Through Its Retirement?

Each stage of the data management life cycle presents distinctive technical and management challenges that can have a significant impact on an organization’s ability to compete on analytics.⁹ Note that this is a traditional data management process; an organization seeking to create analytics “at the edge” will have to do highly abbreviated versions of these tasks.

Data acquisition. Creating or acquiring data is the first step. For internal information, IT managers should work closely with business process leaders. The goals include determining what data is needed and how to best integrate IT systems with business processes to capture good data at the source.
Data cleansing. Detecting and removing data that is out-of-date, incorrect, incomplete, or redundant is one of the most important, costly, and time-consuming activities in any business intelligence technology initiative. We estimate that between 25 percent and 30 percent of an analytics initiative typically goes toward initial data cleansing. IT’s role is to establish methods and systems to collect, organize, process, and maintain information, but data cleansing is the responsibility of everyone who generates or uses data. Data cleansing, integration, and curation can increasingly be aided by new tools including machine learning and crowdsourcing.¹⁰
Data organization and storage. Once data has been acquired and cleansed, processes to systematically extract, integrate, and synthesize it must be established. The data must then be put into the right repository and format so that it is ready to use (see the discussion of repositories later in the chapter). Some storage technologies require substantially more organization than others.
Data maintenance. After a repository is created and populated with data, managers must decide how and when the data will be updated. They must create procedures to ensure data privacy, security, and integrity (protection from corruption or loss via human error, software virus, or hardware crash). And policies and processes must also be developed to determine when and how data that is no longer needed will be saved, archived, or retired. Some analytical competitors have estimated that they spend $500,000 in ongoing maintenance for every $1 million spent on developing new analytics-oriented technical capabilities. We believe, however, that this cost is declining with newer technologies such as Hadoop and data lakes.

Once an organization has addressed data management issues, the next step is to determine the technologies and processes needed to capture, transform, and load data into a data warehouse, Hadoop cluster, or data lake.

Transformation Tools and Processes

Historically, for data to become usable by managers in a data warehouse, it had to first go through a process known in IT-speak as ETL, for extract, transform, and load. It had to be put into a relational format, which stores data in structured tables of rows and columns. Now, however, new storage technologies like Hadoop allow storage in virtually any data format. Data lakes may be based on Hadoop or other underlying technologies, and the concept formalizes the idea of storing data in its original format. These are particularly useful for storing data before the organization knows what it will do with it. To analyze data statistically, however, it must eventually be put in a more structured format—typically rows and columns. The task of putting data into this format, whether for a data warehouse or a statistics program, can be challenging for unstructured data.

While extracting data from its source and loading it into a repository are fairly straightforward tasks, cleaning and transforming data is a bigger issue. In order to make the data in a warehouse decision-ready, it is necessary to first clean and validate it using business rules that use data cleansing or scrubbing tools such as Trillium or Talend, which are also available from large vendors like IBM, Oracle, or SAS. For example, a simple rule might be to have a full nine-digit ZIP code for all US addresses. Transformation procedures define the business logic that maps data from its source to its destination. Both business and IT managers must expend significant effort in order to transform data into usable information. While automated tools from vendors such as Informatica Corporation, Ab Initio Software Corporation, and Ascential Software can ease this process, considerable manual effort is still required. Informatica’s former CEO Sohaib Abbasi estimates that “for every dollar spent on integration technology, around seven to eight dollars is spent on labor [for manual data coding].”¹¹

Transformation also entails standardizing data definitions to make certain that business concepts have consistent, comparable definitions across the organization. For example, a “customer” may be defined as a company in one system but as an individual placing an order in another. It also requires managers to decide what to do about data that is missing. Sometimes it is possible to fill in the blanks using inferred data or projections based on available data; at other times, it simply remains missing and can’t be used for analysis. These mundane but critical tasks require an ongoing effort, because new issues seem to constantly arise.

Some of these standardization and integration tasks can increasingly be done by automated machine learning systems. Companies such as Tamr (where Tom is an adviser) and Trifacta work with data to identify likely overlaps and redundancies. Tamr, for example, worked with GE on the example we described earlier in this chapter to create a single version of supplier data from what was originally many different overlapping sources across business units. The project was accomplished over a few months—much faster than with traditional, labor-intensive approaches. GE is now working with the same tools on consolidating customer and product data.

For unstructured big data, transformation is typically performed using open-source tools like Pig, Hive, and Python. These tools require the substantial coding abilities of data scientists, but may be more flexible than packaged transformation solutions.

Repositories

Organizations have several options for organizing and storing their analytical data:

Data warehouses are databases that contain integrated data from different sources and are regularly updated. They may contain, for example, time series (historical) data to facilitate the analysis of business performance over time. They may also contain prepackaged “data cubes” allowing easy—but limited—analysis by nonprofessional analysts. A data warehouse may be a module of an enterprise system or an independent database. Some companies also employ a staging database that is used to get data from many different sources ready for the data warehouse.
A data mart can refer to a separate repository or to a partitioned section of the overall data warehouse. Data marts are generally used to support a single business function or process and usually contain some predetermined analyses so that managers can independently slice and dice some data without having statistical expertise. Some companies that did not initially see the need for a separate data warehouse created a series of independent data marts or analytical models that directly tapped into source data. One large chemical firm, for example, had sixteen data marts. This approach is rarely used today, because it results in balkanization of data and creates maintenance problems for the IT department. Data marts, then, should be used only if the designers are confident that no broader set of data will ever be needed for analysis.
A metadata repository contains technical information and a data definition, including information about the source, how it is calculated, bibliographic information, and the unit of measurement. It may include information about data reliability, accuracy, and instructions on how the data should be applied. A common metadata repository used by all analytical applications is critical to ensure data consistency. Consolidating all the information needed for data cleansing into a single repository significantly reduces the time needed for maintenance.
Open-source distributed data frameworks like Hadoop and Spark (both distributed by the Apache Foundation) allow storage of data in any format and typically at substantially lower cost than a traditional warehouse or mart. However, they may lack some of the security and simultaneous user controls that an enterprise warehouse employs, and they often require a higher level of technical and programming expertise to use. One company, TrueCar, Inc., stores a lot of data (several petabytes) on vehicles for sale and their attributes and pricing. In converting its storage architecture, it did a comparison of costs between Hadoop and an enterprise data warehouse. It found that its previous cost for storing a gigabyte of data (including hardware, software, and support) for a month in a data warehouse was $19. Using Hadoop, TrueCar pays 23 cents a month per gigabyte for hardware, software, and support. That two-orders-of-magnitude cost differential has been appealing to many organizations. There can be performance improvements as well with these tools, although they tend to be less dramatic than the cost differential.
A data lake employs Apache Hadoop, Apache Spark, or some other technology (usually open source) to store data in its original format. The data is then structured as it is accessed in the lake and analyzed. It is a more formalized concept of the use of these open-source tools. Traditional data management vendors like Informatica, as well as startups like Podium Data, have begun to supply data lake management tools.

Once the data is organized and ready, it is time to determine the analytic technologies and applications needed.

Analytical Tools and Applications

Choosing the right software tools or applications for a given decision depends on several factors. The first task is to determine how thoroughly decision making should be embedded into business processes and operational systems. Should there be a human who reviews the data and analytics and makes a decision, or should the decision be automated and something that happens in the natural process workflow? With the rise of cognitive computing, or artificial intelligence, over the last decade, there are several technologies that can analyze the data, structure the workflow, reach into multiple computer systems, make decisions, take action, and even learn over time.¹² Some of these are analytical and statistics-based; others rely on previous technologies like rule engines, event-streaming technology, and process workflow support. We addressed this issue from a human perspective in chapter 7.

The next decision is whether to use a third-party application or create a custom solution. A growing number of functionally or industry-specific business applications, such as capital budgeting, mortgage pricing, and anti–money laundering models, now exist. These solutions are a big chunk of the business for analytics software companies like SAS. Enterprise systems vendors such as Oracle, SAP, and Microsoft are building more (and more sophisticated) analytical applications into their products. There is a strong economic argument for using such solutions. According to IDC, projects that implement a packaged analytical application yield a median ROI of 140 percent, while custom development using analytical tools yields a median ROI of 104 percent. The “make or buy” decision hinges on whether a packaged solution exists and whether the level of skill required exists within the organization.¹³ Some other research organizations have found even greater returns from analytics applications; Nucleus Research, for example, argued in 2014 that analytics projects yielded $13.01 for every dollar spent.¹⁴

But there are also many powerful tools for data analysis that allow organizations to develop their own analyses (see the boxes “Analytical Technologies” and “Equifax Evolves Its Analytics Architecture”). Major players such as SAS, IBM, and SAP offer product suites consisting of integrated tools and applications, as well as many industry- or function-specific solutions. Open-source tools R and RapidMiner have been the fastest-growing analytical packages over the past several years.¹⁵ Some tools are designed to slice and dice or to drill down to predetermined views of the data, while others are more statistically sophisticated. Some tools can accommodate a variety of data types, while others are more limited (to highly structured data or textual analysis, for example). Some tools extrapolate from historical data, while others are intended to seek out new trends or relationships. Some programming languages like Python are increasingly used for statistical analysis and allow a lot of flexibility while typically requiring more expertise and effort of the analyst.

ANALYTICAL TECHNOLOGIES

Executives in organizations that are planning to become analytical competitors should be familiar with the key categories of analytical software tools:

Spreadsheets such as Microsoft Excel are the most commonly used analytical tools because they are easy to use and reflect the mental models of the user. Managers and analysts use them for “the last mile” of analytics—the stage right before the data is presented in report or graphical form for decision makers. But too many users attempt to use spreadsheets for tasks for which they are ill suited, leading to errors or incorrect conclusions. Even when used properly, spreadsheets are prone to human error; more than 20 percent of spreadsheets have errors, and as many as 5 percent of all calculated cells are incorrect.¹⁶ To minimize these failings, managers have to insist on always starting with accurate, validated data and that spreadsheet developers have the proper skills and expertise to develop models.
Online analytical processors are generally known by their abbreviation, OLAP, and are used for semistructured decisions and analyses on relational data. While a relational database (or RDBMS)—in which data is stored in related tables—is a highly efficient way to organize data for transaction systems, it is not particularly efficient when it comes to analyzing array-based data (data that is arranged in cells like a spreadsheet), such as time series. OLAP tools are specifically designed for multidimensional, array-based problems. They organize data into “data cubes” to enable analysis across time, geography, product lines, and so on. Data cubes are simply collections of data in three variables or more that are prepackaged for reporting and analysis; they can be thought of as multidimensional spreadsheets. While spreadsheet programs like Excel have a maximum of three dimensions (down, across, and worksheet pages), OLAP models can have seven or more. As a result, they require specialized skills to develop, although they can be created by “power users” familiar with their capabilities. Unlike traditional spreadsheets, OLAP tools must deal with data proliferation, or the models quickly become unwieldy. SAP’s BusinessObjects and IBM’s Cognos are among the leading vendors in this category.
Data visualization. OLAP tools were once the primary way to create data visualizations and reports, but a newer generation of easier-to-use tools that can operate on an entire dataset (not just a data cube) have emerged and gained substantial popularity. Tableau and QlikView are the most popular tools in this category; older vendors like Microsoft, MicroStrategy, and SAS also compete in it.
Statistical or quantitative algorithms enable analytically sophisticated managers or statisticians to analyze data. The algorithms process quantitative data to arrive at an optimal target such as a price or a loan amount. In the 1970s, companies such as SAS and SPSS (now part of IBM) introduced packaged computer applications that made statistics much more accessible. Statistical algorithms also encompass predictive modeling applications, optimization, and simulations. SAS remains the proprietary analytics software market leader; R and RapidMiner have emerged as open-source market leaders.
Rule engines process a series of business rules that use conditional statements to address logical questions—for example, “If the applicant for a motorcycle insurance policy is male and under twenty-five, and does not either own his own home or have a graduate degree, do not issue a policy.” Rules engines can be part of a larger automated application or provide recommendations to users who need to make a particular type of decision. FICO, IBM’s Operational Decision Manager, and Pegasystems, Inc. are some of the major providers of rule engines for businesses.
Machine learning and other cognitive technologies that can learn from data over time have superseded rule engines somewhat in popularity. These include various technologies, including machine learning, neural networks, and deep learning (the latter a more complex form of the former, with more layers of explanatory variables), natural language processing and generation, and combinations of these like IBM’s Watson. They are both more complex to develop and less transparent to understand than rule-based systems, but the ability to learn from new data and improve their analytical performance over time is a powerful advantage.
Data mining tools (some of which use machine learning) draw on techniques ranging from straightforward arithmetic computation to artificial intelligence, statistics, decision trees, neural networks, and Bayesian network theory. Their objective is to identify patterns in complex and ill-defined data sets. Sprint and other wireless carriers, for example, use neural analytical technology to predict which customers are likely to switch wireless carriers and take their existing phone numbers with them. SAS and IBM offer both data and text mining capabilities and are major vendors in both categories; R and RapidMiner offer open-source alternatives.
Text mining tools can help managers quickly identify emerging trends in near-real time. Spiders, or data crawlers, which identify and count words and phrases on websites, are a simple example of text mining. Text mining tools can be invaluable in sniffing out new trends or relationships. For example, by monitoring technical-user blogs, a vendor can recognize that a new product has a defect within hours of being shipped instead of waiting for complaints to arrive from customers. Other text mining products can recognize references to people, places, things, or topics and use this information to draw inferences about competitor behavior.
Text categorization is the process of using statistical models or rules to rate a document’s relevance to a certain topic. For example, text categorization can be used to dynamically evaluate competitors’ product assortments on their websites.
Natural language processing tools go beyond text mining and categorization to make sense of language and even answer human questions; they may employ semantic analysis, statistical analysis, or some combination of the two. Natural language generation creates text for contexts such as sports reporting, business earnings reports, and investment reports in financial services.
Event streaming isn’t, strictly speaking, an analytical technology, but it is increasingly being combined with analytics to support real-time smart processes. The idea is to analyze data as it comes in—typically from voluminous and fast-flowing applications like the Internet of Things. The goal isn’t normally to perform advanced analytics on the data, but rather to “curate” it—which may involve filtering, combining, transforming, or redirecting it. This approach has also been employed for a decade or longer in fast-moving data in the financial services industry.
Simulation tools model business processes with a set of symbolic, mathematical, scientific, engineering, and financial functions. Much as computer-aided design (CAD) systems are used by engineers to model the design of a new product, simulation tools are used in engineering, R&D, and a surprising number of other applications. For example, simulations can be used as a training device to help users understand the implications of a change to a business process. They can also be used to help streamline the flow of information or products—for example, they can help employees of health care organizations decide where to send donated organs according to criteria ranging from blood type to geographic limitations.
Web or digital analytics is a category of analytical tools specifically for managing and analyzing online and e-commerce data. The bulk of web analytics are descriptive—telling managers of websites how many unique visitors came to a site, how long they spent on it, what percentage of visits led to conversions, and so forth. Some web analytics tools allow A/B testing—statistical comparisons of which version of a website gets more clicks or conversions. Web analytics has largely been a world unto itself in the organizational analytics landscape, but is slowly being integrated into the larger group of quantitative analysts.¹⁷ Another related category of analytical tools is focused on social media analytics—not only counting social activities, but also assessing the positive or negative sentiment associated with them.

Whether a custom solution or off-the-shelf application is used, the business IT organization must accommodate a variety of tools for different types of data analysis (see the box “Analytical Technologies” for current and emerging analytical tools). Employees naturally tend to prefer familiar products, such as a spreadsheet, even if it is ill suited for the analysis to be done.

Another problem is that without an overall architecture to guide tool selection, excessive technological proliferation can result. In a 2015 survey, respondents from large organizations reported that their marketing organizations averaged more than twelve analytics and data management tools for data-driven marketing.¹⁸ And there are presumably many other tools being used by other business functions within these firms. Even well-managed analytical competitors often have a large number of software tools. In the past, this was probably necessary, because different vendors had different capabilities—one might focus on financial reporting, another on ad hoc query, and yet another on statistical analysis. While there is still variation among vendors, the leading providers have begun to offer business intelligence suites with stronger, more integrated capabilities.

There is also the question of whether to build and host the analytical application onsite or use an “analytics as a service” application in the cloud. As with other types of IT, the answer is increasingly the latter. Leading software vendors are embracing this trend by disaggregating their analytics tools into “micro-analytics services” that perform a particular analytical technique. SAS executives, for example, report that a growing way to access the vendor’s algorithms and statistical techniques is through open application program interfaces, or APIs. This makes it possible to combine analytics with other types of transactional and data management services in an integrated application.

EQUIFAX EVOLVES ITS ANALYTICS ARCHITECTURE

In 2010 Tom consulted at Equifax, a leading provider of consumer credit and financial information, on an assessment of the company’s analytical capabilities. The company’s then and current CEO, Rick Smith, an advocate of competing on analytics, wasn’t sure the needed capabilities were present in the firm. The assessment found that a key barrier to success for Equifax was that analytics activities took too long to complete due to organizational and data-related issues. The company had the SAS statistical package, but the absence of an enterprise data warehouse made it difficult to assemble different types of data in the necessary time frame. There were pockets of strong analytical capability, but the company didn’t address analytics as an enterprise resource. The assessment also recommended the creation of a chief analytics officer role.

Now, seven years later, the climate and capability for analytics have changed dramatically. Prasanna Dhore is the company’s chief data and analytics officer (he participated in our 2006 “Competing on Analytics” study at a different company). Peter Maynard, who arrived at Equifax from Capital One (another early analytical competitor) is the SVP of Global Analytics. He told us that both the technology and the speed with which analytics are conducted have undergone major change under Prasanna’s leadership and Equifax’s infrastructure investment.

A big component of the change is the shift to a Hadoop-based data lake, which allows Equifax to store and assemble multiple types of data with ease and at low cost. The company leverages the SAS High-Performance Analytics platform to get maximum value out of the data that resides in Hadoop.

Maynard notes that this in-memory architecture has dramatically accelerated the speed of analytics at Equifax:

We have moved from building a model using a month of consumer credit data to two years’ worth, and we are always analyzing the trended data across time. We have a neural network model that looks at all the data and identifies trends in the consumer’s credit history. Whenever we introduce new data and variables into the model, we need to determine how they affect the trend. It used to take about a month to evaluate a new data source, but now it’s just a few days because of our much faster analytics environment.

Maynard said that the neural network model was developed using SAS’s Enterprise Miner offering. It’s a complex model, because it requires a set of “reason codes” that help explain specific credit decisions to consumers.

The Equifax analytics technology architecture also makes room for open-source tools like R and Python. Recent graduates in their data science group like them, Maynard notes, but he says that Equifax has a lot of existing SAS models and code, and many of its data scientists and quantitative analysts are comfortable with it. Maynard is also considering moving to SAS streaming analytics for even more speed and to employ SAS Model Risk Management for ongoing assessment and governance of models.

Maynard and his colleagues regularly attend SAS events and visit the company’s headquarters in Cary, North Carolina, for briefings and discussions. Equifax’s analytical leaders have made major changes in their approaches to analytics, and they are satisfied that SAS’s offerings are changing along with them.

Data Visualization

Since an analysis is only valuable if it is acted on, analytic competitors must empower their people to impart their insights to others through business intelligence software suites, data visualization tools, scorecards, and portals. Business intelligence software allows users to create ad hoc reports, interactively visualize complex data, be alerted to exceptions through a variety of communication tools (such as email, texts, or pagers), and collaboratively share data. (Vendors such as SAP, IBM, SAS, Microsoft, and Oracle sell product suites that include data visualization, business intelligence, and reporting solutions.) Commercially purchased analytical applications usually have an interface to be used by information workers, managers, and analysts. But for proprietary analyses, these tools determine how different classes of individuals can use the data. For example, a statistician could directly access a statistical model, but most managers would hesitate to do so.

The current generation of visual analytical tools—from vendors such as Tableau and Qlik and from traditional analytics providers such as SAS—allow the manipulation of data and analyses through an intuitive visual interface. A manager, for example, could look at a plot of data, exclude outlier values, and compute a regression line that fits the data—all without any statistical skills.

Because they permit exploration of the data without the risk of accidentally modifying the underlying model, visual analytics tools significantly increase the population of users who can employ sophisticated analyses. Over the past several years they have made “analytics for the masses” much more of a reality than a slogan. At Vertex Pharmaceuticals, for example, longtime CIO Steve Schmidt (now a medical device analytics entrepreneur) estimated several years ago that only 5 percent of his users could make effective use of algorithmic tools, but another 15 percent could manipulate visual analytics. Our guess is that the percentage of potential visual analytics users has increased dramatically with the availability of these new tools.

Deployment Processes

This element of the analytics architecture answers questions about how the organization creates, manages, implements, and maintains data and applications. Great algorithms are of little value unless they are deployed effectively. Deployment processes may also focus on how a standard set of approved tools and technologies are used to ensure the reliability, scalability, and security of the IT environment. Standards, policies, and processes must also be defined and enforced across the entire organization. There may be times when a particular function or business unit will need its own analytics technology, but in general it’s a sign of analytical maturity for the technology to be centrally managed and coordinated. Some firms are beginning to use structured “platforms” to manage deployment process. One firm, FICO, has a deployment platform and discusses the deployment issue as managing “the analytics supply chain.”¹⁹

Latter-stage deployment issues such as privacy and security as well as the ability to archive and audit the data are of critical importance to ensure the integrity of the data and analytical applications. This is a business as well as a technical concern, because lapses in privacy and security (for example, if customer credit card data is stolen or breached) can have dire consequences. One consequence of evolving regulatory and legal requirements is that executives can be found criminally negligent if they fail to establish procedures to document and demonstrate the validity of data used for business decisions.

Conclusion

For most organizations, an enterprise-wide approach to managing data and analytics will be a major departure from current practice; it’s often been viewed as a “renegade” activity. But centralized analytical roles—a chief data and analytics officer, for example—and some degree of central coordination are signs of a company having its analytics act together. Top management can help the IT architecture team plan a robust technical environment by helping to establish guiding principles for analytical architecture. Those principles can help to ensure that architectural decisions are aligned with business strategy, corporate culture, and management style.²⁰ To make that happen, senior management must be committed to the process. Working with IT, senior managers must establish and rigorously enforce comprehensive data management policies, including data standards and consistency in data definitions. They must be committed to the creation and use of high-quality data—both big and small—that is scalable, integrated, well documented, consistent, and standards-based. And they must emphasize that the analytics architecture should be flexible and able to adapt to changing business needs and objectives. A rigid architecture won’t serve the needs of the business in a fast-changing environment. Given how much the world of analytics technology has changed in the last decade, it’s likely that the domain won’t be static over the next one.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Eight: The Architecture of Analytics and Big Data

Create new playlist

Sign In

Sign Up