© Thomas W. Dinsmore 2016

Thomas W. Dinsmore, Disruptive Analytics, 10.1007/978-1-4842-1311-7_7

7. Analytics in the Cloud

The Disruptive Power of Elastic Computing

Thomas W. Dinsmore

(1)Newton, Massachusetts, USA

Several years ago, SAS, one of the leading commercial business analytics software vendors, held an annual sales meeting for financial services account executives. Jim Goodnight, founder and CEO, spoke to the assembled sellers and led a Q&A session.

“What is our strategy for cloud computing?” asked one of the sales reps.

“There’s only one thing you need to know about cloud computing,” drawled Goodnight. “It’s all BS.” He compared cloud computing to mainframe service bureaus in the 1970s, which typically offered metered pricing based on usage. Goodnight repeated the comparison in a recent interview with The Wall Street Journal.1

Of course, everyone understands that modern cloud computing isn’t the same as mainframe timesharing, any more than a 2016 Ford Focus is the same as a 1976 Ford Pinto because they both have four wheels. Goodnight is right, though, to point out some similar principles—sharing IT resources across multiple users and metered pricing.

Whether cloud computing is radically new or “all BS,” as Goodnight suggests, it’s clear that cloud is eating the IT world. A leading analyst forecasts2 that cloud data center workloads will triple from 2013 to 2018; moreover, 78% of all data center workloads will be in the cloud by 2018.

We note, too, that shortly after Goodnight’s original remarks, SAS announced3 plans to invest $70 million in a cloud computing data center.

Cloud Computing Fundamentals

The National Institute of Standards and Technology (NIST) defines cloud computing as:

…a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources…that can be rapidly provisioned and released with minimal management effort or service provider interaction.4

Cloud computing has five essential characteristics:

  • On-demand self-service: Users can provision the computing resources they need, such as server time and storage space, without help from a systems administrator.

  • Broad network access: Users can access computing services over a network from diverse connected platforms, including mobile phones, tablets, laptops, and workstations.

  • Resource pooling: The service provider pools computing resources to serve multiple consumers, with physical resources assigned and reassigned on demand. The user has no control or knowledge of the specific physical resources assigned.

  • Rapid elasticity: The consumer can acquire and release resources on demand; to the consumer, computing resources appear to be unlimited, and can be acquired and released in any quantity at any time.

  • Measured service: The cloud provider measures computing services on appropriate dimensions, such as server time or storage volume and time. Resource usage is monitored and reported in a manner that is transparent to the user and provider.

The NIST defines four cloud deployment models: public, private, community, and hybrid.

  • Public cloud: The cloud provider owns and maintains a cloud platform and offers services to members of the general public.

  • Private cloud: The cloud provider provisions a cloud for its own exclusive use. For example, a company builds and maintains a cloud computing platform and offers services to its own employees and contractors. The cloud may be owned and managed by the organization, a third party, or a combination thereof, and may reside on- or off-premises.

    A virtual private cloud (VPC) uses a software framework to provide the equivalent of a private cloud on public cloud infrastructure. VPCs are well-suited to the needs of small and medium businesses (SMBs).

  • Community cloud: Groups of organizations, such as trade associations or affinity groups, provision a community cloud for the exclusive use of member organizations. Like private clouds, community clouds may be owned by one or more of the affiliated organizations, a third party, or a combination thereof, and may reside on- or off-premises.

  • Hybrid cloud: An organization combines infrastructure from multiple deployment models (private, community, or public). For example, a company runs its own private cloud, which it supplements with public cloud during periods of peak workload. A software framework unifies the hybrid cloud, so a user does not know whether a job runs on the private or public infrastructure. A full 70% of organizations surveyed by IDC report a hybrid approach to cloud computing.

Consultant IDC projects5 worldwide public cloud revenue to grow from nearly $70 billion in 2015 to more than $141 billion in 2019. In 2016, Amazon Web Services dominates the market; Microsoft ranks second and is growing rapidly. Other vendors include Google, VMware, IBM, DigitalOcean, and Oracle.6

IDC estimates7 total spending on private cloud infrastructure of $12.1 billion in 2015; consultant Wikibon estimates8 a much lower level of $7 billion. (Private cloud is harder to measure than public cloud.) Leading suppliers in the market include Hewlett Packard Enterprise (HPE), Oracle, VMware, EMC, and IBM. The private cloud market is much more fragmented than the public cloud market, and the top ten vendors control only about 45% of spending.

One consultant projects9 total spending on VPCs to exceed $40 billion by 2022. Community clouds are a much smaller presence in the market, with total spending projected10 to reach $3.1 billion by 2020.

Within the NIST’s broad deployment models, there are three distinct service models for cloud computing:

  • Infrastructure-as-a-Service (IaaS): The most basic cloud service: the provider offers fundamental processing, storage, and networking services together with a virtual resource manager. The end user installs and maintains the operating systems and application software.

  • Platform-as-a-Service (PaaS): The provider offers a complete computing platform, including operating system, programming languages, databases, and web server. The end user develops or implements applications on the computing platform, but the cloud provider is responsible for installation and maintenance of the supported components.

  • Software-as-a-Service (SaaS): The end user has direct access to application software. There are two distinct SaaS models: direct selling and marketplace selling. Under a direct selling model, the software vendor handles software installation and maintenance and either hosts the software on its own infrastructure or contracts with a cloud provider for IaaS or PaaS services. Under the marketplace selling model, the cloud provider creates a platform where end users can shop and select software from many vendors; the cloud provider handles installation and maintenance of the application software.

Consultancy Technology Business Research (TBR) estimates that SaaS currently accounts for about 60% of the public cloud market; Salesforce is the clear leader in SaaS. The PaaS market is relatively small, accounting for about 10% of the 2016 market; TBR projects a 17% growth rate through 2020. Microsoft leads in PaaS, followed by IBM, Salesforce, and Google.

As a general rule, IaaS services target developers and IT organizations, who configure and manage the platform for the benefit of their end users. PaaS services target developers and power users willing and able to perform basic software configuration tasks. SaaS services target business users.

The distinction among the service models tends to be blurred in practice. Amazon Web Services, for example, is generally considered to be an IaaS provider; but AWS offers PaaS and even SaaS services. As cloud providers seek to expand the reach and profitability of their offerings, we can expect they will seek to move up the value chain, offering higher level services and solutions.

Cloud computing offers benefits similar to managed software hosting, but the business model and technologies differ. Software hosting dates back at least to the mainframe service bureau model of the 1960s; in the 1990s, service bureaus evolved into application service providers (ASPs) with a more diverse technology stack.

Unlike cloud service providers, ASPs host software and applications for customers under contract. The customer licenses software from a software vendor; instead of deploying the software on-premises, however, the hosting provider implements and manages the software under a long-term contract. The ASP generally does not pool IT infrastructure across customers; instead, the ASP builds the cost of dedicated infrastructure into contract costs. Contract terms generally call for fixed periodic payments, with a schedule of extra services billed as used.

ASPs tend to operate as partners of large software vendors. The software vendor’s marketing muscle helps the ASP’s business development effort, and the ASP’s hosted model helps the software vendor sell to companies with limited IT skills and capacity. Many large software vendors, like SAS, operate their own in-house ASP operations.

Cloud computing rests on three innovations:

Virtualization. Hardware virtualization is a process that creates one or more simulated computing environments (“virtual machines” or “instances”) from a single physical machine. Virtualization improves IT efficiency, since individual applications rarely make full use of modern computer hardware.

The primary goal of virtualization is to manage workloads by making computing more scalable. Virtualization is not new, but has progressively evolved for decades. Today it can be applied to a wide range of system layers, including operating system, applications, workspaces, services, and hardware components (including memory, storage, file systems, and networks).

Virtualization provides other benefits in addition to improved hardware utilization. For example, IT organizations can more easily administer security and access control on virtual instances, and they can quickly redeploy an instance from one physical machine to another when necessary for maintenance or failover.

Service-Oriented Architecture (SOA). Service-oriented architecture is a computer design architecture in which application components provide services to other components through a communications protocol. Services are self-contained, loosely coupled representations of repeatable business activities. Breaking complex applications down into self-contained services simplifies development, maintenance, distribution, and integration.

Autonomic Computing. In 2001, IBM coined11 the term Autonomic Computing to refer to an approach to distributed computing that seeks to build self-regulating systems with autonomic components. Autonomic components can configure, heal, optimize, and protect themselves.

The technologies that enable the cloud are not new. The extreme growth of companies like Amazon, Google, and Facebook and the commensurate demands on their IT infrastructure produced a perfect storm of innovation and skills in virtualization, provisioning, energy consumption, security, and other disciplines required to run a modern data center.

The Business Case for Cloud

In this section, we discuss the economic benefits of cloud and the use cases that drive firms to use the cloud, together with concerns about data movement and security.

Cloud Economics

Is cloud computing more expensive or less expensive than on-premises computing? The answer is a matter of some controversy. One reason the problem is challenging is that many firms do not accurately measure the total costs of their IT infrastructure; they measure “hard costs” of equipment and purchased software, but fail to measure “soft costs” of IT personnel. Cloud computing costs, on the other hand, are tangible; at the end of the billing period the vendor sends an invoice, so costs are clear. This tends to create a bias against cloud in some organizations, especially so if IT executives see cloud computing as a threat.

Cloud computing vendors enjoy a number of advantages compared to on-premises IT organizations.

Economies of Scale. Vendors like Amazon Web Services and Google purchase hardware in huge quantities. They have highly capable purchasing organizations who negotiate hard bargains with hardware vendors.

Economies of Skill. Public vendors hire the best and brightest people to manage their data centers. They are extremely competent managers with deep experience running massive worldwide networks. In every aspect of data center management, from virtualization to energy consumption, they are simply better at their jobs.

Utilization Economies. By pooling resources and spreading them over a large worldwide user base, public cloud vendors achieve a much higher level of utilization for IT infrastructure than individual firms are able to accomplish. This means cloud vendors can charge a lower unit price for computing resources.

As a result of these economies, combined with aggressive competition among the leading cloud vendors, public cloud charges declined12 by double digit percentages each year prior to 2016.

Of course, these economies do not apply across the board. Some large organizations have the scale and the skill to deliver unit computing costs that match or beat cloud pricing. Organizations with accurate total cost of ownership (TCO) metrics can confidently assert that it makes good economic sense to keep computing on-premises.

But there are a number of logical use cases for the cloud, even for firms with lower unit costs of computing.

Predictable Peak Workload. Many organizations have workloads with predictable peaks that are significantly higher than the base workload. For example, retailers process much higher transaction volumes during the peak Christmas season, and most organizations have high month-end processing for accounting and financial applications. If an organization builds IT infrastructure to meet its peak workloads, that infrastructure will be underutilized most of the time. A hybrid approach that uses cloud for peak workloads and on-premises systems for base workload is the optimal approach for these organizations.

Unexpected Peak Workload. A similar calculus applies for organizations with an unexpected surge in IT workload. Rather than rushing to add more infrastructure to support a sudden and unexpected increase in demand, it makes sense for the organization to use the cloud to support the incremental workload, at least until a root-cause analysis is complete.

Variable Cost Businesses. Many businesses in the services industry are inherently variable cost businesses; they are thinly capitalized and operate under a model where costs are charged back to clients and projects. Consulting firms, advertising agencies, marketing services providers, analytical boutiques, and other similar firms generally prefer to avoid investing in overhead of any kind, including IT infrastructure. Firms in this category may choose to rely exclusively on the cloud to support project work.

Time to Value. In many cases, executives are less interested in costs and more interested in time to value. This is often true for rapidly growing businesses; it may also be true for firms whose IT organizations are operating at or near capacity. For these firms, cloud offers immediate capacity and an ability to scale up quickly. Speed and time to value are key value propositions stressed by SaaS vendors like Salesforce, which cater to business needs for rapid capability.

Business Unit Autonomy. Business units sometimes choose to work independently of the IT organization. This may be due to real or perceived shortcomings of IT’s ability to support the business effectively, political conflict between executives, or competition for scarce IT resources from other business units. In any case, turning to the cloud offers the business unit executive an opportunity to control the IT resources needed for a mission-critical initiative.

Pilots and POCs. Certain business analytics use cases are very attractive for the cloud. Among these are pilot projects and proofs of concept (POCs): projects designed to determine the viability of a specific solution. Cloud computing is attractive for these projects because an organization can quickly provision a temporary environment without the risk of a sunk infrastructure cost. If the project proves viable, the organization can keep the application in the cloud or port it to an on-premises platform. If the pilot or POC is not successful, the organization simply shuts it down.

Ad Hoc Analytics. Strategic ad hoc projects are also an attractive use case for cloud. Many analyses performed for top executives are not repeatable; the analysis will be performed once and never again. Conventional approaches to data warehousing do not apply to ad hoc analysis, which is surprisingly prevalent in most enterprises. The cloud enables analysts to quickly create a temporary datastore with as much storage and computing power as needed.

Model Training. The model training phase of machine learning is particularly well suited to elastic computing in the cloud, for several reasons:

  • Model training is project-oriented rather than a recurring production activity. (Model scoring, on the other hand, is production oriented.)

  • Machine learning algorithms require a lot of computing power, but for a short time only.

  • Machine learning projects often support development projects performed in advance of IT infrastructure investments.

  • In areas such as marketing, projects are often started with little lead time and require rapid provisioning.

  • To deliver machine learning projects, organizations frequently engage analytic service providers (ASPs), who must account for infrastructure costs.

Many enterprises outsource ad hoc analysis to consulting firms and analytic service providers. Cloud computing is especially important to these firms, because the cloud’s low cost and measured service enables them to explicitly match computing costs to client projects and to expand quickly without capital investment.

Data Movement

Data movement is always a concern when working with Big Data. A process that takes minutes with a small data set can take hours or days when we measure data in petabytes. In the Big Data era, minimizing data movement is a key governing principle.

Nevertheless, in business analytics at least some data movement is inevitable. IT organizations rarely permit production systems that serve as data sources to be used for analytics; in any case, these systems generally lack the necessary tooling. Hence, in most cases each piece of data will be copied and moved at least once, when it is transferred from a source system to an analytic repository.

Moving data to and from the cloud poses even greater concerns than moving data internally. Public networks can be a bottleneck, and security concerns dictate an encrypt/decrypt operation at either end to avoid a data breach. To mitigate the problem, cloud providers and their alliance partners offer a number of products and services.

Dedicated Physical Connections. Cloud providers offer customers the ability to connect directly to the provider’s data center through a dedicated private network connection. Services like this are a good choice for organizations seeking to move regular updates to a cloud-based environment for analytics.

Data Transfer Accelerators. The leading cloud providers operate global networks of data centers and will work with customers to optimize data transfer. For example, an organization with operations in multiple countries may be able to minimize data transfer costs by transferring data locally in each country, then consolidating the data within the cloud provider’s network.

Portable Storage Appliances. Secure transportable storage devices are now available to store up to 80 terabytes (TB); larger files can be split across multiple devices. Organizations seeking to move data to or from the cloud copy data to the secure device, then send the device to the destination data center by overnight delivery. This method is excellent in scenarios where a mass of data will be moved all at once, as in a database migration or in the early stages of an analysis project.

Storage Gateways. For organizations seeking to implement a hybrid architecture that mixes on-premises and cloud platforms, a storage gateway may be the right solution. Storage gateways (from firms such as Aspera (IBM), CloudBerry, NetApp, and Zerto) reside on customer premises and broker between on-premises and cloud storage, maintaining a catalogue of data and its location. Storage gateways handle compression and encryption.

Database Migration Services. When the organization seeks to migrate or update data from an on-premises relational database to a database in the cloud, it is highly desirable to maintain the structure and metadata. Extract and reload operations take time, because the database administrator must map the structure of the source database to the target database; they are also subject to human error in the mapping process. Tools from vendors like Attunity enable the organization to copy data directly from database to database, on-premises, in the cloud, or both.

Of course, cloud platforms are the logical site for business analytics when the source data is already in the cloud and does not need to be transferred. Some businesses have built their operations around cloud computing; for these organizations, data movement to and from the cloud is not an issue.
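To illustrate the mechanics of a routine transfer, the sketch below uses the boto3 Python library to push a local extract to Amazon S3 with server-side encryption enabled. The bucket, key, and file names are hypothetical, and the example is illustrative only; it is not a substitute for the dedicated transfer services described above.

```python
# Minimal sketch: push a local extract to S3 with server-side encryption.
# Bucket name, key, and file path are hypothetical; credentials are assumed
# to come from the standard AWS credential chain (environment, config, or role).
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="daily_extract.csv",                  # local file to move
    Bucket="example-analytics-staging",            # hypothetical bucket
    Key="staging/2016/04/daily_extract.csv",
    ExtraArgs={"ServerSideEncryption": "AES256"},  # encrypt at rest on arrival
)
```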

Security

Vendors of on-premises software cultivate the perception that the cloud is less secure than on-premises data management. In many organizations, there are executives who believe that their data is most secure when it remains on-premises; they oppose moving data to the cloud.

The data suggest otherwise. On-premises facilities suffered13 all ten of the worst data breaches in 2015. A cybersecurity report issued by the Association of Corporate Counsel reveals14 that employee error is the leading cause of data breaches; a separate analysis by privacy and data protection specialists Baker & Hostetler LLP likewise found that employee negligence is the biggest cause of breaches.

Effective security stresses good management policies and practices rather than the physical location of the data. The leading public cloud vendors are very good at managing data center security; they go to great lengths to certify compliance with security standards published by ISO and NIST and as required under HIPAA and other governing legislation. There are no known security breaches15 recorded by the leading cloud providers in a decade of service.

Executives surveyed16 by consultant Technology Business Research (TBR) cited security as their top consideration in cloud decision making. However, respondents also said they believe having their data stored and managed by an expert third party improves overall security. In the same survey, respondents indicated that security is the primary consideration favoring a private cloud; on all other dimensions, respondents rated a public cloud equal to or better than a private cloud.

Personally Identifiable Information (PII) is especially sensitive data, since its unauthorized disclosure directly impacts consumer privacy and exposes the organization to serious consequences. The definition of PII is surprisingly broad, because bad actors can combine PII from multiple sources to create a detailed profile of the prospective victim. A person’s Social Security Number (SSN) is obviously PII; but fraudsters can predict17 a victim’s SSN from date of birth and place of birth.

Personally Identifiable Information (PII) is information that can be used to identify, contact, or locate an individual. In the United States, the National Institute of Standards and Technology publishes standards for what constitutes PII and how to manage it.

While operational systems must capture and retain PII, it is often not needed in business analytics. There are exceptions to that generalization; an analyst may want to apply machine learning techniques to customer surnames to identify ethnicity or geocode a customer’s address to perform spatial analysis. When it is essential to work with PII, good working methods minimize the security risk; we discuss these in Chapter Ten.

Analytics in the Public Cloud

In this section we examine cloud services pertinent to business analytics from Amazon Web Services, Microsoft, and Google. We focus on managed services for storage, compute, Hadoop, relational databases, business intelligence, and machine learning; all three vendors offer many other services, which could be pertinent for some projects. The intent is to cover the most widely used services.

We show pricing information for reference and comparison, but the reader should bear in mind that cloud vendors can and do change prices frequently.

While we focus on managed services, all three vendors offer basic compute and storage services, which enable an organization to implement any licensed software in the cloud.

Amazon Web Services

Amazon Web Services (AWS) offers more than 50 managed services. Of these, the services most pertinent to business analytics are storage services, compute services, Hadoop services, database services, business intelligence services, and machine learning services.

Storage Services

For any cloud platform, data storage is the most fundamental service, for two reasons: first, it serves as the initial staging area for data when it is first uploaded to the cloud; second, it provides persistent storage after an application finishes processing.

In AWS, most computing instances include some local storage, which is available as long as the instance is available. However, anything saved in the local storage will be lost when the user’s lease on the compute instance expires. Thus, it makes sense to separate long-term file storage from compute instances, which users may want to rent briefly, then release.

Amazon Simple Storage Service (S3) was the first service offered by AWS, in March 2006. It is a low-cost scalable storage service that stores computer files as large as five terabytes. Files can be in any format.

Applications interact with S3 through popular web services interfaces, such as REST, SOAP, and BitTorrent; this enables S3 to serve as the back-end storage for web applications. According to AWS, the S3 service uses the same technology that Amazon.com uses for its ecommerce platform.18

AWS offers three different storage classes at different price points: Standard, Infrequent Access, and Glacier.

  • Standard storage offers immediate access with no minimum file size, no minimum storage duration, and no retrieval fees.

  • Infrequent Access storage offers immediate access with a retrieval fee, a minimum file size, and a 30-day minimum storage. Monthly costs are lower as long as the user does not retrieve files frequently.

  • Glacier storage offers access with up to four hours’ latency with a retrieval fee and a 90-day minimum storage.

AWS charges monthly fees for S3 usage by the gigabyte (GB). Pricing depends on the storage class and total storage used. For example, in the US East region, the price per GB per month for Standard storage up to 1 terabyte (TB) is $0.03; for Infrequent Access storage, the price is $0.0125 per GB per month; for Glacier storage, $0.007. Prices for all three services decline with volume. As is the case for all AWS services, prices may vary by region.19
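Storage class is chosen object by object at write time, and lifecycle rules can migrate objects between classes automatically. The boto3 sketch below writes an object directly to Infrequent Access storage and adds a rule that moves older objects to Glacier; the bucket, key, and rule names are hypothetical.

```python
# Write an object directly to Infrequent Access, then add a lifecycle rule
# that moves anything under the "archive/" prefix to Glacier after 90 days.
import boto3

s3 = boto3.client("s3")

with open("q4_results.parquet", "rb") as f:
    s3.put_object(
        Bucket="example-analytics-staging",          # hypothetical bucket
        Key="archive/2015/q4_results.parquet",
        Body=f,
        StorageClass="STANDARD_IA",                  # Infrequent Access class
    )

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-staging",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Prefix": "archive/",
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```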

Compute Services

Amazon Elastic Compute Cloud (EC2) is one of the earliest services offered by AWS and remains a foundation of its cloud platform. EC2 enables users to rent virtual machines to provision their own applications, paying for machine time by the hour.

Users can start and end server sessions as needed, choosing from a wide range of instances, priced according to computing power. AWS’s available server types change constantly, falling into five different categories:

  • General Purpose instances provide a balance of compute, memory, and network resources. Some instances in this category are burstable, which means they offer a baseline capacity with the ability to “burst” temporarily above the baseline.

  • Compute Optimized instances feature the highest performing processors (measured by CPU speed).

  • Memory Optimized instances have larger amounts of Random Access Memory (RAM) per CPU.

  • GPU instances include one or more NVIDIA GPUs.

  • Storage Optimized instances include high I/O instances with SSD-backed storage and dense storage instances with very large hard drives per CPU.

Within each category, AWS offers sizes ranging from small instances equivalent to a laptop computer to extra large instances with dozens of virtual CPUs. Instances are preconfigured with an operating system image; users can choose from among Linux (Amazon, Red Hat, SUSE, or CentOS), Windows (with and without SQL Server), Debian, and other operating systems.

EC2 users can predefine virtual appliances, called Amazon Machine Images (AMIs). These consist of an operating system and any other software needed to run an application. AMIs make it cost effective to run complex software stacks on EC2, since the user need not use valuable instance time installing and configuring software.
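A prebuilt AMI reduces provisioning to a single API call. The boto3 sketch below launches one general purpose instance from a custom image; the image ID, key pair, and region are hypothetical placeholders.

```python
# Launch a single on-demand instance from a prebuilt machine image (AMI).
# The image ID and key pair name are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical AMI with the analytics stack preinstalled
    InstanceType="m4.large",           # general purpose instance
    KeyName="analytics-keypair",       # hypothetical SSH key pair
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```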

AWS offers three main pricing models for EC2: on-demand, reserved, and spot pricing. Under on-demand pricing, the user pays by the hour with no commitment. Pricing varies by region and operating system; in April 2016, rates on Amazon Linux in the US East Region ranged from $0.0065 to $6.82 per hour.

Reserved instance pricing provides the user with a discount in return for a commitment to use the instance for a defined term of one to three years. The discount is significant. For example, for a large general purpose instance (M4), the user pays $1,173 in advance for a three-year term, which works out to $0.0446 per hour; by comparison, the on-demand rate is $0.12 per hour, almost three times higher. (Pricing is on Amazon Linux, US East Region, April 2016.) Of course, that comparison is valid only if the user runs an application on the instance constantly for the entire term of the contract.

The difference between on-demand and reserved instance pricing is comparable to the difference between renting a hotel room and leasing an apartment. A traveler visiting a city needs a room for only a limited number of nights and is willing to pay a relatively high price per night for the short-term stay. On other nights, the hotel rents the room to other travelers. A resident, on the other hand, needs a place to stay all of the time and is willing to make a fixed commitment in return for exclusive use of the apartment for the term of the lease.
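A back-of-the-envelope calculation shows where the break-even point falls. Using the April 2016 list prices cited above, the short Python sketch below computes the effective hourly rate of the reserved instance and the utilization at which it becomes cheaper than paying on demand.

```python
# Break-even utilization for a reserved vs. on-demand M4 instance,
# using the April 2016 list prices cited in the text.
HOURS_PER_YEAR = 365 * 24
TERM_YEARS = 3

upfront_reserved = 1173.00   # all-upfront payment for a three-year term
on_demand_rate = 0.12        # dollars per hour

reserved_hourly = upfront_reserved / (TERM_YEARS * HOURS_PER_YEAR)
break_even = reserved_hourly / on_demand_rate

print(f"Effective reserved rate: ${reserved_hourly:.4f}/hour")
print(f"Break-even utilization: {break_even:.0%} of hours in the term")
# Roughly $0.045 per hour and about 37%: if the instance runs less than
# about 37% of the time, on-demand pricing is the better deal.
```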

The Spot pricing model works like an auction market. Users bid for unused Amazon EC2 capacity; instances are charged the Spot Price, set by AWS and fluctuating with supply and demand.

Hadoop Services

In Chapter Four we covered the Hadoop ecosystem. Amazon Elastic MapReduce (EMR) is AWS’s managed service offering for Hadoop. AWS first offered EMR in April 2009.

EMR is elastic in two respects. Users can deploy multiple clusters without limit; clusters can be configured differently and use different instance types while still sharing storage. Users can also resize existing clusters, adding or dropping nodes even while a job is running in the cluster.

AWS offers EMR users the ability to work with different file systems, including S3, HDFS (the “native” Hadoop file system), Amazon DynamoDB (a NoSQL database service), Amazon RDS, and Amazon Redshift. EMR includes commonly used components from the Hadoop ecosystem, including HBase, Hive, Hue, Impala, Pig, Presto, Spark, and Zeppelin.

AWS prices EMR by the instance-hour, with rates that vary according to the type of instance used in the cluster; users also pay the underlying EC2 costs. Hence, the total charge for a 10-node EMR cluster on large general purpose instances is the EC2 charge for the 10 instances plus the EMR charge for those instances.
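Provisioning a cluster is likewise a single API call. The boto3 sketch below creates a small cluster with Spark and Hive installed; the release label, instance types, and IAM role names are assumptions and should be adjusted to what an account actually provides.

```python
# Spin up a small EMR cluster with Spark and Hive for interactive work.
# Release label, instance types, and IAM role names are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster = emr.run_job_flow(
    Name="adhoc-analytics-cluster",
    ReleaseLabel="emr-4.6.0",                      # assumed release current in 2016
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 10,                       # 1 master + 9 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,       # keep the cluster up between jobs
    },
    JobFlowRole="EMR_EC2_DefaultRole",             # default roles created by the AWS console
    ServiceRole="EMR_DefaultRole",
)
print(cluster["JobFlowId"])
```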

Database Services

Amazon Relational Database Service (RDS) is a distributed database service first offered in October 2009. As of April 2016, AWS offers managed services for MySQL, Oracle Database, Microsoft SQL Server, PostgreSQL, MariaDB, and Amazon Aurora (a high-availability version of MySQL).

Amazon Redshift is a petabyte-scale columnar data warehouse based on technology licensed from ParAccel, a data warehouse vendor since acquired by Actian. Redshift is suitable for SQL analysis on very large volumes of data.

Users can, in principle, set up their own databases on leased EC2 instances. The managed services relieve the user of the need to install, configure, provision, and patch the database software, and they simplify the process of scaling up compute and storage. AWS also provides automated backup and database snapshots.
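Because Redshift speaks the PostgreSQL wire protocol, standard client libraries work unchanged. The sketch below runs an analytic query with the psycopg2 Python library; the cluster endpoint, database, credentials, and table are hypothetical.

```python
# Run an analytic query against a Redshift cluster over the PostgreSQL protocol.
# The endpoint, database, credentials, and table are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="example-dw.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,                    # Redshift's default port
    dbname="analytics",
    user="analyst",
    password="...",               # use a secrets store in practice
)

with conn.cursor() as cur:
    cur.execute("""
        SELECT region, SUM(revenue) AS total_revenue
        FROM sales
        GROUP BY region
        ORDER BY total_revenue DESC;
    """)
    for region, total in cur.fetchall():
        print(region, total)

conn.close()
```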

Business Intelligence Services

AWS users can set up any BI tool in EC2 and query databases in RDS or Redshift. In October 2015, AWS announced20 a public preview of its own business intelligence service branded as Amazon QuickSight. QuickSight offers users the ability to connect to data in Redshift, RDS, EMR, S3, and many other data sources, and it uses a distributed in-memory calculation engine to deliver fast interactive visualization. AWS expects to release the service to production in 2016, priced at a monthly fee per user.

Machine Learning Services

Amazon Machine Learning (ML) is a managed service that works with data stored in Amazon S3 files, Amazon Redshift, or MySQL databases in Amazon Relational Database Service. The service includes tools for data visualization, exploration, and transformation and a limited number of machine learning algorithms.

For prediction, the service supports APIs for batch or real-time scoring. The service does not support model import or export.
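For real-time scoring, a trained model is exposed through a prediction endpoint and called one record at a time. The boto3 sketch below shows the shape of such a call; the model ID, endpoint, and feature names are placeholders.

```python
# Score a single record against a trained Amazon Machine Learning model
# in real time. Model ID, endpoint, and feature names are hypothetical.
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

result = ml.predict(
    MLModelId="ml-ExampleModelId",
    Record={"age": "42", "plan": "premium", "tenure_months": "18"},
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
print(result["Prediction"])
```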

AWS prices the service at a set rate per hour to analyze data and build models, plus separate volume-based prices for batch and real-time prediction.

Marketplace

AWS also hosts a marketplace platform that enables software vendors and other sellers to offer their capabilities. Vendors offer software as Amazon Machine Images, as AWS CloudFormation Stacks, or as Software-as-a-Service.

AWS CloudFormation is a service that helps users model and set up AWS resources. It offers broader capability than Amazon Machine Images.

In most cases, vendors offer trial versions of their software free of license fees. Others charge hourly, monthly, or annual license fees, with AWS handling metering and billing. In still other cases, users must license the software separately through other channels. In all cases, users are responsible for AWS storage and compute charges.

Microsoft Azure

Microsoft has a unique approach to cloud computing that reflects its strength in on-premises computing and in enterprise desktop software. The backbone of Microsoft’s cloud services is Microsoft Azure, a specialized operating system that manages computing and storage resources. (The Microsoft Azure brand applies to both the cloud platform as a whole and to the cloud operating system.)

Storage Services

Microsoft offers several different types of storage for different types of data:

  • Azure Table Storage for structured data

  • Azure Blob (Binary Large OBjects) Storage for documents, videos, backups, and other unstructured text or binary data

  • Azure Queue Storage for messages

  • Azure File Storage for Server Message Block (SMB) files

These storage types are available in four different data redundancy options. The least expensive option retains three copies of the data within a single data center; the most expensive option distributes six copies to two geographically separated data centers, with read access for high availability.

Each of the four storage types carries a separate rate card. Blob storage is the least expensive and file storage the most expensive.

Microsoft also offers a premium storage service based on solid state drive (SSD) storage for I/O intensive workloads.
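Blob storage is the workhorse for analytics data. The sketch below uses the azure-storage Python package current at the time of writing to upload a file; the account name, key, and container are hypothetical.

```python
# Upload a local file to Azure Blob Storage with the azure-storage package
# (the BlockBlobService client current circa 2016). Account credentials,
# container, and blob names are hypothetical.
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(
    account_name="exampleanalytics",
    account_key="...",            # use a secure configuration source in practice
)

blob_service.create_container("staging")
blob_service.create_blob_from_path(
    container_name="staging",
    blob_name="2016/04/daily_extract.csv",
    file_path="daily_extract.csv",
)
```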

Compute Services

Like AWS, Microsoft offers virtual computing platforms in a wide range of instance types and operating systems. Microsoft also distinguishes between Basic instances designed for development and test environments, and Standard instances for production environments.

Computing platforms are not limited to the Windows operating system; Microsoft Azure offers Linux instances, including Red Hat, SUSE, CentOS, and Canonical Ubuntu distributions. Not surprisingly, Microsoft offers instances preconfigured with Microsoft applications, like SQL Server and SharePoint.

Hadoop Services

Microsoft Azure HDInsight is a managed service for Hadoop based on Hortonworks’ HDP distribution, with some changes to reflect Microsoft standards and architecture. HDInsight is elastic in a manner similar to AWS’s EMR; users can add or drop clusters, or they can add or drop nodes to existing clusters.

HDInsight includes the analytical components that are bundled with the standard Hortonworks distribution: HBase, Hive, Pig, Phoenix, Spark, and Storm. For an extra charge, users can also access Microsoft R Server, a distributed machine learning engine with R bindings.

Microsoft prices HDInsight per node. Prices vary based on the type of compute instance used for each node.

Database Services

Microsoft Azure SQL Database is a managed Database-as-a-Service offering based on Microsoft SQL Server. The service is comparable to AWS Relational Database Service, but limited to a single core database product. Database functionality is a subset of Microsoft SQL Server functionality.

Pricing for the service is in three tiers at different service levels. The unit of measure is the Database Transaction Unit (DTU), a measure of a database instance’s ability to process transactions.

For petabyte-scale SQL processing, Microsoft Azure offers SQL Data Warehouse, a columnar database comparable to AWS Redshift. Unlike Redshift, SQL Data Warehouse decouples compute and storage, so users can shut down the query engine without losing stored data. SQL Data Warehouse also includes a capability to query non-relational sources, such as delimited files, ORC storage, HDFS, and Azure Blob Storage.

Microsoft bills SQL Data Warehouse compute resources in Data Warehouse Units (DWU), a measure of query performance. Storage is billed separately at Azure Blob Storage rates.
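From a client’s perspective, SQL Data Warehouse is queried like any SQL Server endpoint. The sketch below uses the pyodbc Python library; the server name, database, credentials, and installed ODBC driver are all assumptions.

```python
# Query Azure SQL Data Warehouse over ODBC. Server, database, credentials,
# and the installed ODBC driver name are all assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 13 for SQL Server};"
    "SERVER=example-dw.database.windows.net;"
    "DATABASE=analytics;"
    "UID=analyst;PWD=..."
)

cursor = conn.cursor()
cursor.execute(
    "SELECT TOP 10 product, SUM(units) AS total_units "
    "FROM sales GROUP BY product ORDER BY total_units DESC;"
)
for product, total_units in cursor.fetchall():
    print(product, total_units)

conn.close()
```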

Business Intelligence Services

Microsoft’s popular BI tools, such as Excel and PowerBI, can readily use Azure data management services as a back end. Users deploy the tools locally and configure them to use the Azure data source; the application performs complex computations in the cloud and transfers a result set to the local user.

Machine Learning Services

Microsoft Azure Machine Learning is an offering from Microsoft that includes a browser-based client, machine learning algorithms in the cloud, APIs for Python and R, and a marketplace for applications.

Azure Machine Learning Studio is an interactive drag-and-drop development environment enabling users to build, test, and deploy machine learning applications. The service works with data in a wide range of formats, including text files, Hive tables, SQL tables, R objects, and many others. Azure Machine Learning supports feature engineering and a wide range of algorithms for regression, classification, clustering, and anomaly detection. Users can embed custom Python and R modules in a machine learning pipeline and deploy models as a web service.
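Once a model is published as a web service, any client that can issue an authenticated HTTP request can score against it. The sketch below uses the requests Python library; the endpoint URL, API key, and input schema are placeholders rather than the actual service contract.

```python
# Score a record against a model deployed as an Azure ML web service.
# The endpoint URL, API key, column names, and response layout are
# hypothetical placeholders; consult the service's own API documentation.
import requests

endpoint = ("https://example.services.azureml.net/workspaces/1234/"
            "services/5678/execute?api-version=2.0")
api_key = "..."  # issued when the web service is published

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["age", "plan", "tenure_months"],
            "Values": [["42", "premium", "18"]],
        }
    },
    "GlobalParameters": {},
}

response = requests.post(
    endpoint,
    json=payload,
    headers={"Authorization": "Bearer " + api_key},
)
print(response.json())
```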

Microsoft prices Azure Machine Learning at a flat monthly rate per seat, plus hourly charges to use Machine Learning Studio and the Machine Learning API.

Google Cloud Platform

Google Cloud Platform (GCP) is a relative latecomer to the public cloud market, first offering storage services to the public in 2010 and compute services in 2012. Since that time, however, Google has progressively added services and now offers a nearly complete platform for business analytics.

Google does not currently offer a managed service for BI and visualization. However, many such tools can connect to Google BigQuery (discussed later in this chapter) and use it as a data source.

Storage Services

Google Cloud Storage (GS) is a storage service comparable to the AWS S3 and Azure storage services. GS stores objects up to five terabytes.

Like AWS, GS offers storage in three options, priced according to availability. Standard storage, the most expensive, offers immediate access without retrieval fees. Durable Reduced Availability storage is slightly less expensive than Standard storage, with lower guaranteed availability. Nearline storage offers the lowest monthly charge per gigabyte, but Google charges users for each retrieval and guarantees availability at the lowest level.

Unlike Microsoft, Google does not distinguish among types of objects stored.
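Working with GS follows the same pattern as the other providers’ object stores. Below is a minimal sketch with the google-cloud-storage Python client; the bucket and object names are hypothetical.

```python
# Upload a local file to Google Cloud Storage. Bucket and object names
# are hypothetical; credentials come from the standard Google auth chain.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-analytics-staging")
blob = bucket.blob("staging/2016/04/daily_extract.csv")
blob.upload_from_filename("daily_extract.csv")
```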

Compute Services

The Google Compute Engine (GCE) offers virtual machines (“instances”)21 from the same global infrastructure that runs Google’s branded services, such as Gmail and YouTube.

GCE categorizes instances as predefined or custom. Predefined instances have preset virtualized hardware properties at a set price. There are four classes of predefined instances: Standard, Shared Core, High Memory, and High CPU. Standard instances range from 1 to 32 cores; High Memory and High CPU instances range from 2 to 32 cores.

Users specify custom instances to include an even number of virtual cores up to 16 or 32 (depending on the user’s region), and from 0.9 gigabytes (GB) to 6.5 GB of memory per virtual core (in multiples of 256 megabytes). Pricing depends on the number of virtual CPUs and memory.

At a steep discount, Google offers pre-emptible instances, which Google may reclaim at any time; the user receives a 30-second warning to permit graceful shutdown.

GCE offers instances with Debian, CentOS, CoreOS, SUSE Linux, Ubuntu, Red Hat Linux (RHEL), FreeBSD, and Windows. There are extra charges for RHEL, SUSE, and Windows.

While AWS and Microsoft charge for compute services by the hour, Google charges by the minute, with a 10-minute minimum. Sustained use discounts apply to instances used for specified percentages of the billing month. Google charges the same rates in all regions.

Hadoop Services

Google Cloud Dataproc is a managed service for Hadoop and Spark. Google first offered the service in beta in September 2015 and released22 it to general availability in February 2016.

Cloud Dataproc includes core Apache Hadoop (MapReduce, HDFS, and YARN), Spark, Hive, Pig, and connectors to other GCP services, including Cloud Storage, BigQuery (discussed later in this chapter), and BigTable (discussed later in this chapter), all deployed on Debian. Google integrates the components in an image, updating image versions with major or minor releases to reflect releases and patches for any of the components. Users may select older versions to create new clusters for up to 18 months after version release.

As with AWS EMR and Azure HDInsight, Cloud Dataproc is fully elastic. Users can add and drop clusters or resize them as necessary.

Google charges one cent per hour for each virtual CPU in the Cloud Dataproc cluster.23

Database Services

For relatively small relational database applications, Google offers Cloud SQL, a managed service featuring the MySQL database. The service is comparable to the AWS Relational Database Service and Microsoft Azure SQL Database. Google charges an hourly rate per database instance scaled according to the size of the instances upon which the database is deployed. Storage is an extra monthly charge, and there are network charges for egress (traffic leaving the instance).

Google BigTable is a petabyte-scale high performance data management system. It is not a relational database, but a massively scalable hypertable or multi-dimensional datastore. BigTable supports APIs for HBase and Google’s Go programming language, and a connector to Google Cloud Dataflow, Google’s general programming framework.24

BigTable does not support SQL, and is therefore not comparable to AWS Redshift or Microsoft Azure SQL Data Warehouse. For scalable SQL, Google recommends Google BigQuery, an SQL engine that works directly with Google Storage. BigQuery is an implementation of Dremel,25 a Google core technology.26 Google first placed Dremel in production in 2006 and uses it today for many applications.

BigQuery uses columnar storage and a tree architecture for dispatching queries. (We discussed columnar serialization in Chapter One.) Columnar storage enables a very high data compression ratio and minimizes scan time for analytic queries. Google’s query tree architecture enables BigQuery to distribute queries and collect results over thousands of machines.

Google charges users for storage (at a rate equivalent to Google Cloud Storage) and a standard rate of $5 per terabyte (TB) for queries; the first TB is free. Certain computationally intensive queries do not qualify for the standard rate; Google classifies these as high-compute queries and prices them individually. Google does not disclose the computing limit for standard pricing. If the query exceeds the threshold, Google informs the user and provides a cost estimate; the user must expressly opt in to run the query.

While BigQuery provides functionality that is similar in many respects to Redshift and SQL Data Warehouse, Google’s pricing model is quite different. Users of the AWS and Microsoft services determine the computing resources to be used and pay for what they request. Google BigQuery users pay for actual query processing volume, while Google determines the computing resources used for the query.
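Because BigQuery bills for data processed rather than for provisioned capacity, estimating the cost of a query amounts to asking the engine how many bytes it would scan. The sketch below uses the google-cloud-bigquery Python client to dry-run a query and convert bytes scanned into an approximate charge at the standard rate; the dataset and table are hypothetical, and the client API shown postdates the period described here.

```python
# Estimate the cost of a BigQuery query before running it, using a dry run
# to obtain bytes scanned. Dataset and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT region, SUM(revenue) AS total_revenue
    FROM `example_dataset.sales`
    GROUP BY region
"""

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

tb_scanned = job.total_bytes_processed / 1e12
print(f"Query would scan {tb_scanned:.3f} TB")
print(f"Approximate cost at $5/TB: ${tb_scanned * 5:.2f}")
```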

Machine Learning Services

Google Cloud Machine Learning is a managed service for Deep Learning. Users define models with the TensorFlow framework released to open source by Google in 2015. (We discuss TensorFlow in Chapter Eight.)

Cloud Machine Learning integrates with Cloud Dataflow for preprocessing and works with data held in Google Cloud Storage, BigQuery, and other data sources.

As of April 2016, the service is in Limited Preview, and Google has not released pricing.

Google Cloud Vision API is a managed service that supports label detection, optical character recognition, explicit content detection, facial detection, landmark detection, logo detection, and image properties. The first 1,000 units are free; above that level, Google charges a price per 1,000 units. A unit is one feature, or service, applied to one image.
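Each unit is a single feature applied to a single image, requested through a simple REST interface. The sketch below sends one image for label detection using the requests Python library; the API key and image file are placeholders.

```python
# Request label detection for one image from the Cloud Vision REST API.
# The API key and image file are placeholders; one image with one feature
# counts as one billable unit.
import base64
import requests

API_KEY = "..."  # hypothetical API key
url = "https://vision.googleapis.com/v1/images:annotate?key=" + API_KEY

with open("storefront.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "requests": [
        {
            "image": {"content": image_b64},
            "features": [{"type": "LABEL_DETECTION", "maxResults": 5}],
        }
    ]
}

response = requests.post(url, json=body)
for annotation in response.json()["responses"][0].get("labelAnnotations", []):
    print(annotation["description"], annotation["score"])
```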

Google Cloud Speech API uses Deep Learning to convert audio to text. The API recognizes more than 80 languages and dialects and works with data uploaded directly or stored in Google Cloud Storage. The service is in Limited Preview.

Google Translate API offers a simple programming interface for rapid translation of any text into one of more than 90 supported languages. The translation engine detects the source language when it is not known in advance. Pricing is $20 per million characters.

Google also offers the Google Prediction API, a service that works with uploaded training data in CSV format. Users can elect to work with Google’s “black box” training algorithm or select a technique from a library of hosted models. The service is free for the first six months up to defined usage limits. After the free period has expired, Google charges a base monthly fee per Google Cloud Platform Console project, plus separate fees for model training and prediction. Pricing does not include the cost of Google Cloud Storage services to hold the training data.

The Disruptive Power of the Cloud

The technologies that enable cloud computing aren’t new. It is the cloud business model—elastic, “pay for what you use” computing—that is disrupting the technology industry and, by extension, the leading business analytics providers.

It’s clear that cloud computing is disrupting the computer hardware industry. In 2014, the Natural Resources Defense Council commissioned a study of data center energy efficiency. The study found27 that servers in the cloud operate at about 65% of capacity, while on-premises servers operate at 12% to 18% of capacity. On-premises servers are used less because the organizations that operate them build capacity to meet peak demand, so the servers are idle most of the time.

If businesses shift peak workloads to the cloud, they don’t buy as many servers. This is already happening, and companies whose businesses depend on selling computer hardware to other businesses are struggling:

  • IBM reports28 a 22% decline in server systems sales.

  • Chipmaker Intel plans29 to slash 12,000 jobs on disappointing sales.

  • Storage vendor EMC reports30 declining product sales.

Interestingly, while Intel’s other businesses are soft, sales to cloud computing vendors are up 9%.31

Cloud computing disrupts the commercial software industry as well, for two reasons. First, many organizations aren’t loyal to their software vendor; they’re stuck. Cloud providers generally offer a variety of software options in each category, including open source and low cost “private label” software under their own brand. Opting to migrate to the cloud often puts an organization’s software choices in play, encouraging switching.

The second source of disruption is elastic computing, and the notion that customers should only pay for what they use. The conventional software licensing model requires the customer to purchase a perpetual license or, at minimum, an annual term license; the cost of the license is sunk whether the customer uses it fully or not. The revenue model for many software vendors requires them to “stuff the channel” by loading up customers with software they will only partially use. If customers pay for what they use—and only what they use—they will pay a lot less.

Many commercial software vendors now support their software running in one of the top three cloud services. However, few offer elastic pricing, precisely because they fear that it will cannibalize their primary revenue model. In other words, commercial software vendors know that they have overlicensed their customers.

With cloud computing, it is now possible to build a complete business analytics platform entirely from services offered by the top three cloud computing vendors. With some technical skill, an analytics team can supplement vendor managed services with open source software to create a platform customized to meet the needs of any project. Moreover, this platform can scale out as needed to meet demand.

In Chapter Ten, we’ll discuss such a platform in more detail, together with information about value-added managed services providers who provide complete business analytics “stacks” in the cloud.

Footnotes

19 All pricing is as of May 2016.

21 Google uses the term “virtual machine” to refer to what AWS calls an “instance”. For consistency, we use the term “instance”.

23 Pricing as of May 2016.

24 Google donated Cloud Dataflow to Apache, where it is incubating as Apache Beam.

26 Dremel is also the foundation of Apache Drill, an open source SQL engine.
