5. In-Memory Analytics

Satisfying the Need for Speed

“In-memory analytics” is a misnomer: all analytics run in memory and have always done so. Two things distinguish modern in-memory analytics:

  • The ability to persist large amounts of data in memory, so it is immediately available for analysis, without a disk read operation.

  • Scale-out capability, which is the ability to distribute large in-memory workloads over many servers.

The growth of modern in-memory analytics is partly attributable to technical innovations in database design. Database architects have developed ways to address the inherent volatility of in-memory data structures through replication, snapshotting, and other means to ensure fault tolerance and data durability.

Arguably, though, the principal force behind in-memory analytics is the declining cost of memory. While memory remains roughly two orders of magnitude more expensive than disk storage, costs have declined by more than four orders of magnitude since 1990, as shown in Figure 5-1.

Figure 5-1. Historical cost of memory and storage (Source: McCallum and Blok, hblok.net/storage)

Since it is very unlikely that we need immediate access to all of our data all of the time, it makes good sense to design systems with more durable storage than memory.1 This is a key issue in analytics architecture: how to balance the tradeoff between the speed of in-memory operations and the additional cost.

In the first section of this chapter, we discuss the growth of in-memory technology for relational databases. In the second section, we discuss open source memory-centric processing frameworks, including Apache Spark, memory-based file systems, and memory caching frameworks.

In-Memory Databases

In its simplest form, an analytic operation has three parts:

  • Read data from durable storage.

  • Perform a computation in memory with the data.

  • Write results back to durable storage.

Computers perform analytic computations in random-access memory (RAM). The actual technology used for memory has changed significantly since the earliest computers, from hard-wired circuitry to single in-line memory modules (SIMMs), to the dual in-line memory modules (DIMMs) used today. But the principle is the same: all computations take place in some kind of memory.

DIMMs, like most previous memory technologies, are volatile: we lose the information in memory if the machine loses power. To avoid this loss, we save the product of the computation to persistent and durable storage that retains information without power. In the relational database era, that storage is usually a disk drive.

RAM operates much faster than the input/output (I/O) operations that move data back and forth from storage to memory. Thus the total time to complete the analytic operation depends mostly on the time needed to read and write. It takes a fraction of a second to move an item of data from disk to memory; for most conventional transaction processing workloads requiring single-record lookup, this data transfer does not create a performance bottleneck if the database is properly configured and tuned.

On the other hand, for the analysis of large data sets, the cumulative effect of this internal data movement across millions or billions of data items is significant. The problem is even more serious when the size of the data set needed for computations exceeds available memory. In some analysis software, the operation simply fails; in others, the software swaps data from memory to disk, seriously impairing performance.2

Columnar serialization, discussed in Chapter One, mitigates the problem by organizing the stored data in a manner that expedites its retrieval. However, it does not eliminate the I/O bottleneck, which remains substantial at terabyte and petabyte data volumes.

Caching objects in memory is one way to eliminate the bottleneck. If a task requires a series of computations on the same data, we can save time by keeping the data in memory (caching) until we reach the end of the series. At that point, we can either truncate (drop) the objects from memory to make room for the next task, or we can retain them just in case the user decides to run the task again. Either way, however, our ability to use caching depends on the amount of memory available.

While caching reduces the need for subsequent read operations, it does not eliminate the initial read. A more sophisticated form of caching is proactive or predictive: instead of waiting for a process to request data, it anticipates the request (based on historical usage patterns, for example) and loads the data into memory. This approach may not help the ad hoc user on every problem, but it speeds up the most frequently used queries.

Of course, if we maintain objects in a memory cache, we need to be concerned about keeping the data consistent with the disk datastore. For this, we need write-through, read-through, and write-behind capabilities. Write-through means the in-memory cache propagates in-memory updates to the disk datastore. Read-through means that if an operation requests data from the cache that it has not previously loaded, the cache retrieves the item from the disk datastore. Write-behind means the cache updates the disk datastore asynchronously; in other words, it completes the in-memory update for the requesting application and updates the disk datastore in the background.
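To make these three policies concrete, the following minimal Scala sketch layers a cache over a backing store. The Store trait, its methods, and the class name are illustrative assumptions, not drawn from any particular product.

    import scala.collection.mutable
    import scala.concurrent.{ExecutionContext, Future}

    // Hypothetical interface to a durable disk datastore.
    trait Store {
      def read(key: String): Option[String]
      def write(key: String, value: String): Unit
    }

    class CachingStore(backing: Store)(implicit ec: ExecutionContext) extends Store {
      private val cache = mutable.Map[String, String]()

      // Read-through: on a cache miss, fetch from the backing store and retain.
      def read(key: String): Option[String] =
        cache.get(key).orElse {
          val value = backing.read(key)
          value.foreach(cache(key) = _)
          value
        }

      // Write-through: update memory, then propagate to disk before returning.
      def write(key: String, value: String): Unit = {
        cache(key) = value
        backing.write(key, value)
      }

      // Write-behind: update memory now; persist asynchronously in the background.
      def writeBehind(key: String, value: String): Unit = {
        cache(key) = value
        Future { backing.write(key, value) }
      }
    }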

A full in-memory database copies the entire database into memory, using write-through, read-through, and write-behind operations to maintain consistency with a mirrored database on disk. This mitigates the need for predictive caching, since all of the data resides in memory all of the time. Of course, as with all relational databases, the organization chooses what data to include in the database schema and what to exclude; thus, an in-memory database does not eliminate the need to organize data and set priorities.

Three key constraints limited the deployment of in-memory databases prior to 2010—cost, availability of memory, and durability.

As noted at the beginning of this chapter, the cost of memory has declined radically in absolute terms; between 2003 and 2014, the cost per terabyte of in-memory storage declined by 95%. This opens up use cases for in-memory databases that were infeasible as recently as ten years ago.

The cost of disk storage has declined by an equal amount, and memory remains two orders of magnitude more expensive than disk storage. Thus, there is still an economic incentive to set priorities for data, maintaining only the most valuable data in memory.

Also, keep in mind that an in-memory database is actually two databases: the durable database on disk and its mirror in memory; the database architect does not choose between disk and memory, but balances the proportions of both. Combined with the extra cost of sophisticated tools to ensure durability, in-memory databases remain much more expensive per terabyte than any other mainstream data management technology.

A second constraint is the availability of memory in sufficient quantities to support large databases. In the 1990s, server vendors shipped machines with a few gigabytes of memory at most; they have increased the maximum supported memory to the terabyte range, but data volumes are expanding much faster than available memory.

To address this constraint, database architects distribute data over many servers. We discussed the general principles of distributed architecture in Chapter Four. Distributed databases date back to the 1970s, but have progressively become mainstream as engineers solve key architectural problems. The growing adoption of Hadoop and distributed NoSQL databases reflects the increasing acceptance of scale-out architecture for large-scale data management.

The term ACID is an acronym for Atomicity, Consistency, Isolation, and Durability, a set of desirable properties for database transactions.

The third issue for in-memory databases is durability of the data: the “D” of ACID. Database designers use a number of techniques to ensure that organizations do not lose data when a database shuts down:

  • Snapshots or checkpoints, which record the state of the database at a moment in time

  • Transaction logging, which records changes to the database in a journal

  • Non-volatile RAM, memory that retains information when powered down and can reliably reproduce the state of memory after a shutdown

  • Replication of the data on clustered computers, with automatic failover in the case of node failure

Snapshot-based recovery is only as good as the most recent snapshot: any activity since that snapshot is lost. To limit data loss, snapshots must be frequent. The growth of real-time database updates and high-velocity data renders this approach obsolete.

Transaction logging is a reasonable alternative to snapshots. However, in a database with a heavy workload, replaying the log to reconstruct the state of the database after a failure can take considerable time.
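To illustrate how these techniques combine, here is a minimal Scala sketch of an in-memory key-value store that pairs checkpoints with a write-ahead journal; recovery reloads the last snapshot and replays the log. The file formats and names are illustrative assumptions, not any vendor's implementation.

    import java.io.{File, FileWriter, PrintWriter}
    import scala.collection.mutable
    import scala.io.Source

    class DurableStore(snapshotFile: String, logFile: String) {
      private val data = mutable.Map[String, String]()

      def put(key: String, value: String): Unit = {
        // Write-ahead: journal the change before applying it in memory.
        val log = new PrintWriter(new FileWriter(logFile, true))
        try log.println(s"$key\t$value") finally log.close()
        data(key) = value
      }

      def get(key: String): Option[String] = data.get(key)

      // Checkpoint: persist the full state, then truncate the journal.
      def snapshot(): Unit = {
        val out = new PrintWriter(snapshotFile)
        try data.foreach { case (k, v) => out.println(s"$k\t$v") } finally out.close()
        new PrintWriter(logFile).close() // journal entries are now redundant
      }

      // Recovery: reload the last snapshot, then replay the journal in order.
      def recover(): Unit = {
        data.clear()
        for (file <- Seq(snapshotFile, logFile) if new File(file).exists();
             line <- Source.fromFile(file).getLines() if line.nonEmpty) {
          val Array(k, v) = line.split("\t", 2)
          data(k) = v
        }
      }
    }

The tradeoff described above is visible in the sketch: frequent snapshot() calls shorten the journal and speed recovery, at the cost of repeatedly writing the full state to disk.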

The cost of non-volatile, or flash, storage has declined even more than that of volatile memory, to the point that it is now cheaper per terabyte than conventional (DIMM) memory. Flash has some disadvantages for enterprise applications (notably, it wears out under frequent writes), so its use in analytic datastores is still largely experimental. In 2015, researchers at MIT demonstrated3 the promise of this technology with a cluster of flash-based servers, claiming performance and cost effectiveness comparable to RAM-based servers.

High availability through replication is an important byproduct of a distributed scale-out architecture. Hadoop’s original developers envisioned deployment on large clusters of commodity hardware and expected a high incidence of node failures. Consequently, they built redundancy into the Hadoop Distributed File System (HDFS), which stores three replicas of each block distributed across the cluster.

Most analytic databases work in read-only mode; in other words, users consume and analyze data but rarely create it or write back into the datastore. Hence, many data warehouses operate on a “backup is reload” principle: if the database goes down, it will be restored simply by repeating the load process that created it in the first place. Consequently, the “durability” issue that is so critical for transactional databases is less compelling for analytic datastores.

More memory capacity is now available to support in-memory databases. For scaling up on a single machine, server manufacturers now offer terabytes of memory capacity. For scaling out, distributing in-memory databases across multiple machines is now an option. In-memory databases cannot yet support the very largest data volumes, but enterprise software vendor SAP has demonstrated4 databases of 100TB on its HANA appliance.

The leading business analytics vendors have largely assimilated in-memory technology.

Oracle, the leader in data warehouse platform revenue, positions its Oracle Database as a hybrid technology, using the term System Global Area to refer to its shared-memory region. For an additional fee, Oracle customers can license Oracle Database In-Memory, a columnar in-memory datastore closely integrated with Oracle Database. When deployed on an Oracle appliance, Oracle mirrors the data on multiple nodes for fault tolerance and durability. Oracle also markets Coherence, an in-memory data grid acquired in 2007.

Enterprise software vendor SAP has focused its data warehousing strategy around its HANA in-memory columnar database. SAP developed HANA by consolidating numerous acquired technologies, including the TREX columnar search engine, the P*TIME in-memory OLTP platform, and the MaxDB in-memory caching technology. SAP’s early adoption of in-memory technology, combined with its large base of existing customers, has contributed to its double-digit revenue growth in data warehousing.

Characteristically, IBM has spread its in-memory investments across many platforms and products. IBM BLU Acceleration is a bundle of technologies available for deployment with DB2, Informix, or as a service in the cloud (branded as DashDB). BLU includes a columnar in-memory datastore with data compression and hardware-based CPU acceleration.

Microsoft SQL Server 2014 included a capability to retain entire tables in memory. Branded alternatively as Hekaton and SQL Server In-Memory OLTP, it is designed for transaction processing workloads rather than analytical ones.

Teradata does not offer a complete in-memory database. Instead, it offers what it brands as Teradata Intelligent Memory, a predictive caching capability that tracks data usage and keeps the most heavily used data in memory.

Among startups leveraging in-memory database technology, the most promising build on an open core business model and target new markets. Two startups, MemSQL and VoltDB, stand out in the so-called NewSQL category, targeting Hybrid Transactional/Analytical Processing (HTAP) use cases. Both companies offer an open source version of their in-memory database and stress integration with open source Hadoop, Spark, and Kafka. These companies are betting that the future of real-time analytics rests in the open source ecosystem.

MemSQL, founded in 2011 and funded by leading venture capitalists, first released its product in June 2012. The company’s eponymous database software is a distributed in-memory platform that uses write-ahead logs and database snapshots to preserve data durability. For advanced analytics, the product supports geospatial functions and an interface to Apache Spark. In its 2015 analysis of in-memory database vendors, Forrester estimates5 that MemSQL had about 50 customers.

VoltDB commercializes H-Store, an academic project led by MIT’s Michael Stonebraker. Like MemSQL, VoltDB is a distributed in-memory database, with transaction logging and snapshots to ensure data durability. VoltDB 5.0, released in 2015, includes import and export integrations for Kafka and import integration with HP Vertica, as well as support for Apache Hive and Apache Pig.

Other open source in-memory databases include Aerospike, Apache Geode, Hazelcast, MonetDB, and Redis. Proprietary offerings include EXASolution from EXASOL and the Kognitio Analytical Platform from Kognitio.

Apache Spark

Apache Spark is an open source system for fast and general large-scale data processing. It provides a runtime environment for high-performance, low-latency execution in several forms, including exploration, stream processing, ad hoc SQL, machine learning, and graph analytics. Spark provides users with a fault-tolerant and implicitly parallel interface for manipulating distributed data.

The foundation of Spark is an abstract data structure called the Resilient Distributed Dataset, or RDD. RDDs are read-only partitioned collections of records distributed over a cluster of machines. Spark creates RDDs through deterministic operations on stable data or on other RDDs. RDDs include information about data lineage together with instructions for data transformation and persistence. They are fault tolerant: if a partition is lost, Spark can reconstruct it from its lineage.

Spark users can either retain the RDD in memory or write the results to persistent storage. This contrasts sharply with MapReduce, which requires the user to write data to storage at the end of each Reduce operation. This persistence in memory makes it possible to write iterative algorithms, query data interactively, or perform streaming operations.
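A minimal Scala sketch of this pattern appears below; the input path and filter predicates are illustrative assumptions. The two actions at the end reuse the cached RDD instead of re-reading the file, which is precisely what MapReduce cannot do.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val lines  = sc.textFile("events.log")          // RDD defined over stable storage
    val errors = lines.filter(_.contains("ERROR"))  // new RDD; lineage recorded
    errors.cache()                                  // retain in memory across actions

    // Both actions reuse the in-memory RDD; no second read from disk.
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count())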

More than 1,000 developers have contributed to Spark (through April 2016). With more than 30,000 commits over the lifetime of the project and more than 10,000 in 2015 alone, Spark is the most active Big Data project today.

Matei Zaharia and colleagues at AMPLab, a research collaborative at the University of California, Berkeley, launched Spark in 2009 as a research project for machine learning with large data sets. In early 2010, they released the software to open source under a BSD license. The University of California donated the software assets to the Apache Software Foundation, which accepted the project for Incubator status in June 2013. In February 2014, Spark graduated to top-level Apache status.

Databricks, a commercial venture founded in 2013, leads development of Apache Spark. In two rounds of venture funding, Databricks raised a total of $47 million through early 2016. In the most recent round, in June 2014, a group led by New Enterprise Associates, Andreessen Horowitz, and Data Collective invested $33 million in a Series B round6.

Spark follows the standard Apache governance model. A Project Management Committee (PMC) of 35 members (as of April 2016) oversees development of the project. Databricks employees hold 13 of the 35 seats; other entities represented include the University of California, Berkeley; Cloudera; Yahoo; IBM; Intel; and eight other organizations. As of April 2016, there are 44 Spark committers, of whom 17 are Databricks employees and 5 are affiliated with the University of California, Berkeley.

Spark development follows the Apache voting process, where changes to the code are approved through consensus of committers. The team uses a review-then-commit model, where at least one committer other than the patch author reviews and approves the patch before it is merged, and any committer may vote against it.

The PMC has designated individual committers as maintainers for certain modules to ensure consistent design for public APIs and complex components. Maintainers for impacted modules review patches before code merge.

As is the case for all Apache projects, the reference code is available directly from the Apache Software Foundation. All major Hadoop distributors include Spark in their distributions: Cloudera, Hortonworks, MapR, IBM, and Amazon Web Services. In September 2015, Cloudera announced what it calls the One Platform Initiative, under which it plans to make Spark the primary computing platform in its Hadoop release, relegating MapReduce to secondary status.

Databricks has certified additional distributions from a number of vendors, including:

  • BlueData, a private cloud vendor

  • DataStax, distributor of the Cassandra NoSQL datastore

  • Guavus, an operational intelligence vendor

  • Huawei, a telecommunications solutions vendor

  • Lightbend, formerly Typesafe, developers and distributors of the Scala language

  • Oracle, for the Oracle Big Data Appliance

  • SAP, an enterprise software vendor

  • SequoiaDB, a NoSQL datastore distributor

  • Stratio, a Big Data platform vendor

  • Transwarp, a Shanghai-based Big Data vendor

Spark users can deploy the software on a single machine; in a free-standing cluster; in Hadoop, running under YARN; on Apache Mesos, a distributed resource manager; on cloud platforms, including Amazon Web Services, Microsoft Azure, Google Compute Engine, and OpenStack; and in Docker or Kubernetes containers. For users who prefer not to provision and install the software themselves, a number of providers offer Spark as a managed service, including Altiscale, Amazon Web Services, Databricks, Google, IBM, Microsoft, and Qubole.

Spark has no native file system. Instead, it includes adapters that enable it to work with many data platforms:

  • Hadoop Distributed File System (HDFS)

  • Cloud datastores, such as AWS S3 and Redshift

  • Relational databases, such as MySQL, PostgreSQL, and any JDBC-compliant RDBMS

  • Common Hadoop formats, such as ORC, Parquet, and Avro files

  • NoSQL datastores, such as Cassandra, Cloudant, Couchbase, HBase, MongoDB, and SequoiaDB

  • Streaming sources, such as Apache Kafka

  • In-memory file systems, such as Alluxio

  • Search engines, such as Elasticsearch

  • Connectors to applications, such as Salesforce.com and SAS

  • Mainframe data

  • ESRI Magellan geo-spatial libraries

  • Miscellaneous data sources, such as cookie data sets, Google spreadsheets, and many others

Capabilities

The Spark Core provides foundational capabilities, including task dispatching, scheduling, and basic input/output. Working through the Spark APIs, users invoke parallel operations on RDDs by passing functions to Spark, which schedules execution in parallel on the cluster.

Each operation takes one or more RDDs as input and produces new RDDs. Spark keeps track of the lineage of RDDs, so it can reconstruct any operation if one or more machines fail. Spark uses lazy evaluation, which means that it postpones evaluation of an expression until it is needed; this improves performance and minimizes memory usage, because it avoids needless calculations.
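Reusing the SparkContext sc from the earlier sketch, the following hedged example shows lazy evaluation at work: the map and filter transformations merely record lineage, and nothing is computed until the reduce action runs.

    // Transformations are lazy: no data is generated or processed here.
    val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)
    val even    = squares.filter(_ % 2 == 0)

    // The action triggers evaluation of the entire pipeline at once.
    val total = even.reduce(_ + _)
    println(total)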

Users interact with Spark through programming interfaces for Scala, Java, Python, and R. The Spark functions available in each API are somewhat different; the Scala API is the most highly developed, while the R API is least developed.

In addition to the Spark core for distributed data processing, the Spark project includes four libraries:

  • Spark SQL, a set of tools for working with structured data

  • Spark Streaming, for streaming analytics

  • Spark MLlib, for machine learning

  • GraphX, for graph-parallel processing

The project also includes the Spark Packages library, with more than 200 additional packages7 contributed by third-party developers.

SQL Processing

Spark SQL is a component of Apache Spark that supports SQL-like processing of structured data. Respondents to a survey8 of Spark users by startup Databricks in 2015 report that they use Spark SQL and supporting DataFrames more than any other Spark component.

In 2011, developers at the University of California at Berkeley’s AMPLab began work on a SQL engine called Shark. The Shark project started with the Hive code base and replaced the execution engine with Spark’s in-memory processing. This approach yielded significant improvements to query runtime; however, the team found Hive’s large code base to be unwieldy and difficult to optimize.

The Spark development team introduced9 Spark SQL as an alpha component in May 2014. In July 2014, the development team announced10 that they would abandon further development of Shark and focus all future effort on Spark SQL, which graduated11 from Alpha in March 2015, with Spark Release 1.3.

Spark SQL enables users to combine the concise and declarative syntax of SQL with the power of procedural programming languages. It accomplishes this through two components: the DataFrame API, which supports relational (SQL) operations, and the Catalyst optimizer, an engine that converts SQL expressions to efficient Spark operations.

DataFrames are distributed collections of structured data with named columns; they are an abstraction for selecting, filtering, aggregating, and plotting structured data. Spark users manipulate DataFrames with Spark’s procedural API or with a relational (SQL) API.12

Spark users create DataFrames from existing Spark Resilient Distributed Datasets (RDDs), from Hive tables, or directly from data sources. Spark supports native integration with Parquet data sets, JavaScript Object Notation (JSON) data sets, Hive tables, and relational databases through Java Database Connectivity (JDBC).

Users interact with DataFrames directly with SQL, with a programming tool, with another Spark component, or with an external BI tool. For SQL users, Spark SQL currently supports HiveQL syntax, including UDFs and UDAFs. Other Spark components, such as the machine learning library, can create and use DataFrames. BI tools like Tableau, Zoomdata, and Qlik can interact with DataFrames through a standard JDBC connector.
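The sketch below expresses one illustrative query both ways, first through the procedural DataFrame API and then as SQL against a temporary view. It assumes the Spark 2.0 SparkSession entry point; the file and column names are assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()

    // Create a DataFrame directly from a JSON data source.
    val customers = spark.read.json("customers.json")

    // Procedural (DataFrame) style...
    customers.filter(customers("state") === "MA")
             .groupBy("city")
             .count()
             .show()

    // ...or the equivalent declarative SQL, via a temporary view.
    customers.createOrReplaceTempView("customers")
    spark.sql("SELECT city, COUNT(*) FROM customers WHERE state = 'MA' GROUP BY city").show()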

The Catalyst optimizer, implemented in Scala, converts the logic of a SQL expression into an optimal physical plan for execution in Spark. Separating the logical and physical plans enables Spark’s developers and third parties to readily add new data sources as well as new language bindings. The Catalyst optimizer also ensures consistent performance across language APIs. The optimizer itself is easy to extend and enhance by adding new optimization rules and code-generation methods.

For streaming in SQL, developers at Intel contributed13 an open source library that works with Spark SQL. This library supports time-based windowing aggregation and joins. As of late 2015, it supports the Scala interface only.

Streaming Analytics

Spark Streaming is an extension of Spark, added in 2013, that supports fault-tolerant processing of live data streams with high throughput.

Spark Streaming ingests data from streaming sources, processes the data, and pushes it to target systems. Streaming sources can include Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, among others. Processing includes complex transformations expressed as high-level functions such as map, reduce, join, and window. Target systems can include file systems, databases, BI systems for live visualization, or other Spark libraries.

To process streaming data, Spark Streaming divides the data into microbatches, which it processes through the Spark engine. Users can define the duration of the batch window down to half a second. Spark Streaming provides high-level abstractions called DStreams, or discretized streams, represented as a sequence of Resilient Distributed Datasets (RDDs), which Spark holds in memory.

Examples of Spark Streaming’s transformations include:

  • Map: Returns a new DStream by passing each element of the source DStream through a specified function.

  • Filter: Returns a new DStream by selecting only those records of the source DStream for which a function evaluates as true.

  • Union: Returns a new DStream that contains the union of the elements in the source DStream and a second specified DStream.

  • Count: Returns a new DStream by counting the number of elements in each RDD of the source DStream.

  • Reduce: Returns a new DStream by aggregating the elements in each RDD of the source DStream using an associative function.

Transformations create new DStreams and RDDs without altering the source DStreams and RDDs. Thus, Spark can reconstruct the data with all changes in the event of a system or node failure.
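The sketch below strings several of these transformations together over one-second microbatches; the socket source and port are illustrative assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("stream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))   // one-second batch window

    val lines  = ssc.socketTextStream("localhost", 9999)
    val words  = lines.flatMap(_.split(" "))            // map-style transformation
    val longer = words.filter(_.length > 3)             // filter
    val counts = longer.map((_, 1)).reduceByKey(_ + _)  // reduce within each microbatch

    counts.print()
    ssc.start()
    ssc.awaitTermination()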

Spark 2.0, released in spring 2016, embraces a new approach to streaming with Structured Streaming, which handles low-latency streaming, interactive, and batch workloads in a single API. Structured Streaming defines a stream as a high-level concept, similar to a table in SQL. The user simply specifies a stream as the target of an operation; behind the scenes, the Spark optimizer routes the query to the appropriate static data or data stream, as required. While a query against a finite table ends when all records are processed, a query against a stream runs until the user terminates it.
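Under the Spark 2.0 API, a hedged sketch of the canonical streaming word count reads like ordinary DataFrame code; the socket source is again an illustrative assumption.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("structured-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // An unbounded input, treated like a table.
    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()

    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    // The query runs continuously until terminated by the user.
    val query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()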

Spark Streaming, like other Spark libraries, supports Scala, Java, and Python APIs.

Machine Learning

Spark has two native machine learning libraries and a set of third-party libraries under Spark Packages. The two native libraries are:

  • MLlib, the original API built directly on Spark RDDs

  • ML, a higher-level API built on Spark DataFrames

There is some functional overlap between the two libraries. The ML API, first introduced in Spark 1.2, is easier to use and recommended for new users. While the Spark team plans to continue to support MLlib, all new algorithms will be contributed to the ML library.

We cover the details of Spark’s machine learning capabilities in Chapter Eight.

Graph Analytics

Spark includes the GraphX graph engine. GraphX supports widely used graph analytics such as Page Rank, Connected Components, and Triangle Counting.

Like Apache Giraph, GraphX is inspired by the Google Pregel paper and uses the Bulk Synchronous Parallel model for graph analytics. Unlike Giraph, which is implemented in Java on MapReduce, GraphX runs on Spark. The cadence of development on GraphX is slower than it is for other Spark libraries.
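A minimal GraphX sketch, assuming an existing SparkContext sc and an illustrative edge-list file, runs PageRank to convergence and lists the highest-ranked vertices.

    import org.apache.spark.graphx.GraphLoader

    // Build a graph from a whitespace-delimited edge list (srcId dstId per line).
    val graph = GraphLoader.edgeListFile(sc, "followers.txt")

    // Run PageRank until scores converge within the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices

    // Top ten vertices by rank.
    ranks.sortBy(_._2, ascending = false).take(10).foreach(println)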

Spark in Action

To illustrate how organizations adopt and use Spark, we present 10 brief examples of Spark in action .

  • Barclays, a leading multinational banking and financial service company, uses Spark to support an “insights engine”: an application that combines hundreds of queries to compute Key Performance Indicators (KPIs) for a business banking client. With 1,296 queries against 700 million records for each of 275,000 clients, Barclays could not run this analysis in its Teradata data warehouse. In Spark, the analysis runs in 30 minutes.14

  • BlackRock, a leading investment and risk advisory firm, manages or advises asset and derivative portfolios valued at more than $8 trillion. Since the 2008 financial crisis, issuers have disclosed more detailed information about assets backed by mortgages, credit cards, and other types of consumer debt; however, this data tends to be very noisy, and conventional data quality tools do not scale. BlackRock uses a Spark-based framework to create, run, and manage data quality tests on terabyte-scale data sets.15

  • Comcast collects significant amounts of data about its customers, including usage clickstreams and contact events such as telesales and e-mails. The volume, variety, and velocity of this data make conventional machine learning algorithms impractical. Comcast uses Spark to detect anomalies in customer activity that may indicate service interruptions.16

  • Goldman Sachs has leveraged its core competency in analytics to build powerful data computation frameworks into its intra-company platform. Today, Goldman actively uses Spark to build data pipelines, manipulate data for reports, and tap streaming data. Working with Spark and R, users manipulate and reduce data for local analysis, plotting, and reporting, while other users run Python-based simulations in Spark.17

  • MediaMath, a digital media buying service, uses Monte Carlo simulations built into Spark to test digital advertisement lift and effectiveness. Cookies are randomly assigned to test or control, and those in test are exposed to ads while those in control are not. Spark provides MediaMath with the processing power to quickly test millions of trials over tens of millions of simulated consumers.18

  • Netflix used Apache Spark to build a customer simulator enabling it to “back-test” strategies and changes to its recommendation engine. The simulator allows the data science team to evaluate the impact of improvements to the algorithm, or simply to try new ideas. By using the simulator, testing places no stress on production services, and the data science team does not have to wait for history to accumulate.19

  • NBC Universal stores hundreds of terabytes of media files for international cable TV distribution; efficient management of this online resource is necessary to support distribution to international clients. The company uses Spark’s machine learning library to predict future demand for each item based on a combination of measures. Based on these predictions, the company moves media with low predicted demand to low-cost offline storage. The predictions from machine learning are far more effective than arbitrary rules based on single measures, such as file age. As a result, NBC Universal reduces its overall storage costs while maintaining client satisfaction.20

  • Novartis uses Spark to analyze data captured in assays, or laboratory experiments conducted to test hypotheses about the biology of a disease. Due to advances in biotechnology, a single screening assay used today can produce trillions of data points. Using Spark together with a visualization tool and Cassandra NoSQL datastore, Novartis performs complex analytics: normalization against controls, reduction of highly correlated features, multi-parametric classification, and more. The end result: faster analysis, shorter experimental cycles, and reduced time to discovery.21

  • Viacom, a global media company, found that the traditional data warehousing approach was too slow for its business. Traditional data warehousing calls for developers to structure and model data before it is released to end users. Instead, Viacom now uses Spark and Databricks running on Amazon Web Services to deliver “just-in-time” data warehouses. In this approach, end users define structure in the context of problems they need to solve; data quality issues are resolved as they surface; developers and end users collaborate to answer business questions interactively. The end result: faster time to value and increased engagement of business users with the data.22

  • The Weather Company (TWC) aggregates weather information from government agencies and distributes it to end users. Every day, it handles about 30 billion API requests from 120 million active mobile users, who generate 360 petabytes of traffic. TWC uses Spark Streaming, Cassandra, Parquet, and Spark SQL, all operating in the AWS cloud, to provide executives and managers with a real-time self-service platform for business intelligence and data science.23

Apache Arrow

In Chapter Four, we discussed the Apache Drill project, a schema-free SQL query engine built on the Google Dremel framework. The Apache Software Foundation accepted Drill as an Incubator project in September 2012.

While working toward a first release in September 2013, Drill’s developers identified a need to represent complex columnar data in memory. There were existing methods to represent structured data with a predefined schema in memory, and the Apache Parquet project had developed a way to represent complex columnar data on disk, but neither format met the needs of Drill’s developers. To address this gap, the Drill team developed a data structure called Value Vectors.

Recognizing that Value Vectors meet the needs of other data processing engines, in February 2016, the Apache Software Foundation announced Apache Arrow as a top-level project, bypassing the standard Incubator process. Committers to the project include developers from other Apache projects such as Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm.

Apache Arrow enables execution engines like Spark to take advantage of the latest single instruction, multiple data (SIMD) operations included in modern processors for fast analytical data processing. Columnar layout of data also allows for better use of CPU caches by placing all data relevant to a column operation in as compact a format as possible.
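The cache argument can be sketched without the Arrow API itself: summing one field over a collection of row objects chases pointers across the heap, while the same values packed into a single column array scan contiguous memory that the processor can prefetch. The record type and sizes below are illustrative assumptions.

    case class Trade(symbol: String, price: Double, size: Long)

    val rows: Array[Trade] = Array.fill(1000000)(Trade("X", math.random, 100L))

    // Row-oriented: each access dereferences a separate heap object.
    val sumRows = rows.map(_.price).sum

    // Column-oriented: one contiguous Array[Double] holds the whole column.
    val priceCol: Array[Double] = rows.map(_.price)  // built once, scanned many times
    var sumCol = 0.0
    var i = 0
    while (i < priceCol.length) { sumCol += priceCol(i); i += 1 }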

A standard format allows applications to share data seamlessly. At present, each execution engine represents data in memory in its own way, so any handoff from one engine to another requires a time-consuming conversion. Standardizing the way that engines represent data in memory simplifies integration and speeds processing.

Apache Arrow software is available under the Apache License v2.0.

Dremio, a startup led by Jacques Nadeau, chair of the Apache Drill and Apache Arrow Project Management Committees, leads development. In September 2015, Dremio announced24 an initial funding round of $10 million.

Alluxio

Alluxio (previously named Tachyon) is a distributed storage system that supports fault-tolerant data sharing at the speed of memory across cluster jobs. The Alluxio software resides between computation frameworks (including Spark, MapReduce, Flink, Zeppelin, HBase, and Presto) and storage systems (such as HDFS, Amazon S3, GlusterFS, OpenStack Swift, and NFS). Existing MapReduce and Spark programs run much faster on top of Alluxio without code changes.

A series of papers published by Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica of the University of California, Berkeley’s AMPLab defines the theoretical framework for Alluxio.25 Haoyuan Li built the first version during Christmas of 2012 and released the code to open source in April 2013. Since then, Alluxio has attracted more than 200 contributors from over 50 companies, who have written almost 200,000 lines of code, mostly in Java.

Alluxio reports production deployments with hundreds of machines. Examples of Alluxio in action include:

  • Baidu, the largest Chinese language search engine, runs Spark SQL queries 30 times faster with Alluxio.26

  • Barclays found that Alluxio reduced runtime for a complex Spark workflow from hours to seconds.27

In March 2015, Li formed Alluxio Inc. (then named Tachyon Nexus) to commercialize the software and announced an initial funding round of $7.5 million from Andreessen Horowitz.28

Apache Ignite

Apache Ignite is an open source project based on a code base donated to the Apache Software Foundation by GridGain Systems in October 2014. GridGain remains engaged in the project, with company executives holding the PMC Chair and multiple seats on the PMC. Operating on an open core business model, GridGain continues to offer commercially licensed value-added versions of the software.

Ignite combines a fault-tolerant ACID-compliant in-memory key-value store with tools for managing data. An embedded SQL engine supports ANSI-99 syntax. The project also includes:

  • A fault-tolerant framework for implicitly parallel program execution.

  • Scalable and fault-tolerant processing for continuous never-ending streams of data.

  • High-performance cluster-wide messaging to exchange data among nodes.

Ignite can serve as a database cache, enabling users to keep the most frequently accessed data in memory. For this purpose, it offers write-through, read-through, and write-behind capability.

The Ignite File System (IGFS) is a distributed in-memory file system that works like HDFS, but in memory. IGFS splits data from each file into separate data blocks and stores them in a distributed in-memory cache, using a hashing function to determine file location. IGFS can be deployed by itself, or on top of HDFS, in which case it becomes a caching layer for HDFS (similar to Alluxio), with write-through and read-through capabilities. Apache Ignite includes an in-memory implementation of MapReduce.

Ignite provides an implementation of Spark RDDs that enables Spark jobs to share objects in memory, within the same Spark application or between different Spark applications. Running Spark SQL queries against Ignite RDDs is faster than running them against native Spark RDDs, primarily because Ignite indexes the data in memory.
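A hedged sketch of an Ignite RDD in use, assuming an existing SparkContext sc, an illustrative cache name, and the ignite-spark module circa 2016:

    import org.apache.ignite.configuration.IgniteConfiguration
    import org.apache.ignite.spark.IgniteContext

    val ic = new IgniteContext(sc, () => new IgniteConfiguration())

    // An IgniteRDD is a live view of a distributed Ignite cache.
    val shared = ic.fromCache[Int, Int]("partitioned")

    // One Spark job writes key-value pairs into the cache...
    shared.savePairs(sc.parallelize(1 to 100000).map(i => (i, i * 2)))

    // ...and another job, even in a different application, reads them back.
    println(shared.count())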

Apache Ignite runs standalone, in a cluster, within Docker containers, on Apache Mesos, and under YARN. It has native integration with the Amazon Web Services cloud and Google Compute Engine.

GridGain Systems offers Professional and Enterprise Editions of Apache Ignite, for which it provides technical support. The Professional Edition is a binary build of the software, and it includes bug fixes not yet released in the open source version. The Enterprise Edition includes a capability to port in-memory objects across platforms, a tool to manage and monitor the environment, network segmentation, a recoverable local store, rolling production updates, and data center replication.

The New In-Memory Analytics

Consistent with the theory of disruptive innovation, the industry leaders in data warehousing have largely assimilated in-memory technology and incorporated it into their own offerings. As the cost of memory has declined absolutely, in-memory databases in various forms have become mainstream, embedded in the offerings of Oracle, SAP, IBM, Microsoft, and Teradata.

Startups featuring in-memory databases licensed under a commercial model, including EXASOL and Kognitio, have not disrupted the market. (Both vendors have been in business for at least 15 years.) On the other hand, NewSQL vendors MemSQL and VoltDB, both of which operate under an open core business model, have succeeded in tapping new Hybrid Transactional/Analytical Processing (HTAP) use cases and demonstrate commensurate growth.

The greatest potential for market disruption comes from Apache Spark and related open source projects, which bring the power and speed of in-memory analytics to Hadoop and NoSQL. This disruptive power is not lost on the open source community, which has responded by making Spark the most active project in Big Data today.

Footnotes

1 Solid state devices and flash memory present a new opportunity for systems architecture. As of 2016, use of SSD and flash in analytic datastores is an emerging technology.

7 234 packages as of mid-June 2016.
