Chapter 5. Spark

Background

Apache Spark is an open-source cluster computing framework originally developed at UC Berkeley's AMPLab. Spark is a fast and flexible alternative to both stream processing systems like Storm and batch processing systems like MapReduce, and it can serve as one component of pipelines for batch processing, stream processing, machine learning, and more. A recent survey of 2,100 developers revealed that 82% would choose Spark to replace MapReduce.

Characteristics of Spark 

Spark is a versatile distributed data processing engine, providing a rich language for data scientists to explore data. It comes with an ever-growing suite of libraries for analytics and stream processing.

Spark Core consists of a programming interface and a distributed execution environment. On top of this core platform, the Spark developer community has built several libraries, including Spark Streaming, MLlib (for machine learning), Spark SQL, and GraphX (for graph analytics) (Figure 5-1). In version 1.3, Spark SQL introduced the DataFrame API, which replaced the earlier SchemaRDD abstraction. Beyond serving SQL queries, the DataFrame API is meant to provide a general-purpose library for manipulating structured data.

Figure 5-1. Spark data processing framework

The Spark execution engine keeps data in memory and can schedule jobs across many nodes. Pairing Spark with other in-memory systems, such as an in-memory database, lets data move between the two without a disk-bound step slowing either side down.

By design, Spark is stateless—there is no persistent data storage. As such, Spark relies on other systems for serving, storing, and tracking changes to data. Spark can be used with a variety of external storage options including, most commonly, databases and filesystems. Different external data stores suit different use cases.

Understanding Databases and Spark 

A common point of confusion is the relationship between Spark and databases. While some functionality overlaps, fundamental differences in design and purpose distinguish the two. The most significant difference has already been mentioned: Spark is not a persistent data store.

Table 5-1 illustrates the similarities and differences between Spark and a relational database.

Table 5-1. Comparison between Spark and a relational database
                        | Relational database                  | Spark
Programming language    | SQL                                  | Scala and libraries
Execution environment   | SQL engine, query optimizer          | Distributed job scheduler
Persistent data storage | Yes                                  | Relies on external databases and/or file systems
Data mutability         | Transactional INSERT, UPDATE, DELETE | Datasets are immutable
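
The data mutability row of Table 5-1 can be made concrete with a small sketch. It is not Spark code: Python's built-in sqlite3 module stands in for a relational database, and a plain tuple with a hypothetical helper, with_cpu, models the way a Spark transformation derives a new dataset instead of editing its input. The table name, column names, and values are invented for illustration.

```python
import sqlite3

# Relational side: sqlite3 (standard library) stands in for any relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (host TEXT PRIMARY KEY, cpu REAL)")
conn.executemany("INSERT INTO metrics VALUES (?, ?)", [("a", 0.50), ("b", 0.90)])

# UPDATE changes the stored row in place, inside a transaction.
with conn:
    conn.execute("UPDATE metrics SET cpu = 0.10 WHERE host = 'b'")
db_rows = conn.execute("SELECT host, cpu FROM metrics ORDER BY host").fetchall()

# Spark side (modeled, not Spark's API): a dataset is never edited in place.
# A transformation derives a new dataset and leaves its input untouched.
dataset = (("a", 0.50), ("b", 0.90))  # tuples cannot be mutated


def with_cpu(rows, host, cpu):
    """Return a NEW dataset in which `host` has the given cpu value."""
    return tuple((h, cpu if h == host else c) for h, c in rows)


updated = with_cpu(dataset, "b", 0.10)

print(db_rows)   # the table now holds the new value
print(dataset)   # the original dataset is unchanged
print(updated)   # the change lives in a separate, derived dataset
```

After the UPDATE, the database has a single, current version of the row. On the modeled Spark side, the original and the derived dataset both exist, which is why Spark needs an external store to hold the authoritative, changing copy of the data.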

Augmenting Spark with a real-time operational database opens a wide array of new use cases. With this setup, Spark can access live production data, and result sets from Spark can immediately be put to use in the database to support mission-critical applications. Pairing Spark with a real-time database enables companies to go from a static view to a dynamic view of operational metrics.

Spark’s distributed, in-memory execution environment is one of its core innovations. In-memory data processing eliminates the disk I/O bottleneck, and the distributed architecture reduces CPU contention by enabling parallelized execution. Pairing Spark with a disk-optimized or single-server database undercuts these benefits, because the database itself becomes the bottleneck; a distributed, in-memory database preserves them (Figure 5-2).

Figure 5-2. High throughput connectivity between an in-memory database and Spark

Other Use Cases

Spark has use cases beyond real-time streaming, such as advanced analytics on operational data. Data scientists are often hindered by a lengthy and complex ETL process that delays access to fresh data. When Spark is connected to an operational database, fresh data can be loaded into Spark for analysis, and a simple write then returns the results to the database, where they are immediately available to queries.
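
The load, analyze, and write-back loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a Spark program: an in-memory sqlite3 database stands in for the operational database, a plain Python function stands in for the Spark job, and the orders and region_totals tables, along with the load, analyze, and write_back helpers, are hypothetical names chosen for the example.

```python
import sqlite3
from collections import defaultdict

# Stand-in for the operational database (hypothetical schema and data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (region TEXT, amount REAL)")
db.execute("CREATE TABLE region_totals (region TEXT PRIMARY KEY, total REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 2.5)],
)


def load(conn):
    """Step 1: pull the current rows out of the database for analysis."""
    return conn.execute("SELECT region, amount FROM orders").fetchall()


def analyze(rows):
    """Step 2: stand-in for the Spark job; here, a per-region sum."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return sorted(totals.items())


def write_back(conn, results):
    """Step 3: one write makes the results queryable in the database."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO region_totals VALUES (?, ?)", results
        )


write_back(db, analyze(load(db)))
totals = db.execute(
    "SELECT region, total FROM region_totals ORDER BY region"
).fetchall()
print(totals)
```

In a real deployment, the load and write-back steps would go through a connector between Spark and the database, and the analysis would run as a distributed job. The shape of the loop is the same: no separate ETL stage sits between the live data and the analysis.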

Combining Spark with an operational database also enables businesses to go to production and iterate faster than ever by taking the results produced in Spark and putting them to immediate use.

Conclusion

Spark is an exciting technology that is changing the way businesses process and analyze data. More broadly, it reflects the trend toward scale-out, memory-optimized data processing systems. With use cases ranging from stream processing to machine learning, Spark also exemplifies the benefits of versatile, multipurpose infrastructure.
