focus, specifically on batch processing. This over-specialization led to an explosion of specialized libraries, each attempting to solve a different problem. So, if you wanted to process streaming data at scale, you had to use another, complementary library called Storm. Apache Storm is a free and open-source, scalable, fault-tolerant, distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop does for batch processing. Similarly, to query your data you might find it easier to reach for yet another tool, such as Hive.
So, along came Spark’s generalized abstractions for big data computing, bringing the big data pipeline into one cohesive unit, just as the smartphone did in our analogy. Spark’s aim is to be a unified platform for big data, an SDK for the many different means of processing, all of which originate from the core library. This means that every extension library automatically gains from improvements to the core, such as performance boosts. Because the core is so generalized, extending it is fairly simple and straightforward. If you want to query your data, just use Spark’s SQL library. If you want to stream, there is also a streaming library to help you out. Even machine learning is made more straightforward with MLlib, and GraphX does the same for graph computation.
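To make the idea of one cohesive unit concrete, here is a minimal, self-contained Scala sketch in which a single SparkSession drives both a core RDD computation and a Spark SQL query. The application name, local master and sample data are illustrative assumptions, not examples from Spark itself.

import org.apache.spark.sql.SparkSession

object UnifiedDemo {
  def main(args: Array[String]): Unit = {
    // One entry point serves every library built on the core.
    val spark = SparkSession.builder()
      .appName("UnifiedDemo")
      .master("local[*]") // assumption: running locally for illustration
      .getOrCreate()

    // Core API: a plain RDD transformation.
    val squares = spark.sparkContext.parallelize(1 to 5).map(n => n * n)
    println(squares.collect().mkString(", "))

    // Spark SQL: layered on the same core and the same session.
    import spark.implicits._
    val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}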
On top of sharing knowledge across libraries, this common base also means that the libraries do not need to be built from the ground up, which results in a minimal code footprint for each library. This lessens the potential for bugs and the other liabilities that code typically brings. Even the core library, on top of which all the other libraries are built, is fairly small.
6.1.2 Spark Programming Languages
There is more than one language you can use to write a Spark application. Spark itself is written in Scala. The other most obvious choice is Java, as any Scala library, such as Spark, is also compatible with Java, albeit through a more verbose syntax. Even that verbosity is toned down by the Spark developers’ effort to keep the APIs clean, resulting in an uncluttered Java API, which even supports Java 8. Spark also supports Python.
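To give a feel for the conciseness of the Scala API that the Java API approximates, here is a classic word count; it assumes an existing SparkContext named sc (as provided by spark-shell) and a hypothetical input file, input.txt.

// Word count in Spark's Scala API. Assumes `sc` is an existing
// SparkContext and that input.txt is a hypothetical sample file.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split("\\s+")) // split each line into words
  .map(word => (word, 1))              // pair each word with a count of 1
  .reduceByKey(_ + _)                  // sum the counts per word
counts.take(10).foreach(println)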
6.1.3 Understanding Spark Architecture
The Spark documentation defines a Resilient Distributed Dataset, or RDD, as a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. From a user perspective, an RDD can be thought of as a collection, similar to a list or an array.
RDDs are collections of records that are immutable, partitioned, fault-tolerant, created by coarse-grained operations, lazily evaluated and able to be persisted. We will examine these characteristics in more detail shortly.
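As a minimal sketch of those characteristics, again assuming an existing SparkContext named sc (for example, in spark-shell):

import org.apache.spark.storage.StorageLevel

// Partitioned: the data is split into 4 slices across the cluster.
val nums = sc.parallelize(1 to 1000, numSlices = 4)
// Immutable, coarse-grained: filter returns a new RDD; `nums` is untouched.
val evens = nums.filter(_ % 2 == 0)
// Persistable: keep the result in memory for reuse across actions.
evens.persist(StorageLevel.MEMORY_ONLY)
// Lazily evaluated: nothing actually runs until this action is called.
println(evens.count())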
Behind the scenes, the work is distributed across a cluster of machines so that computations against the RDD can run in parallel, reducing the processing time by orders of magnitude. This distribution of work and data also means that even if one node fails, the rest of the system continues processing, while the failed work is restarted immediately elsewhere.
Failures across large clusters are inevitable, but RDDs were designed with resiliency in mind. What makes fault tolerance feasible is that most functions in Spark are lazy. Instead of executing a function’s instructions immediately, Spark stores the instructions for later use in what is referred to as a DAG, or Directed Acyclic Graph.
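This laziness is easy to observe. The following sketch assumes an existing SparkContext sc and a hypothetical log file, events.log:

// Transformations only record lineage in the DAG; no data is read yet.
val lines = sc.textFile("events.log")
val errors = lines.filter(_.contains("ERROR"))
// Inspect the lineage Spark has recorded so far.
println(errors.toDebugString)
// An action finally triggers execution of the whole DAG.
val numErrors = errors.count()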