focus, specifically on batch processing. This over-specialization led to an explosion of specialized libraries, each attempting to solve a different problem. So, if you wanted to process streaming data at scale, you had to use another, complementary library called Storm. Apache Storm is a free and open-source, scalable, fault-tolerant, distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop does for batch processing. Similarly, to query your data you might find it easier to reach for yet another tool, such as Hive.
So, along came Spark’s generalized abstractions for big data computing, bringing the big data pipeline into one cohesive unit, just as the smartphone did in our analogy. Spark’s aim is to be a unified platform for big data, an SDK for the many different means of processing, all of which originate from the core library. This means that every extension library automatically gains from improvements to the core, such as performance boosts. Because the core is so generalized, extending it is fairly simple and straightforward. If you want to query your data, just use Spark’s SQL library. If you want to stream, there is also a streaming library to help you out. Even machine learning is made more straightforward with MLlib, and GraphX does the same for graph computation.
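To make the idea of one cohesive unit concrete, here is a minimal, self-contained Scala sketch in which a single SparkSession drives both a core RDD computation and a Spark SQL query. The application name, local master and sample data are illustrative assumptions, not examples from Spark itself.

import org.apache.spark.sql.SparkSession

object UnifiedDemo {
  def main(args: Array[String]): Unit = {
    // One entry point serves every library built on the core.
    val spark = SparkSession.builder()
      .appName("UnifiedDemo")
      .master("local[*]") // assumption: running locally for illustration
      .getOrCreate()

    // Core API: a plain RDD transformation.
    val squares = spark.sparkContext.parallelize(1 to 5).map(n => n * n)
    println(squares.collect().mkString(", "))

    // Spark SQL: layered on the same core and the same session.
    import spark.implicits._
    val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}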
On top of sharing knowledge across libraries, this common base also means that the libraries do not need to be built from the ground up, which results in a minimal code footprint for each library. This lessens the potential for bugs and the other liabilities that code typically brings. Even the core library, on top of which all the other libraries are built, is fairly small.
6.1.2 Spark Programming Languages
There is more than one language you can use to write a Spark application. Spark itself is written in Scala. The other most obvious choice is Java, as any Scala library, such as Spark, is also compatible with Java, albeit through a more verbose syntax. Even that verbosity is toned down by the Spark developers’ effort to keep the APIs clean, resulting in an uncluttered Java API, which even supports Java 8. Spark also supports Python.
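To give a feel for the conciseness of the Scala API that the Java API approximates, here is a classic word count; it assumes an existing SparkContext named sc (as provided by spark-shell) and a hypothetical input file, input.txt.

// Word count in Spark's Scala API. Assumes `sc` is an existing
// SparkContext and that input.txt is a hypothetical sample file.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split("\\s+")) // split each line into words
  .map(word => (word, 1))              // pair each word with a count of 1
  .reduceByKey(_ + _)                  // sum the counts per word
counts.take(10).foreach(println)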
6.1.3 Understanding Spark Architecture
The Spark documentation defines a Resilient Distributed Dataset, or RDD, as a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. From a user perspective, an RDD can be thought of as a collection, similar to a list or an array.
RDDs are collections of records that are immutable, partitioned, fault-tolerant, created by coarse-grained operations, lazily evaluated and able to be persisted. We will examine these characteristics in more detail shortly.
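As a minimal sketch of those characteristics, again assuming an existing SparkContext named sc (for example, in spark-shell):

import org.apache.spark.storage.StorageLevel

// Partitioned: the data is split into 4 slices across the cluster.
val nums = sc.parallelize(1 to 1000, numSlices = 4)
// Immutable, coarse-grained: filter returns a new RDD; `nums` is untouched.
val evens = nums.filter(_ % 2 == 0)
// Persistable: keep the result in memory for reuse across actions.
evens.persist(StorageLevel.MEMORY_ONLY)
// Lazily evaluated: nothing actually runs until this action is called.
println(evens.count())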
Behind the scenes, the work is distributed across a cluster of machines so that computations against the RDD can run in parallel, reducing the processing time by orders of magnitude. This distribution of work and data also means that even if one node fails, the rest of the system continues processing, while the failed work is restarted immediately elsewhere.
Failures across large clusters are inevitable, but RDDs were designed with resiliency in mind. What makes fault tolerance feasible is that most functions in Spark are lazy. Instead of executing a function’s instructions immediately, Spark stores the instructions for later use in what is referred to as a DAG, or Directed Acyclic Graph.
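This laziness is easy to observe. The following sketch assumes an existing SparkContext sc and a hypothetical log file, events.log:

// Transformations only record lineage in the DAG; no data is read yet.
val lines = sc.textFile("events.log")
val errors = lines.filter(_.contains("ERROR"))
// Inspect the lineage Spark has recorded so far.
println(errors.toDebugString)
// An action finally triggers execution of the whole DAG.
val numErrors = errors.count()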