In this chapter, we will cover the following recipes:
In this chapter, we'll be looking at how to bundle our Spark application and deploy it on various distributed environments.
As we discussed earlier in Chapter 3, Loading and Preparing Data – DataFrame, the foundation of Spark is the RDD. From a programmer's perspective, the composability of RDDs, much like that of a regular Scala collection, is a huge advantage. An RDD wraps three vital (and two subsidiary) pieces of information that help in the reconstruction of data, and this is what enables fault tolerance. The other major advantage is that, while the processing of RDDs can be composed into hugely complex graphs of RDD operations, the entire flow of data itself is not very difficult to reason about.
Other than optional optimization attributes, such as data location, an RDD at its core wraps only three vital pieces of information:

- The list of partitions that the data is divided into
- The function used to compute each partition
- The list of dependencies on its parent RDDs
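As a quick illustration of this bookkeeping, the following minimal sketch (assuming an existing SparkContext named sc, for example inside spark-shell) composes a couple of transformations and then inspects the partitions, dependencies, and lineage that the resulting RDD carries:

```scala
// Assumes a running SparkContext named sc (for example, in spark-shell)
val numbers = sc.parallelize(1 to 100)
val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

println(evenSquares.partitions.length) // the list of partitions
println(evenSquares.dependencies)      // the dependencies on the parent RDD
println(evenSquares.toDebugString)     // the lineage used to reconstruct lost data
```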
Spark spawns one task per partition. So, a partition is the basic unit of parallelism in Spark.
The number of partitions could be any of these:
- The spark.default.parallelism parameter (set while starting the cluster)
- A repartition or coalesce called on the RDD
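The following is a minimal sketch of these knobs, assuming a local SparkContext; the application name and the partition counts are just placeholder values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.default.parallelism decides the partition count when none is given explicitly
val conf = new SparkConf()
  .setAppName("PartitionSketch") // placeholder name
  .setMaster("local[2]")
  .set("spark.default.parallelism", "4")
val sc = new SparkContext(conf)

val defaultRdd = sc.parallelize(1 to 1000)
println(defaultRdd.partitions.length) // 4, taken from spark.default.parallelism

// An explicit partition count can also be passed to parallelize
val explicitRdd = sc.parallelize(1 to 1000, numSlices = 8)

// repartition (shuffles) or coalesce (avoids a shuffle when reducing) change it afterwards
val repartitioned = defaultRdd.repartition(8)
val coalesced = explicitRdd.coalesce(2)
println(s"${repartitioned.partitions.length} ${coalesced.partitions.length}") // 8 2

sc.stop()
```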
So far, we have just run our Spark application in self-contained, single-JVM mode. While the programs work just fine, we have not yet exploited the distributed nature of the RDDs.
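To make the single-JVM point concrete, here is a minimal sketch of how the master URL decides where the application runs; local[*] keeps everything inside one JVM, whereas a cluster master URL (the host name below is only a placeholder) hands the work to a cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local[*] runs the driver and the executor threads inside a single JVM
val conf = new SparkConf()
  .setAppName("SingleJvmMode") // placeholder name
  .setMaster("local[*]")
// To target a Spark Standalone cluster instead, the master would point at it,
// for example: .setMaster("spark://master-host:7077"), or be supplied through
// spark-submit's --master option rather than being hard-coded here.

val sc = new SparkContext(conf)
println(sc.master)
sc.stop()
```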
As always, all the code snippets for this chapter can be downloaded from https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter6-scalingup.