Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Introducing Apache Spark

Hadoop and MR have been around for 10 years and have proven to be the best solution to process massive data with high performance. However, MR lacked performance in iterative computing where the output between multiple MR jobs had to be written to HDFS. In a single MR job, it lacked performance because of the drawbacks of the MR framework.

Let's take a look at the history of computing trends to understand how computing paradigms have changed over the last two decades.

The trend has been to Reference the URI when the network was cheaper (in 1990), Replicate when storage became cheaper (in 2000), and Recompute when memory became cheaper (in 2010), as shown in Figure 2.5:

Figure 2.5: Trends of computing

Note

So, what really changed over a period of time?

Tape is dead, disk has become tape, and SSD has almost become the disk. Now, caching data in RAM is the current trend.

Let's understand why memory-based computing is important and how it provides significant performance benefits.

Figure 2.6 indicates the data transfer rates from various mediums to the CPU. Disk to CPU is 100 MB/s, SSD to CPU is 600 MB/s, and over the network to CPU is 1 MB to 1 GB/s. However, RAM to CPU transfer speed is astonishingly fast, which is 10 GB/s. So, the idea is to cache all or partial data in-memory so that higher performance can be achieved:

Figure 2.6: Why memory?

Spark history

Spark started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce and observed that MR was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas such as support for in-memory storage and efficient fault recovery.

In 2011, the AMPLab started to develop higher-level components on Spark such as Shark and Spark Streaming. These components are sometimes referred to as the Berkeley Data Analytics Stack (BDAS).

Spark was first open sourced in March 2010 and transferred to the Apache Software Foundation in June 2013.

In February 2014, it became a top-level project at the Apache Software Foundation. Spark has since become one of the largest open source communities in Big Data. Now, over 250 contributors in over 50 organizations are contributing to Spark development. The user base has increased tremendously from small companies to Fortune 500 companies. Figure 2.7 shows you the history of Apache Spark:

Figure 2.7: The history of Apache Spark

What is Apache Spark?

Let's understand what Apache Spark is and what makes it a force to reckon with in Big Data analytics:

Apache Spark is a fast enterprise-grade large-scale data processing engine, which is interoperable with Apache Hadoop.
It is written in Scala, which is both an object-oriented and functional programming language that runs in a JVM.
Spark enables applications to distribute data reliably in-memory during processing. This is the key to Spark's performance as it allows applications to avoid expensive disk access and performs computations at memory speeds.
It is suitable for iterative algorithms by having every iteration access data through memory.
Spark programs perform 100 times faster than MR in-memory or 10 times faster on disk (http://spark.apache.org/).
It provides native support for Java, Scala, Python, and R languages with interactive shells for Scala, Python, and R. Applications can be developed easily, and often 2 to 10 times less code is needed.
Spark powers a stack of libraries, including Spark SQL and DataFrames for interactive analytics, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time analytics. You can combine these features seamlessly in the same application.
Spark runs on Hadoop, Mesos, standalone cluster managers, on-premise hardware, or in the cloud.

What Apache Spark is not

Hadoop provides HDFS for storage and MR for compute. However, Spark does not provide any specific storage medium. Spark is mainly a compute engine, but you can store data in-memory or on Tachyon to process it.

Spark has the ability to create distributed datasets from any file stored in the HDFS or other storage systems supported by Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, Elasticsearch, and others).

It's important to note that Spark is not Hadoop and does not require Hadoop to run it. It simply has support for storage systems implementing Hadoop APIs. Spark supports text files, sequence files, Avro, Parquet, and any other Hadoop InputFormat.

Note

Does Spark replace Hadoop?

Spark is designed to interoperate with Hadoop. It's not a replacement for Hadoop, but it's a replacement for the MR framework on Hadoop. All Hadoop processing frameworks (Sqoop, Hive, Pig, Mahout, Cascading, and Crunch) using MR as an engine now use Spark as an additional processing engine.

MapReduce issues

MR developers faced challenges with respect to performance and converting every business problem to an MR problem. Let's understand the issues related to MR and how they are addressed in Apache Spark:

MR creates separate JVMs for every Mapper and Reducer. Launching JVMs takes a considerable amount of time.
MR code requires a significant amount of boilerplate coding. The programmer needs to think and design every business problem in terms of Map and Reduce, which makes it a very difficult program. One MR job can rarely do a full computation. You need multiple MR jobs to finish the complete task, and need to design and keep track of optimizations at all levels. Hive and Pig solve this problem. However, they are not suitable for all use cases.
An MR job writes the data to disk between each job and hence is not suitable for iterative processing.
A higher level of abstraction, such as Cascading and Scalding, provides better programming of MR jobs. However, it does not provide any additional performance benefits.
MR does not provide great APIs either.

MR is slow because every job in an MR job flow stores the data on disk. Multiple queries on the same dataset will read data separately and create high disk I/O, as shown in Figure 2.8:

Figure 2.8: MapReduce versus Apache Spark

Spark takes the concept of MR to the next level to store intermediate data in-memory and reuses it, as needed, multiple times. This provides high performance at memory speeds, as shown in Figure 2.8.

Note

If I have only one MR job, does it perform the same as Spark?

No, the performance of the Spark job is superior to the MR job because of in-memory computations and its shuffle improvements. The performance of Spark is superior to MR even when the memory cache is disabled. A new shuffle implementation (sort-based shuffle instead of hash-based shuffle), new network module (based on netty instead of using block manager to send shuffle data), and new external shuffle service make Spark perform the fastest petabyte sort (on 190 nodes with 46 TB RAM) and terabyte sort. Spark sorted 100 TB of data using 206 EC2 i2.8xlarge machines in 23 minutes. The previous world record was 72 minutes, set by a Hadoop MR cluster of 2,100 nodes. This means that Spark sorted the same data 3 times faster using 10 times fewer machines. All the sorting took place on disk (HDFS) without using Spark's in-memory cache (https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html).

To summarize, here are the differences between MR and Spark:

	MR	Spark
Ease of use	It is not easy to code and use	Spark provides a good API, which is easy to code and use
Performance	Performance is relatively poor when compared with Spark	In-memory performance
Iterative processing	Every MR job writes the data to disk and the next iteration reads from the disk	Spark caches the data in-memory
Fault tolerance	It's achieved by replicating the data in HDFS	Spark achieves fault tolerance by Resilient Distributed Dataset (RDD) Lineage, which is explained in Chapter 3, Deep Dive into Apache Spark
Runtime architecture	Every Mapper and Reducer runs in a separate JVM	Tasks are run in a preallocated executor JVM, which is explained in Chapter 3, Deep Dive into Apache Spark
Shuffle	Stores data on disk	It stores data in-memory and on disk
Operations	Map and Reduce	Map, Reduce, Join, Cogroup, and many more
Execution model	Batch only	Batch, Interactive, and Streaming
Natively supported programming languages	Java only	Java, Scala, Python, and R

Spark's stack

Spark's stack components are Spark Core, Spark SQL, Datasets and DataFrames, Spark Streaming, Structured Streaming, MLlib, GraphX, and SparkR as shown in Figure 2.9:

Figure 2.9: The Apache Spark ecosystem

Here is a comparison of Spark components with Hadoop Ecosystem components:

Spark	Hadoop Ecosystem
Spark Core	MapReduce Apache Tez
Spark SQL, Datasets and DataFrames	Apache Hive Apache Impala Apache Tez Apache Drill
Spark Streaming Structured Streaming	Apache Storm Apache Storm Trident Apache Flink Apache Apex Apache Samza
Spark MLlib	Apache Mahout
Spark GraphX	Apache Giraph
SparkR	RMR2 RHive

To understand the Spark framework at a higher level, let's take a look at these core components of Spark and their integrations:

Feature	Details
Programming languages	Java, Scala, Python, and R. Scala, Python, and R shell for quick development.
Core execution engine	Spark Core: Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides Java, Scala, Python, and R APIs for the ease of development. Tungsten: It provides Memory Management and Binary Processing, Cache-aware Computation, and Code generation.
Frameworks	Spark SQL, Datasets, and DataFrames: Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called Datasets and DataFrames and can also act as a distributed SQL query engine. Spark Streaming: Spark Streaming enables us to build scalable and fault-tolerant streaming applications. It integrates with a wide variety of data sources, including File Systems, HDFS, Flume, Kafka, and Twitter. Structured Streaming: Structured Streaming is a new paradigm shift in streaming computing that enables building continuous applications with end-to-end exactly once guarantee and data consistency, even in case of node delays and failures. MLlib: MLlib is a machine learning library used to create data products or extract deep meaning from the data. MLlib provides high performance because of in-memory caching of data. GraphX: GraphX is a graph computation engine with graph algorithms to build graph applications. SparkR: SparkR overcomes the R's single-threaded process issues and memory limitations with Spark's distributed in-memory processing engine. SparkR provides a distributed DataFrame based on DataFrame API and Distributed Machine Learning using MLlib.
Off-heap storage	Tachyon: Reliable data sharing at memory-speed within and across cluster frameworks/jobs. Spark's default `OFF_HEAP` (experimental) storage is Tachyon.
Cluster resource managers	Standalone: By default, applications are submitted to the standalone mode cluster and each application will try to use all the available nodes and resources. YARN: YARN controls the resource allocation and provides dynamic resource allocation capabilities. Mesos: Mesos has two modes—coarse-grained and fine-grained. The coarse-grained approach has a static number of resources just like the standalone resource manager. The fine-grained approach has dynamic resource allocation just like YARN.
Storage	HDFS, S3, and other filesystems with the support of Hadoop InputFormat.
Database integrations	HBase, Cassandra, Mongo DB, Neo4J, and RDBMS databases.
Integrations with streaming sources	Flume, Kafka and Kinesis, Twitter, Zero MQ, and File Streams.
Packages	http://spark-packages.org/ provides a list of third-party data source APIs and packages.
Distributions	Distributions from Cloudera, Hortonworks, MapR, and DataStax.
Notebooks	Jupyter and Apache Zeppelin.
Dataflows	Apache NiFi, Apache Beam, and StreamSets.

The Spark ecosystem is a unified stack that provides you with the power of combining SQL, streaming, and machine learning in one program. The advantages of unification are as follows:

No need of copying or ETL of data between systems
Combines processing types in one program
Code reuse
One system to learn
One system to maintain

An example of unification is shown in Figure 2.10:

Figure 2.10: The unification of the Apache Spark ecosystem

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Introducing Apache Spark

Create new playlist

Sign In

Sign Up

Introducing Apache Spark

Note

Spark history

What is Apache Spark?

What Apache Spark is not

Note

MapReduce issues

Note

Spark's stack

Table of Contents for
Introducing Apache Spark