Apache Spark shines brightest when it is combined with Hadoop. To understand why, let's take a look at Spark's key features.
| Feature | Details |
| --- | --- |
| Easy development | Multiple native APIs such as Java, Scala, Python, and R; REPLs for Scala, Python, and R |
| Optimized performance | Caching, an optimized shuffle, and the Catalyst Optimizer |
| Unification | Batch, SQL, machine learning, streaming, and graph processing |
| High-level APIs | DataFrames, Datasets, and Data Sources APIs (see the sketch after this table) |
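To make the high-level APIs row concrete, here is a minimal sketch of the DataFrames API in the Scala REPL (spark-shell), where the SparkSession is predefined as `spark`; the JSON path and the column names are illustrative assumptions:

```scala
// Hypothetical input path and columns (name, age).
val people = spark.read.json("hdfs:///data/people.json")

// The same DataFrame can be queried programmatically...
people.filter(people("age") > 21).show()

// ...or through SQL, after registering it as a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```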
When the two frameworks are combined, we get enterprise-grade applications with in-memory performance, as shown in Figure 2.11:
The following are questions that practitioners frequently raise about Spark:
**What if the data does not fit in memory?** Spark's operators spill data to disk when it does not fit in memory, allowing Spark to run on data of any size. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level. By default, Spark recomputes the partitions that don't fit in memory. The storage level can be changed to MEMORY_AND_DISK to spill those partitions to disk instead.
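As a minimal sketch (again in spark-shell, where the SparkContext is predefined as `sc`; the input path is an assumption), the storage level is set when persisting the RDD:

```scala
import org.apache.spark.storage.StorageLevel

// Illustrative path; any dataset larger than the available memory will do.
val lines = sc.textFile("hdfs:///data/large-input.txt")

// cache() uses MEMORY_ONLY, which recomputes partitions that do not fit
// in memory; MEMORY_AND_DISK spills those partitions to disk instead.
lines.persist(StorageLevel.MEMORY_AND_DISK)

lines.count()  // the first action materializes the cached partitions
```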
Figure 2.12 shows the performance difference between a fully cached dataset and one read from disk:
**How does Spark recover from failures?** Spark's built-in fault tolerance, based on RDD lineage, automatically recovers from failures: lost partitions are recomputed by replaying the transformations that produced them. Figure 2.13 shows the performance impact of a failure in the sixth iteration of a k-means algorithm:
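As a rough illustration of the lineage Spark relies on for recovery (in spark-shell, with a hypothetical input path), `toDebugString` prints the chain of transformations that would be replayed to rebuild a lost partition:

```scala
// Hypothetical input: one comma-separated point per line.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()

// The lineage: textFile -> map -> cached RDD. If an executor is lost,
// Spark replays these transformations for the missing partitions only.
println(points.toDebugString)
```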