Apache Spark shines brightest when it is combined with Hadoop. To understand why, let's take a look at Spark's key features.
| Feature | Details |
| --- | --- |
| Easy development | Multiple native APIs such as Java, Scala, Python, and R; REPLs for Scala, Python, and R |
| Optimized performance | Caching, an optimized shuffle, and the Catalyst Optimizer |
| Unification | Batch, SQL, machine learning, streaming, and graph processing |
| High-level APIs | DataFrames, Datasets, and Data Sources APIs (see the sketch after this table) |
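To make the high-level APIs row concrete, here is a minimal sketch of the DataFrames API in the Scala REPL (spark-shell), where the SparkSession is predefined as `spark`; the JSON path and the column names are illustrative assumptions:

```scala
// Hypothetical input path and columns (name, age).
val people = spark.read.json("hdfs:///data/people.json")

// The same DataFrame can be queried programmatically...
people.filter(people("age") > 21).show()

// ...or through SQL, after registering it as a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```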
When the two frameworks are combined, we get enterprise-grade applications with in-memory performance, as shown in Figure 2.11:
The following are questions that practitioners frequently raise about Spark:
**What if the data does not fit in memory?** Spark's operators spill data to disk when it does not fit in memory, allowing Spark to run on data of any size. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level. By default, Spark recomputes the partitions that don't fit in memory. The storage level can be changed to MEMORY_AND_DISK to spill those partitions to disk instead.
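As a minimal sketch (again in spark-shell, where the SparkContext is predefined as `sc`; the input path is an assumption), the storage level is set when persisting the RDD:

```scala
import org.apache.spark.storage.StorageLevel

// Illustrative path; any dataset larger than the available memory will do.
val lines = sc.textFile("hdfs:///data/large-input.txt")

// cache() uses MEMORY_ONLY, which recomputes partitions that do not fit
// in memory; MEMORY_AND_DISK spills those partitions to disk instead.
lines.persist(StorageLevel.MEMORY_AND_DISK)

lines.count()  // the first action materializes the cached partitions
```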
Figure 2.12 shows the performance difference between a fully cached dataset and one read from disk:
**How does Spark recover from failures?** Spark's built-in fault tolerance, based on RDD lineage, automatically recovers from failures: lost partitions are recomputed by replaying the transformations that produced them. Figure 2.13 shows the performance impact of a failure in the sixth iteration of a k-means algorithm:
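As a rough illustration of the lineage Spark relies on for recovery (in spark-shell, with a hypothetical input path), `toDebugString` prints the chain of transformations that would be replayed to rebuild a lost partition:

```scala
// Hypothetical input: one comma-separated point per line.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()

// The lineage: textFile -> map -> cached RDD. If an executor is lost,
// Spark replays these transformations for the missing partitions only.
println(points.toDebugString)
```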