Introduction to Apache Spark

Before we begin building any kind of pipeline, we should take a minute to familiarize ourselves with what Spark is and what it offers.

Spark, built for both speed and ease of use, is a fast open source engine designed for large-scale data processing.

Through its advanced Directed Acyclic Graph (DAG) execution engine, which supports acyclic data flow and in-memory computing, programs can run up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.

Spark consists of the following components:

  • Spark Core: This is the underlying engine of Spark, built on the fundamental programming abstraction called Resilient Distributed Datasets (RDDs). An RDD is an immutable, fault-tolerant collection of objects partitioned across the nodes of a cluster, which Spark operates on in parallel as an "object collection".
  • Spark SQL: This provides a new data abstraction called DataFrames for structured data processing using a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
  • MLlib: This is Spark's built-in library of algorithms for mining big data, common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives that best support Spark.
  • Spark Streaming: This extends Spark's fast scheduling capability to perform real-time analysis on continuous streams of incoming data.
  • GraphX: This is the graph processing framework for the analysis of graph-structured data.
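
To make the RDD abstraction at the heart of Spark Core concrete, here is a minimal pure-Python sketch of the programming model: immutable collections split into partitions, lazy-feeling chained transformations (`map`, `filter`), and actions (`collect`, `reduce`) that produce results. This is not Spark itself; the `TinyRDD` class and its methods are hypothetical stand-ins that mirror the shape of the real RDD API for illustration only.

```python
# A toy, single-machine stand-in for a Resilient Distributed Dataset.
# NOT Spark itself -- just an illustration of the "object collection"
# abstraction; the class and method names mimic the real RDD API.
from functools import reduce


class TinyRDD:
    def __init__(self, partitions):
        # An RDD is a collection of records split into partitions,
        # which a real cluster would spread across worker nodes.
        self.partitions = partitions

    def map(self, fn):
        # Transformations return a NEW TinyRDD; the original is immutable.
        return TinyRDD([[fn(x) for x in part] for part in self.partitions])

    def filter(self, pred):
        return TinyRDD([[x for x in part if pred(x)] for part in self.partitions])

    def reduce(self, fn):
        # Actions compute a result: reduce within each partition first,
        # then combine the per-partition partial results.
        partials = [reduce(fn, part) for part in self.partitions if part]
        return reduce(fn, partials)

    def collect(self):
        # Gather all records back into one local list.
        return [x for part in self.partitions for x in part]


# Square the even numbers 1..8, spread over two "partitions".
rdd = TinyRDD([[1, 2, 3, 4], [5, 6, 7, 8]])
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
print(result)                                    # [4, 16, 36, 64]
print(rdd.reduce(lambda a, b: a + b))            # 36
```

In real Spark the same chain would be written against a `SparkContext` (for example, `sc.parallelize(range(1, 9)).filter(...).map(...).collect()`), with the partitions distributed across the cluster and the transformations evaluated lazily only when an action is called.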