Hadoop and Spark

Hadoop is a venerable technology now; the grand old man of distributed computing technologies. We won't spend too much time dwelling on Hadoop's internals, but a brief introduction is required for this chapter to make sense to folks who are not from a big-data background:

The MapReduce programming paradigm is what really matters to a user: the user defines map and reduce tasks using the MapReduce API and submits them to the MapReduce engine, the processing part of the Hadoop ecosystem.
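
As a concrete illustration, here is a minimal sketch of the classic word-count job written as two Python scripts for Hadoop Streaming (the MapReduce API proper is Java; Hadoop Streaming lets any executable that reads stdin and writes stdout act as a map or reduce task). The script names are purely illustrative:

    # mapper.py -- the map task: emit (word, 1) for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- the reduce task: Hadoop sorts by key before the reduce phase,
    # so all counts for a given word arrive contiguously and can be summed
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")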

When a job is triggered on the cluster, YARN comes into play; it prioritizes among the different jobs and shares out resources such as compute capacity.

YARN is the acronym for Yet Another Resource Negotiator, and it plays the role of a scheduler and resource allocator on the Hadoop cluster. YARN will figure out where and how to run the job.
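
To get a feel for what YARN is tracking, the ResourceManager exposes a small REST API. The following sketch assumes the ResourceManager web UI is reachable on its default port (8088) at a hypothetical host called resourcemanager; it lists the applications YARN is currently scheduling and the capacity it is dividing up:

    # A sketch only: the host name is hypothetical, and the requests library
    # must be installed; the endpoints are the YARN ResourceManager REST API.
    import requests

    base = "http://resourcemanager:8088/ws/v1/cluster"

    # Cluster-wide view: applications running, memory still available, and so on
    metrics = requests.get(f"{base}/metrics").json()["clusterMetrics"]
    print(metrics["appsRunning"], "applications running")
    print(metrics["availableMB"], "MB of memory available")

    # Per-application view: owner, queue, and current state of each job
    apps = requests.get(f"{base}/apps").json()["apps"]
    for app in (apps or {}).get("app", []):
        print(app["id"], app["user"], app["queue"], app["state"])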

This process also involves copying the JAR files (Java archives) over to each of the nodes in the cluster; this essentially moves the compute to where the storage is. This tight coupling between storage and compute is a key feature of Hadoop, and loosening this coupling is a key insight that major cloud services such as Dataproc exploit. Once the job has run to completion, the results are collected and stored back in HDFS.
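
To make that last step concrete, here is a sketch of pulling a finished job's output back out of HDFS using the third-party hdfs package (a WebHDFS client). The NameNode address, user, and output path are all hypothetical, and the port shown (9870) is the Hadoop 3.x WebHDFS default:

    # A sketch only: requires "pip install hdfs"; host, user, and paths are made up
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="hadoop")

    # MapReduce writes one part-* file per reducer under the job's output directory
    output_dir = "/user/hadoop/wordcount-output"
    for name in client.list(output_dir):
        print(name)

    # Stream one of the result files back to the client
    with client.read(f"{output_dir}/part-00000", encoding="utf-8") as reader:
        print(reader.read())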

This also gives us an important insight that we should remember: Hadoop, at heart, is a batch processing system. The user interacts with it by defining map and reduce tasks, and the results end up on a distributed filesystem. This is neither the most convenient nor the most intuitive way of interacting with parallel computing programs. Partly to compensate for this slightly abstract nature, and partly to compensate for the batch nature of Hadoop, a whole bunch of tools have sprung up around Hadoop:

  • Hive: Hive serves as a SQL-like wrapper on top of the distributed filesystem, HDFS
  • HBase: HBase is a column-oriented data store built on top of Hadoop and HDFS
  • Pig: Pig is a transformation tool that helps get semi-structured or unstructured data into HDFS
  • Kafka: Kafka helps deal with streaming data
  • Spark: Spark is a really powerful computing engine and represents possibly the hottest big-data technology today

There are several other elements in what has come to be known as the Hadoop ecosystem, and, collectively, these constitute nothing less than an entire big-data suite.

Perhaps the most important and noteworthy of these is Spark. Spark is a powerful big data and machine-learning engine that can be used from a variety of programming languages such as Python, Scala, Java, and R. The languages most commonly used to work with Spark are Python (Spark from Python is often just called PySpark) and Scala (the language Spark itself is written in). Spark need not be run on Hadoop, but it is often used on top of Hadoop, making use of YARN and HDFS but not MapReduce.
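
By way of comparison with the MapReduce sketch earlier, here is roughly what the same word count looks like in PySpark. The input path is hypothetical, and the master is set to local[*] so the script can be tried on a single machine; on a Hadoop cluster it would typically be submitted with the master set to yarn:

    # A sketch only: requires pyspark; the HDFS path is illustrative
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("wordcount")
        .master("local[*]")   # on a Hadoop cluster this would usually be "yarn"
        .getOrCreate()
    )

    # Read lines of text, split them into words, and count each word
    lines = spark.read.text("hdfs:///user/hadoop/input/words.txt")
    counts = (
        lines.rdd.flatMap(lambda row: row.value.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )

    for word, count in counts.take(10):
        print(word, count)

    spark.stop()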
