Distributed computations

Sometimes, your application may need to work with data that cannot be managed easily in classical databases. You might also need to perform a variety of computations/projections on the data, and maintaining an index for each column is neither feasible nor efficient. Google pioneered a new way of solving such problems by distributing the computation, so that the same code can run against different chunks of the data on a set of commodity machines. This paradigm is called MapReduce.

MapReduce splits the overall processing into two functions, both of which take key/value pairs as input and produce key/value pairs as output. The functions are as follows:

  • Map: Takes an input key/value pair and generates a set of intermediate key/value pairs (which can be empty). For example, the Map function might create a histogram (mapping of word to its count) within a page of a document.
  • Reduce: Gets an intermediate key and the list of all values that were generated for it in the Map step; that is, all of the intermediate values that share the same key are aggregated into a list. In the word count example, each Reduce invocation gets a word as the key and a list of its counts, one per page of the document. A sketch of both functions follows this list.

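To make the model concrete, the following is a minimal, framework-agnostic sketch of the two functions for the word count example, written in Go. The KeyValue type and the string-based signatures are illustrative assumptions, not the API of any particular framework:

    package wordcount

    import (
        "strconv"
        "strings"
    )

    // KeyValue is an illustrative intermediate key/value pair.
    type KeyValue struct {
        Key   string
        Value string
    }

    // Map takes one page of a document and emits one intermediate pair per
    // distinct word: the word as the key and its count within the page as
    // the value (the per-page histogram described previously).
    func Map(pageName string, contents string) []KeyValue {
        counts := map[string]int{}
        for _, word := range strings.Fields(contents) {
            counts[word]++
        }
        var out []KeyValue
        for word, n := range counts {
            out = append(out, KeyValue{Key: word, Value: strconv.Itoa(n)})
        }
        return out
    }

    // Reduce receives one word and the list of per-page counts emitted for
    // it across all Map invocations, and returns the total count.
    func Reduce(word string, values []string) string {
        total := 0
        for _, v := range values {
            n, _ := strconv.Atoi(v)
            total += n
        }
        return strconv.Itoa(total)
    }

The framework, not the user code, is responsible for grouping all intermediate values that share a key before invoking Reduce.
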
The flow is shown as follows:

In Google's framework, the Google File System (GFS) and a distributed NoSQL database called BigTable play supporting roles alongside MapReduce. The distributed MapReduce computation is implemented using a master/slave architecture: a master assigns tasks and controls execution on a set of slave machines.

The preceding diagram illustrates the word count example we discussed. The input document is stored in GFS and split into chunks. The master kicks things off by sending the code of the Map and Reduce functions to all the workers. It then spawns Map tasks on the workers, which write the intermediate key/value pairs into files. Finally, the Reduce tasks are spawned; they read the intermediate files and write the final output.
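
The routing of intermediate keys to Reduce tasks is typically done by hashing each key. The following snippet is a hypothetical illustration of that partitioning step; the choice of hash function and the partition helper itself are assumptions for illustration, not something mandated by the framework:

    package shuffle

    import "hash/fnv"

    // partition picks the Reduce task that should receive a given
    // intermediate key, so that every value emitted for that key ends up
    // at the same reducer.
    func partition(key string, nReduce int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32()) % nReduce
    }

In the word count example, this is what guarantees that every per-page count for a given word reaches the same Reduce task, so the final total is complete.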

The Apache foundation has created an open source version of this paradigm, called Hadoop. It uses the Hadoop Distributed File System (HDFS) instead of GFS, and Yet Another Resource Negotiator (YARN) as its cluster execution scheduler. YARN spawns/allocates a number of containers (processes) on the machines in the cluster and allows one to execute arbitrary commands on them. HBase is the equivalent of BigTable. There are many higher-level projects on top of Hadoop; for example, Hive can run SQL-like queries by converting them into MapReduce jobs.

Though the MapReduce framework has a simple programming model and has met its objective of crunching big data easily, it's not applicable if you have real-time response requirements. There are two reasons for this:

  • Hadoop was initially designed for batch processing. Things such as scheduling, code transfers, and process (Mapper/Reducer) setup and teardown mean that even the smallest computations take at least a few seconds to finish.
  • HDFS is designed for high-throughput rather than low-latency data I/O. Data blocks in HDFS are very large, and the sustained disk throughput is only on the order of 100 to 200 MB/s.

One way of optimizing away the disk I/O is to store the data in memory. With a cluster of machines, the combined memory of the cluster means that the memory size of an individual machine is no longer a constraint.

In-memory computing does not mean that the entire dataset should be in memory; even caching frequently used data can significantly improve the overall job execution time. Apache Spark is built on this model. Its primary abstraction is called a Resilient Distributed Dataset (RDD). This is essentially an immutable, partitioned collection of records (for example, a batch of events) that can be processed together in parallel. In a model similar to Hadoop, there is a main program (driver) that coordinates computation by sending tasks to executor processes running on multiple worker nodes:

Reference: https://spark.apache.org/docs/latest/cluster-overview.html
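
Spark itself is driven from Scala, Java, Python, or R rather than Go, but the driver/executor split can be sketched conceptually. In the following Go sketch, goroutines stand in for executor processes working on partitions of an in-memory dataset; every name here is illustrative and not part of Spark's API:

    package driver

    import "sync"

    // Partition is one slice of an in-memory dataset, analogous to one
    // partition of an RDD.
    type Partition []string

    // runJob plays the role of the driver: it sends one task per partition
    // to the pool of executors and gathers the results.
    func runJob(dataset []Partition, task func(Partition) int) []int {
        results := make([]int, len(dataset))
        var wg sync.WaitGroup
        for i, p := range dataset {
            wg.Add(1)
            go func(i int, p Partition) { // goroutine ~ one executor task
                defer wg.Done()
                results[i] = task(p)
            }(i, p)
        }
        wg.Wait()
        return results
    }

Passing a task that counts the words in each partition, and then summing the returned slice, reproduces the word count job over the cached dataset.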