MapReduce

There are a number of models for performing data mining and general computation on big data. One of the most popular is the MapReduce model, which can be used for general computation on arbitrarily large datasets.

MapReduce originates from Google, where it was developed with distributed computing in mind; the design also provides fault tolerance and scalability. The original MapReduce research paper was published in 2004, and since then there have been thousands of projects, implementations, and applications using it.

While the underlying idea is similar to earlier approaches, MapReduce has become a staple of big data analytics.

Intuition

MapReduce has two main steps: the Map step and the Reduce step. These are built on the functional programming concepts of mapping a function to a list and reducing the result. To explain the concept, we will develop code that will iterate over a list of lists and produce the sum of all numbers in those lists.

There are also shuffle and combine steps in the MapReduce paradigm, which we will see later.

To start with, the Map step takes a function and applies it to each element in a list. The returned result is a list of the same size, with the results of the function applied to each element.

Open a new IPython Notebook and start by creating a list of lists with numbers in each sublist:

a = [[1,2,1], [3,2], [4,9,1,0,2]]

Next, we can perform a map, using the sum function. This step will apply the sum function to each element of a:

sums = map(sum, a)

While sums is a lazy map object in Python 3 (the actual values aren't computed until we ask for them), the above step is approximately equivalent to the following code:

sums = []
for sublist in a:
    results = sum(sublist)
    sums.append(results)

The reduce step is a little more complicated. It involves applying a function cumulatively to the elements of a sequence, starting from an initial value. We apply the given function to the initial value and the first element, then apply it to that result and the next element, and so on.

We start by creating a function that takes two numbers and adds them together.

def add(a, b):
    return a + b

We then perform the reduce. The signature of reduce is reduce(function, sequence[, initial]), where the function is applied at each step to the accumulated result and the next element of the sequence. In the first step, the initial value is used as the accumulated result, rather than the first element of the list:

from functools import reduce
print(reduce(add, sums, 0))

The result, 25, is the sum of the values in the sums list and is, consequently, the sum of all the elements in the original list of lists.

The preceding code is equivalent to the following:

initial = 0
current_result = initial
for element in sums:
    current_result = add(current_result, element)

In this trivial example, our code can be greatly simplified, but the real gains come from distributing the computation. For instance, if we had a million sublists and each of those sublists contained a million elements, we could distribute this computation over many computers.

In order to do this, we distribute the map step. For each of the elements in our list, we send it, along with a description of our function, to a computer. This computer then returns the result to our main computer (the master).

The master then sends the result to a computer for the reduce step. In our example of a million sublists, we would send a million jobs to different computers (the same computer may be reused after it completes its first job). The returned result would be just a single list of a million numbers, of which we then compute the sum.

The result is that no computer ever needed to store more than a million numbers, despite our original data having a trillion numbers in it.
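
As a rough local illustration (this sketch is our own and not part of the original example), Python's multiprocessing.Pool can stand in for a cluster of worker machines: the pool distributes the map step across worker processes and the master process performs the final reduce:

from multiprocessing import Pool

a = [[1, 2, 1], [3, 2], [4, 9, 1, 0, 2]]

if __name__ == "__main__":
    # Each sublist is shipped to a worker process, which returns its partial sum
    with Pool(processes=4) as pool:
        partial_sums = pool.map(sum, a)
    # The master process then reduces the partial results to a single total
    print(sum(partial_sums))  # 25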

A word count example

The implementation of MapReduce is a little more complex than just using a map and reduce step. Both steps are invoked using keys, which allows for the separation of data and tracking of values.

The map function takes a key/value pair and returns a list of key/value pairs. The keys for the input and output don't necessarily relate to each other. For example, for a MapReduce program that performs a word count, the input key might be a sample document's ID value, while the output key would be a given word. The input value would be the text of the document and the output value would be the frequency of each word:

from collections import defaultdict
def map_word_count(document_id, document):

We first count the frequency of each word. In this simplified example, we split the document on whitespace to obtain the words, although there are better options:

    counts = defaultdict(int)
    for word in document.split():
        counts[word] += 1

We then yield each of the word, count pairs. The word here is the key, with the count being the value in MapReduce terms:

    for word in counts:
        yield (word, counts[word])
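
As a quick sanity check (the toy document below is our own and not from the dataset used later), calling the mapper on a short string yields one pair per distinct word:

print(list(map_word_count(0, "the cat sat on the mat")))
# [('the', 2), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1)]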

By using the word as the key, we can then perform a shuffle step, which groups all of the values for each key:

def shuffle_words(results_generators):

First, we aggregate the resulting counts for each word into a list of counts:

    records = defaultdict(list)

We then iterate over all the results that were returned by the map function:

    for results in results_generators:
        for word, count in results:
            records[word].append(count)

Next, we yield each of the words along with all the counts that were obtained in our dataset:

    for word in records:
        yield (word, records[word])
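
To check the shuffle step in isolation (the toy input below is our own illustration), we can feed it a couple of hand-written map results:

toy_map_results = [[("cat", 1), ("dog", 2)], [("cat", 3)]]
print(list(shuffle_words(toy_map_results)))
# [('cat', [1, 3]), ('dog', [2])]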

The final step is the reduce step, which takes a key/value pair (where the value is always a list) and produces a key/value pair as a result. In our example, the key is the word, the input list is the list of counts produced in the shuffle step, and the output value is the sum of the counts:

def reduce_counts(word, list_of_counts):
    return (word, sum(list_of_counts))
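
For example, if the shuffle step grouped the counts [1, 3] under the word cat, the reducer would return the total (this call is our own illustration):

print(reduce_counts("cat", [1, 3]))  # ('cat', 4)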

To see this in action, we can use the 20 newsgroups dataset, which is provided in scikit-learn:

from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(subset='train')
documents = dataset.data

We then apply our map step. We use enumerate to automatically generate document IDs and itertools.starmap to unpack each (document ID, document) pair into the two arguments of map_word_count. While the document IDs aren't important in this application, these keys are important in other applications:

from itertools import starmap
map_results = starmap(map_word_count, enumerate(documents))

The actual result here is just a lazy iterator; no counts have been computed yet. When consumed, it yields, for each document, a generator of (word, count) pairs.

Next, we perform the shuffle step to group these word counts by word:

shuffle_results = shuffle_words(map_results)
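
To finish the job locally, we can apply the reduce step to each (word, list_of_counts) pair produced by the shuffle (this final step is a sketch of our own; the variable name reduced_results is illustrative):

reduced_results = [reduce_counts(word, counts) for word, counts in shuffle_results]
print(reduced_results[:5])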

This, in essence, is a MapReduce job; however, it is running on a single thread, meaning we aren't getting any benefit from the MapReduce paradigm. In the next section, we will start using Hadoop, an open source implementation of MapReduce, to get the benefits of this type of paradigm.

Hadoop MapReduce

Hadoop is a set of open source tools from Apache that includes an implementation of MapReduce; for many use cases it is the de facto implementation. The project is managed by the Apache Software Foundation (which is also responsible for the well-known Apache web server).

The Hadoop ecosystem is quite complex, with a large number of tools. The main component we will use is Hadoop MapReduce. Other tools for working with big data that are included in Hadoop are as follows:

  • Hadoop Distributed File System (HDFS): This is a file system that can store files over many computers, with the goal of being robust against hardware failure while providing high bandwidth.
  • YARN: This is a method for scheduling applications and managing clusters of computers.
  • Pig: This is a higher level programming language for MapReduce. Hadoop MapReduce is implemented in Java, and Pig sits on top of the Java implementation, allowing you to write programs in other languages—including Python.
  • Hive: This is for managing data warehouses and performing queries.
  • HBase: This is an implementation of Google's BigTable, a distributed database.

These tools all solve different issues that come up when doing big data experiments, including data analytics.

There are also non-Hadoop-based implementations of MapReduce, as well as other projects with similar goals. In addition, many cloud providers have MapReduce-based systems.
