Chapter 8. Final Thoughts and the Future of Design Patterns

At the time of this book’s writing, MapReduce is moving quickly. New features and new systems are popping up every day and new users are out in droves. More importantly for the subject of MapReduce Design Patterns, a growing number of users brings along a growing number of experts. These experts are the ones who will drive the community’s documentation of design patterns, not only by sharing new ones, but also by maturing the existing ones.

In this chapter, we’ll discuss and speculate about what the future holds for MapReduce design patterns. Where will they come from? What systems will benefit from design patterns? How will today’s design patterns change with the technology? What trends in data will affect the design patterns of today?

Trends in the Nature of Data

MapReduce systems such as Hadoop aren’t being used just for text analysis anymore. An increasing number of users are deploying MapReduce jobs that analyze data once thought to be too hard for the paradigm. New design patterns are sure to arise to deal with this, turning solutions that push the limits of the system into daily practice.

Images, Audio, and Video

One of the most obvious trends in the nature of data is the rise of image, audio, and video analysis. This form of data is a good candidate for a distributed system using MapReduce because these files are typically very large. Retailers want to analyze their security video to detect which stores are busiest. Medical imaging analysis is becoming harder with the astronomical resolutions of the pictures. Unfortunately, MapReduce grew up as a text processing platform, and some artifacts of that heritage remain that make this type of analysis challenging. Since this is a MapReduce book, we’ll acknowledge the fact that analyzing this type of data is really hard, even on a single node with not much data, but we will not go into more detail.

One place we may see a surge in design patterns is dealing with multidimensional data. Videos have colored pixels that change over time, laid out on a two-dimensional grid. To top it off, they also may have an audio track. MapReduce follows a very straightforward, one-dimensional tape paradigm. The data is in order from front to back and that is how it is analyzed. Therefore, it’s challenging to treat a 10-pixel by 10-pixel by 5-second section of video and audio as a “record.” As multidimensional data increases in popularity, we’ll see more patterns showing how to logically split the data into records and input splits properly. Or, it is possible that new systems will fill this niche. For example, SciDB, an open-source analytical database, is specifically built to deal with multidimensional data.
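To make the idea of a multidimensional “record” concrete, here is a minimal sketch in plain Python (not Hadoop code) of one way pixel samples could be assigned to 10-pixel by 10-pixel by 5-second blocks. The function names, the `(x, y, t, value)` sample layout, and the default block sizes are assumptions made for illustration; a real job would need a custom input format to do something equivalent.

```python
from collections import defaultdict


def block_key(x, y, t, block_w=10, block_h=10, block_secs=5.0):
    """Map a pixel position (x, y) and a timestamp t in seconds to the
    index of the block that contains it. All names are hypothetical."""
    return (x // block_w, y // block_h, int(t // block_secs))


def group_into_blocks(samples, **block_sizes):
    """Group (x, y, t, value) samples so that each block can be handed
    to a mapper as a single logical record."""
    blocks = defaultdict(list)
    for x, y, t, value in samples:
        blocks[block_key(x, y, t, **block_sizes)].append(value)
    return dict(blocks)


# Two samples fall in the same block; the third lands in a neighbor.
samples = [(3, 4, 1.0, 200), (9, 9, 4.5, 180), (12, 4, 6.0, 90)]
blocks = group_into_blocks(samples)
```

The key turns a three-dimensional neighborhood into a single value that a one-dimensional system can group on, which is the kind of logical splitting the text anticipates.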

Streaming Data

MapReduce is traditionally a batch analytics system, but streaming analytics feels like a natural progression. In many production MapReduce systems, data is constantly streaming in and then gets processed in batch on an interval. For example, data from web server logs streams in continuously, but the MapReduce job is only executed every hour.

This is inconvenient for a few reasons. First, processing an hour’s worth of data at once can strain resources. Because the data comes in gradually, processing it as it arrives would spread the load on the cluster’s computational resources more evenly. Second, MapReduce systems typically depend on a relatively large block size to reduce the overhead of distributed computation. When data is streaming in, it comes in record by record, not in large blocks. These hurdles make processing streaming data difficult with MapReduce.

As in the previous section about large media files, this gap is likely to be filled by a combination of two things: new patterns and new systems. Some new operational patterns for storing data of this nature might crop up as users take this problem more seriously in production. New patterns for doing streaming-like analysis in the framework of batch MapReduce will mature. Novel systems that deal with streaming data in Hadoop have cropped up, most notably the commercial product HStreaming and the open-source Storm platform, recently released by Twitter.

Note

The authors actually considered putting some “streaming patterns” into this book, but none of them were anywhere near mature enough or vetted enough to be officially documented.

The first is an exotic RecordReader. The map task starts up and streams data into the RecordReader instead of loading already existing data from a file. This has significant operational concerns that make it difficult to implement.

The second is splitting up the job into several single-map-task jobs that get fired off every time some data comes in. The output is partitioned into k bins for future “reducers.” Every now and then, a map-only job with k mappers starts up and plays the role of the reducer.
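The second idea can be sketched in a few lines of plain Python. This is only a toy, in-memory illustration under stated assumptions: the word-count workload, the value of `K`, the CRC32-based bin assignment, and every function name are hypothetical, and nothing here uses the Hadoop API.

```python
import zlib
from collections import defaultdict

K = 4  # number of bins; a hypothetical value chosen for the example


def bin_for(key, k=K):
    """Stable hash partition: a given key always lands in the same bin."""
    return zlib.crc32(key.encode("utf-8")) % k


def micro_map(batch, bins):
    """Stands in for one single-map-task job run over a newly arrived
    batch. Emits (word, 1) pairs and appends each pair to its bin."""
    for line in batch:
        for word in line.split():
            bins[bin_for(word)].append((word, 1))


def reduce_bin(pairs):
    """Stands in for one mapper of the periodic map-only job that plays
    the role of the reducer for a single bin."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


bins = [[] for _ in range(K)]
micro_map(["a b a"], bins)  # first batch arrives
micro_map(["b c"], bins)    # second batch arrives later

# Every now and then: one "reducer" pass per bin.
results = {}
for pairs in bins:
    results.update(reduce_bin(pairs))
```

Because the bin assignment depends only on the key, all values for a key accumulate in one bin across batches, so each bin can be reduced independently.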

The Effects of YARN

YARN (Yet Another Resource Negotiator) is a high-visibility advancement of Hadoop MapReduce that is currently in version 2.0.x and will eventually make it into the current stable release. Many in the Hadoop community cannot wait for it to mature, as it fills a number of gaps. At a high level, YARN splits the responsibilities of the JobTracker and TaskTrackers into a single ResourceManager, one NodeManager per node, and one ApplicationMaster per application or job. The ResourceManager and NodeManagers abstract away computational resources from the current map-and-reduce slot paradigm and allow arbitrary computation. Each ApplicationMaster handles a framework-specific model of computation that breaks down a job into resource allocation requests, which are in turn handled by the ResourceManager and the NodeManagers.

What this does is separate the computation framework from the resource management. In this model, MapReduce is just another framework and doesn’t look any more special than custom frameworks such as MPI, streaming, commercial products, or who knows what.

MapReduce design patterns will not change in and of themselves, because MapReduce will still exist. However, now that users can build their own distributed application frameworks or use other frameworks with YARN, some of the more intricate solutions to problems may be more natural to solve in another framework. We’ll see some design patterns that will still exist but just aren’t used very much anymore, since the natural solution lies in another distributed framework. We will likely eventually see ApplicationMaster patterns for building completely new frameworks for solving a type of problem.

Patterns as a Library or Component

Over time, as patterns get more and more use, someone may decide to componentize that pattern as a built-in utility class in a library. This type of progression is seen in traditional design patterns, as well, in which the library parameterizes the pattern and you just interact with it, instead of reimplementing the pattern. This is seen with several of the custom Hadoop MapReduce pieces that exist in the core Hadoop libraries, such as TotalOrderPartitioner, ChainReducer, and MultipleOutputs.

This is very natural from a standpoint of code reuse. The patterns in this book are presented to help you start solving a problem from scratch. Adding a layer of indirection, in the form of modules that set up the job for you and offer several parameters as points of customization, can be helpful in the long run.
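As a rough illustration of what such a module might look like, the sketch below packages a “top N” pattern behind two parameters. It is plain Python with a toy in-memory runner, not a Hadoop library; `run_job`, `top_n_job`, and their parameters are hypothetical names invented for this example.

```python
import heapq
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy in-memory stand-in for a MapReduce run: map, group, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def top_n_job(records, n, score):
    """Packaged "top N" pattern. The caller supplies only n and a scoring
    function; the mapper and reducer are wired up internally."""
    def mapper(record):
        # Send every record to one group so a single reducer sees them all.
        yield "top", record

    def reducer(_key, values):
        return heapq.nlargest(n, values, key=score)

    return run_job(records, mapper, reducer)


top3 = top_n_job([3, 1, 4, 1, 5, 9, 2, 6], n=3, score=lambda r: r)
```

The caller never writes a mapper or reducer here. The points of customization are `n` and `score`, which is the trade a componentized pattern makes: less flexibility in exchange for less code to reimplement.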

How You Can Help

If you think you’ve developed a novel MapReduce pattern that you haven’t seen before and you are feeling generous, you should definitely go through the motions of documenting it and sharing it with the world.

There are a number of questions you should try to answer. These were some of the questions we considered when choosing the patterns for this book.

Is the problem you are trying to solve similar to another pattern’s target problem?

Identifying this is important for preventing any sort of confusion. Chapter 5, in particular, takes this question seriously.

What is at the root of this pattern?

You probably developed the pattern to solve a very specific problem and have custom code interspersed throughout. Developers will be smart enough to tailor a pattern to their own problem or mix patterns to solve their more complicated problems. Strip the code down until only the pattern is left.

What is the performance profile?

Understanding what kinds of resources a pattern will use is important for gauging how many reducers will be needed and, in general, how expensive the operation will be. For example, some people may be surprised at how resource intensive sorting is in a distributed system.

How might you have solved this problem otherwise?

Finding some examples outside of a MapReduce context (such as we did with SQL and Pig) is useful as a metaphor that helps conceptually bridge to a MapReduce-specific solution.
