Chapter 1. Introduction to Parallel Programming
This chapter provides a general introduction to parallel computing and places the subject in the context of OpenCL programming.
Keywords: Grain of computation, introduction, multi-core, many-core, parallel programming models

Introduction

Today's computing environments are becoming more multifaceted, exploiting the capabilities of a range of multi-core microprocessors, central processing units (CPUs), digital signal processors, reconfigurable hardware (FPGAs), and graphics processing units (GPUs). Presented with so much heterogeneity, the programming community faces a number of challenges in developing efficient software for such a wide array of architectures.
Applications possess a number of workload behaviors, ranging from control intensive (e.g., searching, sorting, and parsing) to data intensive (e.g., image processing, simulation and modeling, and data mining). Applications can also be characterized as compute intensive (e.g., iterative methods, numerical methods, and financial modeling), where the overall throughput of the application is heavily dependent on the computational efficiency of the underlying hardware. Each of these workload classes typically executes most efficiently on a specific style of hardware architecture. No single architecture is best for running all classes of workloads, and most applications possess a mix of these workload characteristics. For instance, control-intensive applications tend to run faster on superscalar CPUs, where significant die real estate has been devoted to branch prediction mechanisms, whereas data-intensive applications tend to run faster on vector architectures, where the same operation is applied to multiple data items concurrently.

OpenCL

The Open Computing Language (OpenCL) is a heterogeneous programming framework that is managed by the nonprofit technology consortium Khronos Group. OpenCL is a framework for developing applications that execute across a range of device types made by different vendors. It supports a wide range of levels of parallelism and efficiently maps to homogeneous or heterogeneous, single- or multiple-device systems consisting of CPUs, GPUs, and other types of devices limited only by the imagination of vendors. The OpenCL definition offers both a device-side language and a host management layer for the devices in a system. The device-side language is designed to efficiently map to a wide range of memory systems. The host language aims to support efficient plumbing of complicated concurrent programs with low overhead. Together, these provide the developer with a path to efficiently move from algorithm design to implementation.
OpenCL provides parallel computing using task-based and data-based parallelism. It currently supports CPUs that include x86, ARM, and PowerPC, and it has been adopted into graphics card drivers by both AMD (in its Accelerated Parallel Processing SDK) and NVIDIA. Support for OpenCL is rapidly expanding as a wide range of platform vendors have adopted OpenCL and support or plan to support it for their hardware platforms. These vendors span a wide range of market segments, from embedded vendors (ARM and Imagination Technologies) to HPC vendors (AMD, Intel, NVIDIA, and IBM). The architectures supported include multi-core CPUs, throughput and vector processors such as GPUs, and fine-grained parallel devices such as FPGAs.
Most important, OpenCL's cross-platform, industrywide support makes it an excellent programming model for developers to learn and use, with the confidence that it will continue to be widely available for years to come with ever-increasing scope and applicability.

The goals of this book

This book is the first of its kind to present OpenCL programming in a fashion appropriate for the classroom. The book is organized to address the need for teaching parallel programming on current system architectures using OpenCL as the target language, and it includes examples for CPUs, GPUs, and their integration in the accelerated processing unit (APU). Another major goal of this text is to provide a guide for programmers to develop well-designed OpenCL programs targeting parallel systems. The book leads the programmer through the various abstractions and features provided by the OpenCL programming environment. The examples offer the reader a simple introduction as well as more complicated optimizations, and they suggest further development and goals at which to aim. The book also discusses tools for profiling and debugging, so that the reader need not feel lost in the development process.
The book is accompanied by a set of instructor slides and programming examples, which support the use of this text by an OpenCL instructor. Please visit http://heterogeneouscomputingwithopencl.org/ for additional information.

Thinking parallel

Most applications are first programmed to run on a single processor. In the field of high-performance computing, classical approaches have been used to accelerate computation when provided with multiple computing resources. Standard approaches include “divide-and-conquer” and “scatter–gather” problem decomposition methods, providing the programmer with a set of strategies to effectively exploit the parallel resources available in high-performance systems. Divide-and-conquer methods iteratively break a problem into subproblems until the subproblems fit well on the computational resources provided. Scatter–gather methods send a subset of the input data set to each parallel resource and then collect the results of the computation and combine them into a result data set. As with divide-and-conquer, the partitioning sizes the subsets according to the capabilities of the parallel resources. Figure 1.1 shows how popular applications such as sorting and a dot product can be effectively mapped to parallel resources to accelerate processing.
Figure 1.1
(A) Simple sorting and (B) dot product examples.
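To make the scatter–gather idea concrete, the following sketch computes the dot product of Figure 1.1(B) by assigning each of numChunks tasks a contiguous subset of the input and then combining the partial sums. The chunks are processed one after another here purely for illustration; in a real system each chunk would be dispatched to a separate parallel resource. The function and variable names are our own.

#include <vector>

// Sketch of scatter-gather decomposition for a dot product.
// Each chunk models the work sent to one parallel resource; the chunks are
// processed serially here to keep the example self-contained.
float dotProduct(const std::vector<float>& a, const std::vector<float>& b, int numChunks)
{
    const int n = static_cast<int>(a.size());
    std::vector<float> partial(numChunks, 0.0f);

    // Scatter: each chunk receives a contiguous subset of the input data.
    for (int c = 0; c < numChunks; ++c) {
        const int begin = c * n / numChunks;
        const int end = (c + 1) * n / numChunks;
        for (int i = begin; i < end; ++i)
            partial[c] += a[i] * b[i];
    }

    // Gather: collect the partial results and combine them into the final value.
    float result = 0.0f;
    for (int c = 0; c < numChunks; ++c)
        result += partial[c];
    return result;
}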
The programming task becomes increasingly challenging when faced with the growing parallelism and heterogeneity present in contemporary parallel processors. Given the power and thermal limits of complementary metal-oxide semiconductor (CMOS) technology, microprocessor vendors find it difficult to scale the frequency of these devices to derive more performance and have instead decided to place multiple processors, sometimes specialized, on a single chip. In doing so, the problem of extracting parallelism from an application is left to the programmer, who must decompose the underlying algorithms in the applications and map them efficiently to a diverse variety of target hardware platforms.
In the past 5 years, parallel computing devices have been increasing in number and processing capabilities. GPUs have also appeared on the computing scene and are providing new levels of processing capability at very low cost. Driven by the demand for real-time three-dimensional graphics rendering, a highly data-parallel problem, GPUs have evolved rapidly as very powerful, fully programmable, task and data-parallel architectures. Hardware manufacturers are now combining CPUs and GPUs on a single die, ushering in a new generation of heterogeneous computing. Compute-intensive and data-intensive portions of a given application, called kernels, may be offloaded to the GPU, providing significant performance per watt and raw performance gains, while the host CPU continues to execute nonkernel tasks.
Many systems and phenomena in both the natural world and the man-made world present us with different classes of parallelism and concurrency:
• Molecular dynamics
• Weather and ocean patterns
• Multimedia systems
• Tectonic plate drift
• Cell growth
• Automobile assembly lines
• Sound and light wave propagation
Parallel computing, as defined by Almasi and Gottlieb (1989), is “a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (i.e., in parallel).” The degree of parallelism that can be achieved is dependent on the inherent nature of the problem at hand (remember that there exists significant parallelism in the world), and the skill of the algorithm or software designer lies in identifying the different forms of parallelism present in the underlying problem. We begin with a discussion of two simple examples to demonstrate inherent parallel computation: vector multiplication and text searching.
Our first example carries out multiplication of the elements of two arrays A and B, each with N elements, storing the result of each multiply in a corresponding array C. Figure 1.2 shows the computation we would like to carry out. The serial C++ code would look as follows:
Figure 1.2
Multiplying two arrays: This example provides for parallel computation without any need for communication.
for (int i = 0; i < N; i++)
    C[i] = A[i] * B[i];
This code possesses significant parallelism but very little arithmetic intensity. The computation of every element in C is independent of every other element. If we were to parallelize this code, we could choose to generate a separate execution instance to perform the computation of each element of C. This code possesses significant data-level parallelism because the same operation is applied across all of A and B to produce C. We could also view this breakdown as a simple form of task parallelism where each task operates on a subset of the same data; however, task parallelism generalizes further to execution on pipelines of data or even more sophisticated parallel interactions. Figure 1.3 shows an example of task parallelism in a pipeline to support filtering of images in frequency space using an FFT.
Figure 1.3
Filtering a series of images using an FFT shows clear task parallelism as a series of tasks operate together in a pipeline to compute the overall result.
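As a preview of how such a data-parallel computation is expressed in OpenCL (the kernel language itself is introduced in Chapter 2), the array multiplication of Figure 1.2 might be written as the following device kernel, in which each work item computes a single element of C. The kernel and argument names are illustrative.

// Sketch of the array multiplication as an OpenCL C kernel. One work item
// is created per element, so no communication between work items is needed.
__kernel void vecMult(__global const float* A,
                      __global const float* B,
                      __global float* C)
{
    int i = get_global_id(0);   // index of this work item in the global range
    C[i] = A[i] * B[i];
}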
Let us consider a second example. The computation we are trying to carry out is to find the number of occurrences of a string of characters in a body of text (Figure 1.4). Assume that the body of text has already been parsed into a set of N words. We could choose to divide the task of comparing the string against the N potential matches into N comparisons (i.e., tasks), where each of the N words is compared against the search string. This approach, although rather naïve in terms of search efficiency, is highly parallel. The process of comparing the search string against the set of potential words presents N parallel tasks, each carrying out the same set of operations. There is even further parallelism within a single comparison task, where matching on a character-by-character basis presents a finer-grained degree of parallelism. This example exhibits both data-level parallelism (we are going to be performing the same operation on multiple data items) and task-level parallelism (we can compare the string to all words concurrently).
Figure 1.4
An example of both task-level and data-level parallelism. We can have parallel tasks that count the occurrences of a string in a body of text. The lower portion of the figure shows that the string comparison can be broken down to finer-grained parallel processing.
Once the number of matches is determined, we need to accumulate them to provide the total number of occurrences. Again, this summing can exploit parallelism. In this step, we introduce the concept of “reduction,” where we can utilize the availability of parallel resources to combine partial sums in a very efficient manner. Figure 1.5 shows the reduction tree, which illustrates this summation process in log N steps.
Figure 1.5
After all string comparisons are completed, we can sum up the number of matches in a combining network.
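The sketch below illustrates this combining step on the host, assuming the per-task match counts have already been gathered into an array whose length is a power of two. Each pass of the outer loop corresponds to one level of the combining network in Figure 1.5, so the final count is produced in log N steps; the function name is illustrative.

#include <vector>

// Sketch of a combining (reduction) tree over per-task match counts.
// In a parallel implementation, every addition within one pass of the outer
// loop could be performed by a separate task in the same step.
int reduceMatches(std::vector<int> counts)   // counts.size() assumed to be a power of two
{
    for (size_t stride = counts.size() / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; ++i)
            counts[i] += counts[i + stride];
    return counts[0];
}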

Concurrency and parallel programming models

Here, we discuss concurrency and parallel processing models so that when attempting to map an application developed in OpenCL to a parallel platform, we can select the right model to pursue. Although all of the following models can be supported in OpenCL, the underlying hardware may restrict which model will be practical to use.
Concurrency is concerned with two or more activities happening at the same time. We find concurrency in the real world all the time—for example, carrying a child in one arm while crossing a road or, more generally, thinking about something while doing something else with one's hands.
When talking about concurrency in terms of computer programming, we mean a single system performing multiple tasks independently. Although it is possible that concurrent tasks may be executed at the same time (i.e., in parallel), this is not a requirement. For example, consider a simple drawing application, which is either receiving input from the user via the mouse and keyboard or updating the display with the current image. Conceptually, receiving and processing input are different operations (i.e., tasks) from updating the display. These tasks can be expressed in terms of concurrency, but they do not need to be performed in parallel. In fact, in the case in which they are executing on a single core of a CPU, they cannot be performed in parallel. In this case, the application or the operating system should switch between the tasks, allowing both some time to run on the core.
Parallelism is concerned with running two or more activities in parallel with the explicit goal of increasing overall performance. For example, consider the following assignments:
step 1) A = B + C
step 2) D = E + G
step 3) R = A + D
The assignments of A and D in steps 1 and 2 (respectively) are said to be independent of each other because there is no data flow between these two steps (i.e., the variables E and G on the right side of step 2 do not appear on the left side of step 1, and the variables B and C on the right side of step 1 do not appear on the left side of step 2). Also, the variable on the left side of step 1 (A) is not the same as the variable on the left side of step 2 (D). This means that steps 1 and 2 can be executed in parallel (i.e., at the same time). Step 3 is dependent on both steps 1 and 2, so it cannot be executed in parallel with either step 1 or step 2.
Parallel programs must be concurrent, but concurrent programs need not be parallel. Although many concurrent programs can be executed in parallel, interdependencies between concurrent tasks may preclude this. For example, an interleaved execution would still satisfy the definition of concurrency while not executing in parallel. As a result, only a subset of concurrent programs are parallel, and the set of all concurrent programs is itself a subset of all programs. Figure 1.6 shows this relationship.
Figure 1.6
Parallel and concurrent programs are subsets of programs.
In the remainder of this section, some well-known approaches to programming concurrent and parallel systems are introduced with the aim of providing a foundation before introducing OpenCL in Chapter 2.

Threads and Shared Memory

A running program may consist of multiple subprograms that maintain their own independent control flow and that are allowed to run concurrently. These subprograms are defined as threads. Threads communicate with one another through updates to and accesses of memory in the same address space. Each thread has its own pool of local memory (that is, its own variables), but all threads see the same set of global variables. A simple analogy for the use of threads is a main program that includes a number of subroutines. The main program is scheduled to run by the operating system and performs the loading and acquisition of the system and user resources it needs. Execution of the main program begins by performing some serial work and then continues by creating a number of tasks that can be scheduled and run by the operating system concurrently using threads.
Each thread benefits from a global view of memory because it shares the same memory address space as the main program. Threads communicate with each other through this global memory, which can require synchronization constructs to ensure that two threads do not update the same global address at the same time.
A memory consistency model is defined to manage load and store ordering. All processors see the same address space and have direct access to these addresses without requiring the help of other processors. Mechanisms such as locks and semaphores are commonly used to control access to shared memory that is accessed by multiple tasks. A key feature of the shared memory model is that the programmer is not responsible for managing data movement, although depending on the consistency model implemented in the hardware or runtime system, some level of memory consistency may have to be enforced manually. This relaxes the requirement to specify explicitly the communication of data between tasks, and as a result, parallel code development can often be simplified.
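A minimal sketch of this model using C++ threads is shown below: the two worker threads share the global counter through the common address space, and the mutex is the synchronization construct that keeps their updates from colliding. The names and counts are illustrative.

#include <mutex>
#include <thread>

int sharedCounter = 0;     // global state visible to every thread
std::mutex counterLock;    // protects updates to the shared counter

void worker(int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(counterLock);
        ++sharedCounter;   // only one thread at a time executes this update
    }
}

int main()
{
    std::thread t1(worker, 1000);
    std::thread t2(worker, 1000);
    t1.join();
    t2.join();
    // sharedCounter is now 2000 because the lock serialized every update.
    return 0;
}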
There is a significant cost to supporting a fully consistent shared memory model in hardware. For multiprocessor systems, the hardware structures required to support this model become a limiting factor: shared buses become bottlenecks in the design, and the extra hardware required typically grows exponentially in complexity as we attempt to add additional processors. This has slowed the introduction of multi-core and multiprocessor systems at the low end, and coherence protocol overheads have limited the number of cores working together in a consistent shared memory system to relatively low numbers. More relaxed shared memory systems scale further, although in all cases scaling shared memory systems comes at the cost of complicated and expensive interconnects.
Most multi-core CPU platforms support shared memory in one form or another. OpenCL supports execution on shared memory devices.

Message-Passing Communication

The message-passing communication model enables explicit intercommunication between a set of concurrent tasks that each use their own local memory during computation. Multiple tasks can reside on the same physical device and/or across an arbitrary number of devices. Tasks exchange data by sending and receiving explicit messages. Data transfer usually requires cooperative operations to be performed by each process; for example, a send operation must have a matching receive operation.
From a programming perspective, message-passing implementations commonly comprise a library of hardware-independent routines for sending and receiving messages. The programmer is responsible for explicitly managing communication between tasks. Historically, a variety of message-passing libraries have been available since the 1980s, and these implementations differed substantially from each other, making it difficult for programmers to develop portable applications. The Message Passing Interface (MPI) is currently the most popular message-passing middleware.
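A minimal MPI sketch of this cooperation is shown below: rank 0 sends a single integer and rank 1 posts the matching receive. The value exchanged is arbitrary and chosen only for illustration.

#include <mpi.h>
#include <stdio.h>

// Sketch of explicit message passing: every send is paired with a matching
// receive, and no memory is shared between the two processes.
int main(int argc, char** argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}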

Different Grains of Parallelism

In parallel computing, granularity is a measure of the ratio of computation to communication. Periods of computation are typically separated from periods of communication by synchronization events. The grain of parallelism is constrained by the inherent characteristics of the algorithms constituting the application, and it is important that the parallel programmer selects the right granularity in order to reap the full benefits of the underlying platform, because the right grain size can expose additional degrees of parallelism. This selection is sometimes referred to as “chunking”: determining the amount of data to assign to each task. Next, we consider some of the trade-offs associated with identifying the right grain size.
• Fine-grained parallelism
  • Low arithmetic intensity.
  • May not have enough work to hide long-duration asynchronous communication.
  • Facilitates load balancing by providing a larger number of more manageable (i.e., smaller) work units.
  • If the granularity is too fine, it is possible that the overhead required for communication and synchronization between tasks can actually produce a slower parallel implementation than the original serial execution.
• Coarse-grained parallelism
  • High arithmetic intensity.
  • Complete applications can serve as the grain of parallelism.
  • More difficult to load balance efficiently.
Given these trade-offs, which granularity will lead to the best implementation? The most efficient granularity is dependent on the algorithm and the hardware environment in which it is run. In most cases, if the overhead associated with communication and synchronization is high relative to the time of the computation at hand, it will be advantageous to work at a coarser granularity. Fine-grained parallelism can help reduce overheads due to load imbalance or memory delays (this is particularly true on a GPU, which depends on zero-overhead fine-grained thread switching to hide memory latencies). Fine-grained parallelism can even occur at the instruction level, an approach used in very long instruction word (VLIW) and superscalar architectures.
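The sketch below shows chunking applied to a loop of N independent iterations: shrinking chunkSize moves the code toward fine-grained work units (better load balance, more scheduling overhead), whereas growing it toward N gives coarse-grained chunks. The helper names are illustrative, and doWork stands in for the per-item computation.

#include <algorithm>

static void doWork(int /*i*/) { /* placeholder for the per-item computation */ }

// Sketch of chunking: the iteration space is cut into blocks of chunkSize
// items, and each block is the unit of work handed to a parallel resource.
// Here the blocks are processed serially so the example stays self-contained.
void processInChunks(int n, int chunkSize)
{
    for (int begin = 0; begin < n; begin += chunkSize) {
        const int end = std::min(begin + chunkSize, n);
        for (int i = begin; i < end; ++i)
            doWork(i);
    }
}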

Data Sharing and Synchronization

Consider the case in which two applications run without sharing any data. As long as the runtime system or operating system has access to adequate execution resources, they can be run concurrently and even in parallel. If, halfway through its execution, one application generates a result that is subsequently required by the second application, then we have to introduce some form of synchronization into the system, and parallel execution, at least across the synchronization point, becomes impossible.
When writing concurrent software, data sharing and synchronization play a critical role. Examples of data sharing in concurrent programs include
• the input of a task is dependent on the result of another task—for example, in a producer/consumer or pipeline execution model; and
• when intermediate results are combined together (e.g., as part of a reduction, as in our word search example shown in Figure 1.4).
Ideally, we would only attempt to parallelize portions of an application that are void of data dependencies, but this is not always possible. Explicit synchronization primitives such as barriers or locks may be used to support synchronization when necessary. Although we only raise this issue here, later chapters revisit this question when discussing communication between host and device programs and synchronization between tasks.
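As a small illustration of the producer/consumer case listed above, the sketch below uses C++ threads with a condition variable as the explicit synchronization primitive: the consumer cannot proceed past its wait until the producer has published its result, so execution cannot be parallel across that point. The names and the produced value are illustrative.

#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable resultReady;
bool ready = false;
int result = 0;            // the intermediate value the consumer depends on

void producer()
{
    {
        std::lock_guard<std::mutex> guard(m);
        result = 42;       // produce the value another task is waiting for
        ready = true;
    }
    resultReady.notify_one();
}

void consumer()
{
    std::unique_lock<std::mutex> lock(m);
    resultReady.wait(lock, [] { return ready; });  // block until the data exists
    // result can now be read safely; this is the synchronization point.
}

int main()
{
    std::thread c(consumer);
    std::thread p(producer);
    p.join();
    c.join();
    return 0;
}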

Structure

The remainder of the book is organized as follows:
Chapter 2 presents an introduction to OpenCL, including key concepts such as kernels, platforms, and devices; the four different abstraction models; and developing your first OpenCL kernel. Understanding these different models is critical to fully appreciate the richness of OpenCL's programming model.
Chapter 3 presents some of the architectures that OpenCL does or might target, including x86 CPUs, GPUs, and APUs. The text includes discussion of different styles of architectures, including single instruction multiple data and VLIW. This chapter also covers the concepts of multi-core and throughput-oriented systems, as well as advances in heterogeneous architectures.
Chapter 4 introduces basic matrix multiplication, image rotation, and convolution implementations to help the reader learn OpenCL by example.
Chapter 5 discusses concurrency and execution in the OpenCL programming model. In this chapter, we discuss kernels, work items, and the OpenCL execution and memory hierarchies. We also show how queuing and synchronization work in OpenCL such that the reader gains an understanding of how to write OpenCL programs that interact with memory correctly.
Chapter 6 shows how OpenCL maps to an example architecture. For this study, we choose a system comprising an AMD Phenom II CPU and an AMD Radeon HD6970 GPU. This chapter allows us to show how the mappings of the OpenCL programming model for largely serial architectures such as CPUs and vector/throughput architectures such as GPUs differ, giving some idea how to optimize for specific architectural styles.
Chapter 7 presents a case study that accelerates a convolution algorithm. Issues related to memory space utilization and efficiency are considered, as well as work item scheduling, wavefront occupancy, and overall efficiency. These techniques are the foundations necessary for developing high-performance code using OpenCL.
Chapter 8 presents a case study targeting video processing, utilizing OpenCL to build performant image processing effects that can be applied to video streams.
Chapter 9 presents another case study examining how to optimize the performance of a histogramming application. In particular, it highlights how careful design of workgroup size and memory access patterns can make a vast difference to performance in memory-bound applications such as histograms.
Chapter 10 discusses how to leverage a heterogeneous CPU–GPU environment. The target application is a mixed particle simulation (as illustrated on the cover of this book) in which work is distributed across both the CPU and the GPU depending on the grain size of particles in the system.
Chapter 11 shows how to use OpenCL extensions using the device fission and double precision extensions as examples.
Chapter 12 introduces the reader to debugging and analyzing OpenCL programs. The right debugging tool can save a developer hundreds of wasted hours, allowing him or her instead to learn the specific computer language and solve the problem at hand.
Chapter 13 provides an overview and performance trade-offs of WebCL. WebCL is not yet a product, but imagine what could be possible if the web were powered by OpenCL. This chapter describes example OpenCL bindings for JavaScript, discussing an implementation of a Firefox plug-in that allows web applications to leverage the powerful parallel computing capabilities of modern CPUs and GPUs.
Reference
Almasi, G.S.; Gottlieb, A., Highly Parallel Computing. (1989) Benjamin Cummings, Redwood City, CA.
Further reading and relevant websites
Chapman, B.; Jost, G.; van der Pas, R.; Kuck, D.J., Using OpenMP: Portable Shared Memory Parallel Programming. (2007) MIT Press, Cambridge, MA.
Duffy, J., Concurrent Programming on Windows. (2008) Addison-Wesley, Upper Saddle River, NJ.
Gropp, W.; Lusk, E.; Skjellum, A., Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press Scientific and Engineering Computation Series. (1994) MIT Press, Cambridge, MA.
Herlihy, M.; Shavit, N., The Art of Multiprocessor Programming. (2008) Morgan Kaufmann, Burlington, MA.
Khronos Group, OpenCL, www.khronos.org/opencl.
Mattson, T.G.; Sanders, B.A.; Massingill, B.L., Patterns for Parallel Programming. (2004) Addison-Wesley, Upper Saddle River, NJ.