Chapter 10
Parallel Advisor–Driven Design
What's in This Chapter?
Using Parallel Advisor
Surveying the application
Adding annotations
Assessing suitability
Checking for correctness
Moving from annotations to parallel implementations
This chapter introduces a parallel development cycle that uses Intel Parallel Advisor. Advisor helps programmers become more productive, because it reveals the potential costs and benefits of parallelism by modeling (simulating) this behavior before programmers actually implement the parallelism in their code.
The problem that Advisor helps you solve is parallelizing existing C/C++ programs to obtain parallel speedup. Advisor's value is increased productivity: it enables you to quickly and easily experiment with where to add parallelism so that the resulting program is both correct and delivers an effective performance improvement. The experiments are performed by modeling the effect of the parallelism, without adding actual parallel constructs.
Advisor is a time-tested methodology for successfully parallelizing code, along with a set of tools that provide information about the program. Advisor has several related personas:
The objective of parallelization is to find the parallel program lurking within your serial program. The parallelism may be hiding due to the serial program being over-constrained — for example, having read-write global variables that cause no problems for serial code but inhibit parallelism.
Advisor is not an automatic parallelization tool; it is aimed at code that is larger and messier than simple loop nests. Instead, it guides you through the set of decisions you must make and provides data about your program at each step. In summary, Advisor provides a lightweight methodology that lets you easily experiment with parallelism in different places.
Your parallel experiments with Advisor may all fail, which can be a blessing in disguise — you can avoid wasting time trying to parallelize an inherently serial algorithm. You may need to investigate alternative algorithms that can be parallelized, or just leave your program serial and investigate serial optimizations.
Who can use Advisor?
The key technology in Advisor is the use of parallel modeling of the serial program. You don't actually add parallelism to your code — you just indicate where you want to add it and the Advisor tools model how that parallel code would behave. This is a huge advantage over having to immediately add parallel constructs. Your still-serial program doesn't crash or produce incorrect results because of incorrect and likely nondeterministic parallel execution (such as unprotected data sharing among tasks). Test suites generate identical results, because your serial program will not show the nondeterminism caused by parts of the program running in different orders due to parallelism. This also enables you to refactor your program to remove data-sharing errors and make it parallel-ready, while it is still serial.
Advisor does have some disadvantages, compared with plunging ahead and immediately adding parallel constructs:
Intel Parallel Advisor guides you through a series of steps (see Figure 10.1). In practice, programmers usually move back and forth between some of the steps until they achieve good results.
The Advisor Workflow tab guides you through these steps, highlighting the current step in blue (see Figure 10.2). The Start buttons are used to launch each analysis, and the Update buttons are used to re-run an analysis tool. You can view the results by pressing the blue right arrow button.
The following five basic steps help you find hidden parallel programs:
Now that you have a parallel program, you can apply the rest of Parallel Studio.
You can follow several strategies for investigating multiple parallel region (site) opportunities:
Advisor provides copious documentation, which you can access in one of the following ways:
This chapter uses the NQueens example program that ships with Advisor to demonstrate how Advisor works. Listing 10.1 shows the two functions, setQueen() and solve(), that are the focus of the analysis.
Listing 10.1: The setQueen() and solve() functions
void setQueen(int queens[], int row, int col) {
  int i = 0;
  for (i = 0; i < row; i++) {
    // vertical attacks
    if (queens[i] == col)
      return;
    // diagonal attacks
    if (abs(queens[i] - col) == (row - i))
      return;
  }
  // column is ok, set the queen
  queens[row] = col;
  if (row == g_nsize - 1) {
    nrOfSolutions++;
  } else {
    // try to fill next row
    for (i = 0; i < g_nsize; i++)
      setQueen(queens, row + 1, i);
  }
}

void solve(int size) {
  g_nsize = size;
  for (int i = 0; i < g_nsize; i++) {
    // create separate array for each recursion
    int* pNQ = new int[g_nsize];
    // try all positions in first row
    setQueen(pNQ, 0, i);
    delete[] pNQ;
  }
}
The NQueens program computes the number of ways you can place n queens on an nxn chessboard with none being attacked. It prints the result and the elapsed time. The program's default value for n is 13. The NQueens algorithm proceeds in the following way. The loop in the solve() function places a queen in each of the size columns of the first row, and then calls the setQueen() function to place queens in the remaining rows. The setQueen() function tries a queen in each column of the next row. If it doesn't “fit,” setQueen() goes to the next column. If more rows exist, it calls itself recursively on the next row; otherwise, a solution has been found and the nrOfSolutions global variable is incremented — and in these cases setQueen() also goes on to the next column.
You can find the nqueens_Advisor.zip file that ships with Advisor in the Samples\<locale> folder in the Parallel Studio 2011 install folder, usually C:\Program Files\Intel\Parallel Studio 2011. Unzip the file into a writable folder. Start Visual Studio 2005, 2008, or 2010, and open the solution file nqueens_Advisor\nqueens_Advisor.sln in that folder; for VS 2008 or 2010, the .sln file will be converted, so follow the conversion wizard's directions.
Figure 10.3 shows the Advisor toolbar, which appears in the Visual Studio toolbar area. It provides one of the several ways of invoking Advisor and the Advisor tools.
You should start by opening the Workflow tab. In addition to using the toolbar, you can start the three analysis tools from the Workflow tab either by clicking the corresponding button or by selecting VS Tools ⇒ Intel Parallel Advisor 2011.
Recall the discussion of Amdahl's Law in Chapter 1, “Parallelism Today,” which says that parallel speedup is limited by the execution time of the portion of the program that remains serial. The obvious conclusion is that you need to discover where your serial program spends the most time and focus there in order to find the most effective parallel speedup.
This is what the Survey tool helps you do: it runs and profiles the program to show where the program spends its time.
Your goal in this step is to find candidate parallel regions. You make the decisions — the Survey tool provides timing information and helps you navigate your program. You may already have candidate regions in mind, but run a Survey analysis anyway so that you have quantitative data about how much time is spent in each portion of the program.
If you were doing serial optimization, you would find hotspots that have the highest Self Time and reduce the time there (that is, by reducing the number of executed instructions). Looking elsewhere will not help serial execution time!
In contrast, with parallel optimization you don't need to focus just on a hotspot — you can also look along the chain of loops and function calls from the application's entry point to the hotspot for candidate parallel regions that have high Total Time — time spent there and in called functions (including the hotspot). This is because the objective of parallel optimization is to distribute the execution time (the executed instructions) over as many tasks/cores as possible. The parallel program typically executes more instructions than the serial program (due to task overhead), but it consumes less elapsed time because the work is spread among multiple tasks at the same time on multiple cores.
To run a Survey analysis, begin by building a release configuration of your program. For best results, turn on debug information so that the Survey tool can access symbols, and turn off inlining so that all functions in the source-level call chain appear in the Survey Report. Survey analysis has low overhead — it allows the program to execute at nearly full speed — so employ a data set that exercises the program the way it is normally used. Start the Survey analysis using the Advisor toolbar, Workflow tab, or the Tools ⇒ Intel Parallel Advisor 2011 menu.
The Survey Report for NQueens has several columns (see Figure 10.4):
The basic strategy is to look along hot call/loop chains in the Function Call Sites and Loops column from the upper left toward the lower right for candidate parallel regions:
Qsort(array) {
  Partition array into [less_eq_array, "center" element, greater_array];
  Qsort(less_eq_array);
  Qsort(greater_array);
}
Double-clicking a loop or function call in the Survey Report takes you to the Survey Source window, which shows the source code to help you determine if this is a good parallel site (see Figure 10.5). The information displayed includes:
Double-click in the Survey Source window to enter the Visual Studio editor on the corresponding file. Return to the Survey Report from the editor by selecting the My Advisor Results tab for the current Visual Studio project, or click the arrow icon in the “1. Survey Target” section of the Workflow tab. To return from Survey Source to the Survey Report, click the Survey Report button or the arrow icon.
When you start a Survey analysis, it runs the current program. Occasionally it takes a sample of where the program is executing, computing the call chain and also noting locations along the chain that are in a loop. When the program completes, the analysis scales the samples to determine the Self Time and the Total Time, sorts the call/loop chains by highest Total Time, and displays the Survey Report. Because the analysis employs coarse sampling, there is usually minimal slowdown of the program. Coarse sampling is sufficient because the goal is to identify high-frequency events: hotspots and hot call chains.
C:\Program Files (x86)\Intel\Parallel Studio 2011\Samples\en_US\nqueens_Advisor.zip
C:\Program Files\Intel\Parallel Studio 2011\Samples\en_US\nqueens_Advisor.zip
You communicate to Advisor where you want to try candidate parallel regions by adding annotations to your program. This section describes the parallel model that annotations simulate, the common annotations and parallel constructs they can represent, and how to add them to your program. Recall that Advisor is an inexpensive way to try parallelism in different places. Annotations are cheap — feel free to experiment!
Advisor's Suitability and Correctness tools run your serial program and model how it would behave if it were parallel as specified by the annotations — that is, they pretend it is running in parallel.
Advisor tools model fork-join parallelism as expressed by the following Advisor annotations:
Fork-join parallelism is sufficient to model Intel Cilk Plus, OpenMP, and most of the parallel algorithms in Intel Threading Building Blocks (TBB). Following are some examples of Advisor annotations for parallel regions:
ANNOTATE_SITE_BEGIN(big_loop);
for (i = 0; i < n; i++) {
  ANNOTATE_TASK_BEGIN(loop);
  Statement1;
  …
  Statementk;
  ANNOTATE_TASK_END(loop);
}
ANNOTATE_SITE_END(big_loop);
// Qsort sorts the array a in place, using modeled recursive parallelism.
void Qsort(array a) {
  // If a is small enough, sort it directly and return.
  // Otherwise, pick an element e from array a.
  // Rearrange the elements within a so that it is partitioned in 3 parts:
  //   a == [elements <= e; e; elements > e]
  // Let array less_eq_array be a reference to the first partition of a.
  // Let array greater_array be a reference to the last partition of a.
  // Recursively apply Qsort to each of these array references, in parallel.
  ANNOTATE_SITE_BEGIN(qsort);
  ANNOTATE_TASK_BEGIN(qsort_low);
  Qsort(less_eq_array);
  ANNOTATE_TASK_END(qsort_low);
  ANNOTATE_TASK_BEGIN(qsort_high);
  Qsort(greater_array);
  ANNOTATE_TASK_END(qsort_high);
  ANNOTATE_SITE_END(qsort);
}
// Inner loop nest (simplified) from the ray-tracing sample program
// tachyon_Advisor. Two nested loops on y and x; each inner iteration
// renders one pixel in a rectangular grid.
// Processing one pixel is independent of every other pixel, so they
// can all be done in parallel. This is modeled using nested parallelism.
ANNOTATE_SITE_BEGIN(allRows);
for (int y = starty; y < stopy; y++) {
  ANNOTATE_TASK_BEGIN(eachRow);
  ANNOTATE_SITE_BEGIN(allColumns);
  for (int x = startx; x < stopx; x++) {
    ANNOTATE_TASK_BEGIN(eachColumn);
    color_t c = render_one_pixel(x, y, …);
    put_pixel(c);
    ANNOTATE_TASK_END(eachColumn);
  }
  ANNOTATE_SITE_END(allColumns);
  ANNOTATE_TASK_END(eachRow);
}
ANNOTATE_SITE_END(allRows);
Lock annotations model the protection of data that is shared among multiple tasks. Note that you usually add lock annotations only after you have run the Correctness tool and found cases of unprotected data sharing that need to be fixed.
The following example shows how to protect the incrementing of a shared variable inside a task using lock annotations:
ANNOTATE_LOCK_ACQUIRE(0);  // zero is a convenient address
shared_variable++;
ANNOTATE_LOCK_RELEASE(0);
Although the preceding examples show paired site and task annotations that match statically in the source code, the pairs actually must match at execution time, because the annotations have their parallel-modeling effect at run time. So, if multiple execution paths can exit such a region, it is necessary to have multiple "closing" annotations (two lock releases in this case):
static int my_lock;
ANNOTATE_LOCK_ACQUIRE(&my_lock);
if (shared_variable == 0) {
  ANNOTATE_LOCK_RELEASE(&my_lock);
  return;
}
shared_variable++;
ANNOTATE_LOCK_RELEASE(&my_lock);
Some other special-purpose annotations are explained in the Advisor documentation.
Advisor has some features to simplify adding annotations to your code in the editor. Note that you make the decisions about parallel regions; Advisor helps you generate the correct syntax. To add annotations, follow these steps:
Recall that if the flow of control can leave a region by different paths (for example, a return), it may be necessary to have multiple ending annotations. The Annotation Wizard does not handle this case, so you will need to recognize this situation and insert the additional *END annotation by hand.
Annotations are actually C/C++ macros that expand into calls to null functions with special names; the Advisor tools recognize the names and model the corresponding behavior. And because annotations are just macros, you can employ any C/C++ compiler to build your annotated program.
Every source file using annotations needs to include the file advisor-annotate.h, which defines the annotation macros:
#include "advisor-annotate.h"
The Annotation Wizard in the editor can help with this step. This include file is located in the directory $(ADVISOR_2011_DIR)/include, so you also need to add this include path to the Additional Include Directories in Build Configurations under Properties ⇒ C/C++ ⇒ General for all projects and configurations using annotations.
Suitability analysis provides coarse-grained speedup estimates for the annotated code. The purpose of the performance information is to guide your decisions about these sites:
In either case, you have made progress with a small expenditure of effort because you are using modeling.
You can answer other questions. Does the performance match your expectations from the Survey Report? Are there parallelization-related performance issues (for example, overhead items)?
If you have fixed correctness issues by adding locks or restructuring the code (on the previous iteration through the Advisor workflow), the projected parallel performance may have changed since the last time you ran the Suitability analysis. So, you need to run it again after modifying your annotations or your code.
To run a Suitability analysis, begin by building a release configuration of your program (similar to a Survey analysis, but the program now has annotations) and use the same data set. Start the Suitability analysis from the Advisor toolbar, Workflow tab, or from the Tools menu. The Suitability tool runs the program, analyzing what its performance characteristics might be. There is typically less than a 10 percent slowdown compared to normal program execution. However, if many task instances have a small number of executed instructions, the modeling overhead could be higher and the accuracy of the estimates may suffer. For example, if the average time for tasks is less than 0.0001 seconds (displayed in the Selected Site pane), the instrumentation overhead in the Suitability tool may cause the predicted speedups to be too small.
The Suitability Report for NQueens appears in Figure 10.8. It displays the following panes of information. All performance data consists of modeled estimates about how the program might behave if it were parallel.
Double-clicking a site or task name displays the corresponding source code in the Suitability Source window. Return to the Suitability Report by clicking the Suitability Report button.
A summary of all your annotations is provided in the Summary Report. This is described in the later section “Replacing Annotations.” An example appears in Figure 10.13.
This section describes the meaning and effect of the parallel choice boxes in the Selected Site pane of the Suitability Report.
Figure 10.9 shows the Selected Site pane for a program with lock annotations. In the scalability graph, the balls indicating current estimated gain are in the red, meaning no speedup. However, the bars reach into the green and indicate that there is a range of performance depending on the parallel choices listed to the right. In particular, Advisor shows that a 5.35x speedup can be achieved if you select Reduce Lock Contention, and also recommends that you do so.
Figure 10.10 shows the result of clicking the Reduce Lock Contention box. The balls in the graph are now in the green, representing very good speedup. By clicking the box, you have agreed to take some action(s) to reduce lock contention when you convert to actual parallel constructs. Note that Advisor only predicts the effect of reducing lock contention — you have the responsibility of implementing that decision later when you add parallel code!
You have multiple ways to use the Suitability Report to determine what parallel performance your program might have, and what you might change to achieve improvements. First look at the Maximum Program Gain, and then for each site examine the scalability graph and the parallel choices. Is the program gain what you expected? Change the number of CPUs to check the scalability or to match the number of CPUs on your target platform. Answer the same questions about the gain for each site, and study the scalability graph for each site.
If a site's speedup is low, click it and examine its Selected Site pane:
You can also experiment with the sensitivity of the performance by varying the model parameters and the parallel choices, looking for significant changes in the results. This Sensitivity analysis is fast because all the results have been precomputed — Suitability analysis is not run again.
When you start a Suitability analysis, it runs the current program, keeping track of site, task, and lock annotations, and the time spent in each. It then models what the performance of the program would be if it were run in parallel as specified by the annotations, and for all combinations of modeling parameters and parallel choices. It then displays the coarse-grained estimates in the Suitability Report.
Here is a more detailed description of the Suitability analysis:
A key component of parallel modeling is the task scheduler. It has a queue of tasks that are ready to “execute.” The scheduler assigns tasks to cores as the cores complete other tasks. The simulator keeps track of the simulated elapsed time for the sites, tasks, and locks. Note that the simulation does not take into account cache or memory effects from tasks running on different cores. The only inter-task performance impacts are from locks.
The simulation is run for every combination of number of CPUs, threading model, and the five parallel choices, and then the results are saved. When you change one of the values in the Suitability Report, the new result is displayed immediately because it has been precomputed. The reason for building the execution tree is that it is used multiple times for the simulations.
The Target CPU Number affects how many cores are available for the scheduler to allocate to tasks. The Threading Model affects the overheads of individual site, task, and lock operations. The parallel choices have different impacts. For example, the option “fix task overhead” is modeled by having the simulator use zero for task overhead. For the option “fix lock contention,” the simulator never makes a task wait for a lock. (Normally, the simulator causes a task to wait for the lock to be free and records the additional simulated elapsed time for that task.)
You have run the Suitability analysis and are feeling good because you have found some sites that are projected to provide parallel speedups. Now it's time for a reality check; if you parallelize your program in these locations, will there be data-sharing problems or deadlocks that will cause the parallel program to be incorrect? The purpose of checking correctness is to predict if these issues will occur.
Not only does correctness modeling tell you if errors exist, but it also helps you navigate to all of the source locations participating in a data-sharing error or a deadlock. You need this in order to fix the problem.
Or, you may decide that the correctness errors are too difficult to fix or will take too much development time relative to the projected speedup for a parallel site. So, if the return on investment (ROI) is too small, abandon this site and remove its annotations. You have been able to quickly experiment with this site, and now you can go on to other sites.
To run a Correctness analysis, begin by building a debug configuration of your program, making sure that the build configuration uses the dynamic runtime library (Configuration Properties ⇒ C/C++ ⇒ Code Generation ⇒ Runtime Library is /MD or /MDd). Correctness analysis needs optimization turned off so that all memory references are retained in the generated code, in their original program order, because the modeling tracks all the loads and stores. Correctness modeling causes a significant slowdown of the program, perhaps 100 times slower, so you should use a reduced input data set to minimize the run time. However, the reduced data set should still cause the program to traverse all the paths within the sites. For example, if the Survey or Suitability input data set causes a "parallel" loop to execute one million iterations, it is probably sufficient for correctness modeling if the reduced data set causes the loop to execute only a few iterations. Start the Correctness analysis using the Advisor toolbar, the Workflow tab, or the Tools ⇒ Intel Parallel Advisor 2011 menu.
As mentioned, performing a Correctness analysis can cause a significant expansion of execution time. So when the Correctness tool is running your program, it displays each “observation” as the program runs. If enough error observations have occurred, you can stop the program by clicking the red Stop button on the Advisor toolbar, or by closing your program's window. A Correctness Report will be created for these observations, even though the program has not run to completion.
The Correctness Report for NQueens displays several panes of information (see Figure 10.11):
You can navigate to the Correctness Source window by double-clicking the corresponding line in the Correctness Report. Figure 10.12 shows the Correctness Source window for the P1 memory reuse. The following panes of information appear:
Double-click a snippet in the Correctness Source window to enter the Visual Studio editor on the corresponding file. Return to the Correctness Report from the editor by selecting the My Advisor Results tab for the current VS project, or click the arrow in the “4. Check Correctness” section of the Workflow tab. To return from the Correctness Source window to the Correctness Report, click either the Correctness Report button or the arrow.
Correctness analysis discovers the following four problem categories that you need to understand and fix (or abandon the site). The components of the Correctness Report attempt to assist you in deciphering the cause of the problem.
static int temp;
…
ANNOTATE_SITE_BEGIN(big_loop);
for (i = 0; i < n; i++) {
  ANNOTATE_TASK_BEGIN(loop);
  temp = a[i];
  b[i] = … temp …;
  ANNOTATE_TASK_END(loop);
}
ANNOTATE_SITE_END(big_loop);
ANNOTATE_SITE_BEGIN(big_loop);
for (i = 0; i < n; i++) {
  ANNOTATE_TASK_BEGIN(loop);
  int temp;
  temp = a[i];
  b[i] = … temp …;
  ANNOTATE_TASK_END(loop);
}
ANNOTATE_SITE_END(big_loop);
static int counter = 0;
…
ANNOTATE_SITE_BEGIN(big_loop);
for (i = 0; i < n; i++) {
  ANNOTATE_TASK_BEGIN(loop);
  …
  counter++;
  …
  ANNOTATE_TASK_END(loop);
}
ANNOTATE_SITE_END(big_loop);
static int counter = 0;
static int my_lock;
…
ANNOTATE_SITE_BEGIN(big_loop);
for (i = 0; i < n; i++) {
  ANNOTATE_TASK_BEGIN(loop);
  …
  ANNOTATE_LOCK_ACQUIRE(&my_lock);
  counter++;
  ANNOTATE_LOCK_RELEASE(&my_lock);
  …
  ANNOTATE_TASK_END(loop);
}
ANNOTATE_SITE_END(big_loop);
ANNOTATE_LOCK_ACQUIRE(&lock1);
counter++;
ANNOTATE_LOCK_RELEASE(&lock1);
…
ANNOTATE_LOCK_ACQUIRE(&lock2);  // protected by a different lock
counter++;
ANNOTATE_LOCK_RELEASE(&lock2);
…
// not protected by any lock
counter++;
// Region 1
ANNOTATE_LOCK_ACQUIRE(&lock1);
ANNOTATE_LOCK_ACQUIRE(&lock2);
…
ANNOTATE_LOCK_RELEASE(&lock2);
ANNOTATE_LOCK_RELEASE(&lock1);
…
// Region 2
ANNOTATE_LOCK_ACQUIRE(&lock2);
ANNOTATE_LOCK_ACQUIRE(&lock1);
…
ANNOTATE_LOCK_RELEASE(&lock1);
ANNOTATE_LOCK_RELEASE(&lock2);
There are several approaches to using the Correctness Report and Correctness Source window to find, understand, and fix sharing problems that would occur if your program were parallel.
Diagnose in detail what is causing each problem by exploring the corresponding source locations and call stacks. The problem statement and observation code snippets in the Correctness Report may be sufficient for discovering the error. For example, if you are incrementing a global counter, you need a lock.
In other cases, the Correctness Source window provides more details about what leads to the occurrence of the problem. One complication is that you have to comprehend the distinct code that two tasks might be executing at the same time, which can cause the interference. Another is that the object being shared might be a parameter, so it may have different names in the two tasks. This is where the call stack is handy; it enables you to examine the source code at different levels of the stack so that you can track how an object is passed through multiple function calls.
Decide if there are too many hard problems to fix for this site, in which case you can either change the location of the site and tasks or abandon the site altogether. Otherwise, fix the problems by employing your understanding of each problem, picking a strategy to fix it, and using the source locations to enter the editor at the appropriate places to make the required source changes.
Rebuild the modified program and run a fresh Correctness analysis to verify that your changes do in fact fix the identified problems and do not introduce new problems. (And after converting your program to parallel constructs, use Intel Parallel Inspector XE to determine if any other classes of memory-sharing problems exist.) Now return to the Suitability analysis step to see what impact these changes may have on performance.
There is a case of a potential data-sharing problem that Correctness analysis cannot distinguish from the safe usage of a local variable. The potential error is not reported because it would also report errors on the safe case, thus causing false positives. This is one reason you should always run Intel Parallel Inspector XE after adding parallel constructs — Inspector can distinguish these two cases.
The following code fragment demonstrates both a data-sharing issue and a safe usage:
void foo(…) {
  int relatively_global = 0;
  …
  ANNOTATE_SITE_BEGIN(big_loop);
  for (i = 0; i < n; i++) {
    ANNOTATE_TASK_BEGIN(loop);
    int relatively_local = 0;
    …
    relatively_local++;   // safe
    relatively_global++;  // unprotected sharing!
    …
    ANNOTATE_TASK_END(loop);
  }
  ANNOTATE_SITE_END(big_loop);
}
The relatively_global variable is local to the foo function but global relative to the tasks in the loop. All the tasks share the object, so when it is incremented by the tasks in the parallel program, there is a data-sharing error. In contrast, the relatively_local variable is declared within the task, so when the program is parallel, each task will have its own copy. Incrementing relatively_local therefore causes no sharing problem.
The issue is that in the serial program, the compiler creates both variables as local stack variables of the foo function. Therefore, the Correctness tool cannot distinguish the two different cases. The design choice was to report either both as errors or neither as errors. The decision was made to avoid annoying false positives and rely on Inspector to catch any true sharing errors. Note that this situation arises only when the task is in a function and the variable declaration (for example, relatively_global) occurs in the same function or the calling function.
When you start a Correctness analysis, it runs the current program, tracking all memory references and annotations that occur. It models which references to the same object could occur in different tasks at the same time if the program were run in parallel, taking into consideration the constraints of which tasks can run at the same time, and lock regions. It then combines related observations into problems and displays them in the Correctness Report.
Here is a more detailed description of Correctness analysis:
When you have a site or sites with good predicted performance and the correctness issues have been resolved, you can convert your parallel-ready program to a true parallel program. First, choose a parallel programming model, such as one of the Intel Parallel Building Blocks, or some other approach. (See Chapter 7, “Implementing Parallelism,” for descriptions of parallel models and how to use them.) Then replace each Advisor annotation with the corresponding parallel construct. This section shows some of these mappings; Advisor documentation contains a more complete set of mappings for Intel Threading Building Blocks and Intel Cilk Plus.
Figure 10.13 shows the Summary Report, which you can display either by clicking the Summary button at the top of the Advisor window or by clicking the arrow icon for the “5. Add Parallel Framework” step in the Workflow tab.
The Summary Report provides a high-level overview of the progress on sites, suitability, and correctness in the program. It shows the kind and location of every annotation in the program. For each site, the report displays the estimated speedup of the site and the entire program (if Suitability analysis has been run) and the number of correctness problems (if Correctness analysis has been run). Figure 10.13 shows the Summary Report for NQueens before the data-sharing problems have been fixed (there are still two errors). The bottom of the report shows the modeling assumptions used (for example, eight CPUs), which you compare against the speedups.
An ROI comparison can be performed from the Summary Report. For a program with numerous parallel sites, you can use the Summary Report to balance the amount of speedup against the amount of development work needed to fix the correctness problems for a site, and then compare the sites to each other to prioritize sites where you can expect the best ROI.
The Summary Report is also the natural place to start when you are moving to parallel constructs, because it lists every annotation in the program. To replace an annotation, double-click its line in the Summary Report, which takes you into the Visual Studio editor at the line containing that annotation, and then insert the corresponding parallel construct.
This section shows simple mappings from annotations representing loop parallelism and task parallelism to Intel Threading Building Blocks (Intel TBB) and Intel Cilk Plus. It also demonstrates how to replace lock annotations with the Intel TBB spin_mutex for both Intel TBB and Intel Cilk Plus.
ANNOTATE_SITE_BEGIN(big_loop);
for (i = 0; i < n; i++) {
  ANNOTATE_TASK_BEGIN(loop);
  Statement;
  ANNOTATE_TASK_END(loop);
}
ANNOTATE_SITE_END(big_loop);
#include <tbb/tbb.h>
…
tbb::parallel_for(0, n, [&](int i) { Statement; });
#include <cilk/cilk.h>
…
cilk_for (i = 0; i < n; i++) {
  Statement;
}
ANNOTATE_SITE_BEGIN(qsort);
ANNOTATE_TASK_BEGIN(qsort_low);
Qsort(less_eq_array);
ANNOTATE_TASK_END(qsort_low);
ANNOTATE_TASK_BEGIN(qsort_high);
Qsort(greater_array);
ANNOTATE_TASK_END(qsort_high);
ANNOTATE_SITE_END(qsort);
#include <tbb/tbb.h>
…
tbb::parallel_invoke(
  [&] { Qsort(less_eq_array); },
  [&] { Qsort(greater_array); }
);
#include <cilk/cilk.h>
…
// version 1: for function calls
cilk_spawn Qsort(less_eq_array);
Qsort(greater_array);
cilk_sync;

// version 2: for general statements wrapped in lambda expressions
cilk_spawn [&] { statement-1; }();
statement-2;
cilk_sync;
static int my_lock;
…
ANNOTATE_LOCK_ACQUIRE(&my_lock);
shared_variable++;
ANNOTATE_LOCK_RELEASE(&my_lock);
#include "tbb/spin_mutex.h" … static tbb::spin_mutex my_mutex; … { // Declare my_lock in its own scope; on scope exit // the destructor will unlock it. tbb::spin_mutex::scoped_lock my_lock(my_mutex); shared_variable ++; }
You could avoid using locks altogether by declaring shared_variable to be a Cilk Plus reducer. If you look at the 3_nqueens_cilk project, you will see how to do this.
Intel Parallel Advisor is a unique tool that helps you add parallelism to your programs. This chapter has demonstrated how to use Advisor effectively:
You should now understand the value of parallel modeling: