OpenCL Global Memory
Issues related to memory in terms of temporal and spatial locality were discussed in
Chapter 3. Obtaining peak performance from an OpenCL program depends heavily on utilizing memory efficiently. Unfortunately, efficient memory access is highly dependent on the particular device on which the OpenCL program is running. Access patterns that may be efficient on the GPU may be inefficient when run on a CPU. Even when we move an OpenCL program to GPUs from different manufacturers, we can see substantial differences. However, there are common practices that will produce code that performs well across multiple devices.
In all cases, a useful way to start analyzing memory performance is to judge what level of throughput a kernel is achieving. A simple way to do this is to calculate the effective memory bandwidth of the kernel:

EB = (Br + Bw) / T

where
EB is the effective bandwidth;
Br is the number of bytes read from global memory;
Bw is the number of bytes written to global memory; and
T is the time required to run the kernel.
T can be measured using profiling tools such as the AMD Stream Profiler (which is discussed in
Chapter 11).
Br and
Bw can often be calculated by multiplying the number of bytes each work item reads or writes by the global number of work items. Of course, in some cases, this number must be estimated because we may branch in a data-dependent manner around reads and writes.
Once we know the bandwidth measurement, we can compare it with the peak bandwidth of the execution device and determine how far away we are from peak performance: The closer to peak, the more efficiently we are using the memory system. If our numbers are far from peak, then we can consider restructuring the memory access pattern to improve utilization.
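As a concrete illustration, the host-side fragment below computes EB using OpenCL's event profiling interface. It is a minimal sketch: it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, that queue, kernel, and the launch sizes already exist, and that the byte counts bytesRead and bytesWritten have been tallied for the kernel in question.

// Time a kernel with event profiling and compute effective bandwidth.
cl_event kernelEvent;
err = clEnqueueNDRangeKernel( queue, kernel, 1, NULL,
                              &globalSize, &localSize, 0, NULL, &kernelEvent );
clWaitForEvents( 1, &kernelEvent );

cl_ulong start, end;
clGetEventProfilingInfo( kernelEvent, CL_PROFILING_COMMAND_START,
                         sizeof(cl_ulong), &start, NULL );
clGetEventProfilingInfo( kernelEvent, CL_PROFILING_COMMAND_END,
                         sizeof(cl_ulong), &end, NULL );

double T = ( end - start ) * 1e-9;              // timestamps are in nanoseconds
double EB = ( bytesRead + bytesWritten ) / T;   // bytes per second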
Spatial locality is an important consideration for OpenCL memory access. Most architectures on which OpenCL runs are vector based at some level (whether through SSE-like vector instructions or through hardware vectorization of lane-wise programming models such as AMD IL or NVIDIA PTX), and their memory systems benefit from issuing accesses together across this vector. In addition, localized accesses offer caching benefits.
On most modern CPUs, there is a vector instruction set; the various versions of SSE and AVX are good examples. For efficient memory access, we want to design code such that full, aligned, vector reads are possible using these instruction sets. Given the small vector size, the most efficient way to perform such vector reads is to give the compiler as much information as possible by using vector data types such as
float4. Such accesses make good use of cache lines, moving data between the cache
and registers as efficiently as possible. However, on these CPUs, caching helps cover some of the performance loss from performing smaller, unaligned, or more randomly addressed reads.
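As a sketch of what this looks like in a kernel (the kernel name is illustrative), the code below copies data using float4 accesses: each work item moves 16 aligned bytes, which maps naturally onto a single SSE load and store on the CPU.

__kernel void copyVec4( __global const float4 *input,
                        __global float4 *output )
{
    // One full, aligned, 128-bit load and store per work item.
    size_t gid = get_global_id(0);
    output[gid] = input[gid];
}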
Figures 6.10 and
6.11 provide a simple example of the difference between a single contiguous read and a set of four random reads. Not only do the narrower reads hit multiple cache lines (incurring more cache misses when the data is not already resident), but they also cause less efficient transfers to be passed through the memory system.
GPU memory architectures differ significantly from CPU memory architectures, as discussed in
Chapter 3. GPUs use multithreading to cover some level of memory latency and are biased in favor of ALU capability rather than caching and sophisticated out-of-order logic. Given the large amounts of compute resources available on typical GPUs, it becomes increasingly important to provide high bandwidth to the memory system if we do not want to starve the GPU. Many modern GPU architectures, particularly high-performance desktop versions such as the latest AMD Radeon and NVIDIA GeForce designs, utilize a wide-SIMD architecture. Imagine the loss of efficiency in
Figure 6.11 scaled to a 64-wide hardware vector, as we see in the AMD Radeon HD6970 architecture.
Efficient access patterns differ even among these architectures. For an x86 CPU with SSE, we would want to use 128-bit float4 accesses, and we would want as many accesses as possible to fall within cache lines to reduce the number of cache misses.
For the AMD Radeon HD6970 GPU architecture, 16 consecutive work items will issue a memory request at the same time. These requests will be delayed in the memory system if they cannot be efficiently serviced. For peak efficiency, each of these 16 work items should issue a 128-bit read such that the reads form a contiguous 256-byte memory region, allowing the memory system to create a single large memory request.

To achieve reasonable portability across different architectures, a good general solution is to compact the memory accesses as effectively as possible, allowing the wide vector machines (AMD and NVIDIA GPUs) and the narrow vector machines (x86 CPUs) both to use the memory system as efficiently as possible. To achieve this, we should access memory across a whole workgroup starting with a base address aligned to workgroupSize * loadSize, where loadSize is the size of the load issued per work item. loadSize itself should be reasonably sized: preferably 128 bits on x86 CPUs and AMD GPU architectures, expanding to 256 bits on AVX-supported architectures.
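The kernel fragment below sketches the difference (the names and the stride parameter are illustrative): the first read follows the compacted pattern, with consecutive work items reading consecutive float4 elements from an aligned base, while the second scatters one vector request over many memory regions.

__kernel void accessPatterns( __global const float4 *input,
                              __global float4 *output,
                              int stride )
{
    size_t gid = get_global_id(0);

    // Compacted: the workgroup covers one contiguous, aligned region of
    // workgroupSize * sizeof(float4) bytes that the memory system can
    // service with a few large transactions.
    float4 contiguous = input[gid];

    // Strided: consecutive work items read widely separated elements,
    // splitting each vector memory request across many 256-byte regions.
    float4 strided = input[gid * stride];

    output[gid] = contiguous + strided;
}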
Further complications arise when dealing with the specifics of different memory systems, such as reducing conflicts on the off-chip links to DRAM. For example, let us consider the way in which the AMD Radeon HD6970 architecture allocates its addresses.
Figure 6.12 shows that the low 8 bits of the address are used to select the byte within the memory bank; this gives us the cache line and sub-cache line read
locality. If we try to read a column of data from a two-dimensional array, we already know that we are inefficiently using the on-chip buses. It also means that we want multiple groups running on the device simultaneously to access different memory channels and banks. Each memory channel is an on-chip memory controller corresponding to a link to an off-chip memory (
Figure 6.13). We want accesses across the device to be spread across as many banks and channels in the memory system as possible, maximizing concurrent data access. However, a vector memory access from a single wavefront that hits multiple memory channels (or banks) occupies those channels, blocking access from other wavefronts and reducing overall memory throughput. Optimally, we want a given wavefront to be contained within a given channel and bank, allowing multiple wavefronts to access multiple channels in parallel. This will allow data to stream in and out of memory efficiently.
To avoid using multiple channels, a single wavefront should access addresses from within a 64-word region, which is achievable if all work items read 32 bits from consecutive addresses. The worst possible situation is if each work item in multiple wavefronts reads an address with the same value above bit 8: Each one hits the same channel and bank, and accesses are serialized, achieving a small fraction of peak bandwidth. More details on this subject for AMD architectures can be found in AMD's OpenCL programming guide (
Advanced Micro Devices, Incorporated, 2011). Similar information is provided to cover the differences in competing architectures from the respective vendors—for example, NVIDIA's CUDA programming guide (
NVIDIA Corporation, 2010).
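To make the mapping concrete, the small helper below decodes a channel index under the assumptions stated in the text; it is for illustration only, because the real channel and bank bit positions are device specific and should be taken from the vendor programming guides.

// Hypothetical address decode: bits 0-7 select the byte within a bank,
// and we assume here that bits 8-10 select one of eight memory channels.
unsigned channelOf( unsigned address )
{
    return ( address >> 8 ) & 0x7;
}
// Under this assumed layout, work items whose addresses differ by a
// multiple of 2048 bytes all map to the same channel, serializing their
// accesses on one memory link.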
Local Memory as a Software-Managed Cache
Most OpenCL-supporting devices have some form of cache support. Due to their graphics-oriented designs, many GPUs have read-only data caches that enable some amount of spatial reuse of data.
The easiest way to guarantee this use on the widest range of devices is to use Images (discussed in
Chapter 5). Images map data sets to the texture read hardware and, assuming that complicated filtering and two-dimensional access modes are not needed, improve memory efficiency on the GPU. However, GPU caches are small compared with the number of active thread contexts reading data. Programmer-controlled scratchpad memory in the local address space is an efficient approach for caching read data with less overhead from wasted space than hardware-controlled caches, better power efficiency, and higher performance for a given area. It is also
useful as a way to exchange data with other work items in the same workgroup with a very low and, more important, guaranteed access latency.
The code in
Figure 5.11 shows a simple example of this approach. The code loads a range of data from A into C and then accesses multiple values from it, avoiding a second read from DRAM. At the same time, the code loads a single value from
aLocalArray just once and reuses it across all work items in the group, thereby considerably reducing the memory traffic. On some CPU implementations, some or all of this reuse happens in the hardware cache anyway (on x86 processors, local memory is typically mapped to ordinary cacheable memory). Given that we have knowledge of the underlying memory access patterns, we can control how much reuse of data is present in the application.
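A minimal sketch of this staging pattern (not the exact code of the figure; the array size assumes a workgroup of 256 work items) looks like the following: each work item performs one global read into local memory, the group synchronizes, and subsequent reads are served from the scratchpad rather than DRAM.

__kernel void localStaging( __global const float *A,
                            __global float *C )
{
    __local float staged[256];   // assumes get_local_size(0) == 256
    size_t lid = get_local_id(0);

    // One global memory read per work item into the scratchpad.
    staged[lid] = A[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work item then reuses several staged values, avoiding
    // further trips to DRAM.
    float sum = 0.0f;
    for( int i = 0; i < 4; i++ ) {
        sum += staged[ (lid + i) % 256 ];
    }
    C[get_global_id(0)] = sum;
}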
Of course, there are trade-offs when considering how best to optimize data locality. In some cases, the overhead of the extra copy instructions required to move data into local memory and then back out into the ALU (possibly via registers) will be less efficient than simply reusing the data out of cache. Moving data into local memory is most useful when there are large numbers of reads and writes reusing the same locations, when the lifetime of a write is very long with a vast number of reads using it, or when manual cache blocking offers a way to correct for conflict misses that can often be problematic in two-dimensional data access patterns.
In the case of read/write operations, the benefit of local memory becomes even more obvious, particularly given the wide range of architectures with read-only caches. Consider, for example, the following relatively naive version of a prefix sum code:
void localPrefixSum(
    __local unsigned *prefixSums,
    unsigned numElements ) {
  // Run through levels of tree halving sizes of the element set
  // performing reduction phase
  int offset = 1;
  for( int level = numElements/2; level > 0; level /= 2 ) {
    barrier(CLK_LOCAL_MEM_FENCE);
    for( int sumElement = get_local_id(0);
         sumElement < level;
         sumElement += get_local_size(0) ) {
      int ai = offset*(2*sumElement+1)-1;
      int bi = offset*(2*sumElement+2)-1;
      prefixSums[bi] = prefixSums[ai] + prefixSums[bi];
    }
    offset *= 2;
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  // Need to clear the last element
  if( get_local_id(0) == 0 ) {
    prefixSums[ numElements-1 ] = 0;
  }

  // Push values back down the tree
  for( int level = 1; level < numElements; level *= 2 ) {
    offset /= 2;
    barrier(CLK_LOCAL_MEM_FENCE);
    for( int sumElement = get_local_id(0);
         sumElement < level;
         sumElement += get_local_size(0) ) {
      int ai = offset*(2*sumElement+1)-1;
      int bi = offset*(2*sumElement+2)-1;
      unsigned temporary = prefixSums[ai];
      prefixSums[ai] = prefixSums[bi];
      prefixSums[bi] = temporary + prefixSums[bi];
    }
  }
  barrier(CLK_LOCAL_MEM_FENCE);
}
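For context, a caller would typically stage data into local memory, run the scan, and copy the result back; a hypothetical wrapper kernel (all names illustrative, sized for at most 256 elements) might look like:

__kernel void prefixSumKernel( __global unsigned *data,
                               unsigned numElements )
{
  __local unsigned prefixSums[256];   // assumes numElements <= 256

  // Stage the input into local memory.
  for( unsigned i = get_local_id(0); i < numElements;
       i += get_local_size(0) ) {
    prefixSums[i] = data[i];
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  localPrefixSum( prefixSums, numElements );

  // Write the exclusive prefix sum back to global memory.
  for( unsigned i = get_local_id(0); i < numElements;
       i += get_local_size(0) ) {
    data[i] = prefixSums[i];
  }
}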
Although localPrefixSum is not optimal for many architectures, it does effectively share data between work items using a local array. The data flow of the first loop
(level = numElements>>1 to 0) is shown in
Figure 6.14. Note that each iteration of the loop updates a range of values that a different work item will need to use on the next iteration. Note also that the number of work items collaborating on the calculation decreases on each iteration. The inner loop masks excess work items off to avoid diverging execution across the barrier. To accommodate this behavior, we insert barrier operations between iterations, ensuring that the work items synchronize and thus guaranteeing that the data is ready for the next iteration.
The prefix sum code discussed previously uses local memory in a manner that is inefficient on most wide SIMD architectures, such as high-end GPUs. As mentioned
in the discussion on global memory, memory systems tend to be banked to allow a large number of access ports without requiring multiple ports at every memory location. As a result, scratchpad memory hardware (and caches, similarly) tends to be built such that each bank can perform a limited number of operations per cycle (a read, a concurrent read and write, or some other multiaccess configuration), while wider sets of simultaneous accesses must be spread over multiple banks. This is an important consideration when we are using wide SIMD hardware to access memory. An AMD GPU can issue four instructions, including up to two load instructions, simultaneously through the VLIW unit in each of the 16 SIMD lanes on a given clock cycle. As a result, it can perform as many as 32 reads or writes from the local memory buffer (LDS) on a single cycle when local memory instructions are issued in a VLIW packet. If each bank supports a single access port, then we can only achieve this throughput if all accesses target different memory banks, because each bank can only provide one value. On this architecture, because each work item can perform two reads or writes per cycle, the local data share is constructed of 32 banks. Similar rules arise on competing architectures; NVIDIA's Fermi architecture, for example, also has a 32-banked local memory.
The problem for local memory is not as acute as that for global memory. In global memory, we saw that widely spread accesses would incur latency because they might cause multiple cache line misses. In local memory, at least on architectures with true scratchpads, the programmer knows when the data is present because he or she put it there manually. The only requirement is that the 16 accesses we issue as part of that read or write instruction hit different banks.
Figure 6.15 shows a simplification for comparison—step 2 of the prefix sum in
Figure 6.14 accessing a local memory with eight memory banks, where each work item can perform a single local memory operation per cycle. In this case, our local memory buffer can return up to eight values per cycle from memory. What result do we obtain when performing the set of accesses necessary for step 2 of the prefix sum?
Note that our 16-element local memory (necessary for the prefix sum) is spread over two rows. Each column is a bank, and each row is an address within a bank. Assuming (as is common in many architectures) that each bank is 32 bits wide, our memory address would break down as shown at the top of
Figure 6.15. Two consecutive memory words will reside in separate banks. As with global memory, a SIMD vector that accesses consecutive addresses along its length will efficiently access the local memory banks without contention. In
Figure 6.15, however, we see a different behavior. Given the second access to local memory, the read from
prefixSums[bi] in
prefixSums[bi] = prefixSums[ai] + prefixSums[bi];
tries to read values from locations 3, 7, 11, and 15. As shown in
Figure 6.15, 3 and 11 both sit in bank 3; 7 and 15 both sit in bank 7. There is no possible way to read two rows from the same bank simultaneously, so these accesses will be serialized on GPUs by the hardware, incurring a read delay. For good performance, we might wish to restructure our code to avoid this conflict. One useful technique is to add
padding to the addresses, and an example of this is shown in
Figure 6.16. By shifting
addresses after the first set (aligning to banks), we can change evenly strided accesses to avoid conflicts. Unfortunately, this adds address computation overhead, which can be more severe than the bank conflict overhead; hence, this trade-off is an example of architecture-specific tuning.
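A common formulation of such padding (a sketch, not the code behind the figure; it assumes 32 banks and a workgroup of 256 work items) adds one extra slot per NUM_BANKS entries, so a stride equal to the bank count no longer maps every access to the same bank:

#define NUM_BANKS 32
// Map a logical index to a padded physical index: every NUM_BANKS
// elements skip one slot, pushing later entries into different banks.
#define PADDED(i) ( (i) + ((i) / NUM_BANKS) )

__kernel void paddedStaging( __global const unsigned *in,
                             __global unsigned *out )
{
    // Padded scratchpad: one extra element per NUM_BANKS entries.
    __local unsigned staged[ 256 + 256 / NUM_BANKS ];
    size_t lid = get_local_id(0);

    staged[ PADDED(lid) ] = in[ get_global_id(0) ];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Without padding, this stride-NUM_BANKS pattern would hit a single
    // bank; with padding, the accesses spread across the banks.
    size_t strided = ( lid * NUM_BANKS ) % 256;
    out[ get_global_id(0) ] = staged[ PADDED(strided) ];
}

Note the extra local storage and the additional index arithmetic; as noted previously, this cost is sometimes larger than that of the conflict it removes.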
Local memory should be carefully rationed. Any device that uses a real scratchpad region that is not hardware managed will have a limited amount of local memory. In the case of the AMD Radeon HD6970 GPU, this space is 32 kB, matching the OpenCL minimum requirement. It is important to note that this 32 kB is shared between all workgroups executing simultaneously on the core. In addition, because the GPU is a latency-hiding throughput device that relies on multithreading on each core, the more workgroups that can fit, the better the hardware utilization is likely to be. If each workgroup uses 16 kB, then only two can fit on the core. If these workgroups contain only a small number of wavefronts (one or two), there will be barely enough hardware threads to cover latency. Local memory allocation therefore needs to balance the efficiency gains from sharing data against the efficiency losses of leaving only one or two hardware threads on a multithreaded device.
The OpenCL API includes calls to query the amount of local memory the device provides, and these can be used to parameterize kernels before the programmer compiles or dispatches them. The first call in the following code queries the type of the local memory, making it possible to determine whether it is a dedicated scratchpad or mapped onto global memory (which may or may not be cached; this, too, can be queried), and the second call returns the size of each local memory buffer:
cl_device_local_mem_type type;
cl_ulong size;
// deviceId is the cl_device_id being queried.
err = clGetDeviceInfo( deviceId, CL_DEVICE_LOCAL_MEM_TYPE,
                       sizeof(cl_device_local_mem_type), &type, NULL );
err = clGetDeviceInfo( deviceId, CL_DEVICE_LOCAL_MEM_SIZE,
                       sizeof(cl_ulong), &size, NULL );
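The returned type is CL_LOCAL for a dedicated scratchpad and CL_GLOBAL when local memory is emulated in global memory. A host program might use this, for example, to select between kernel variants; the variant names below are purely illustrative:

// Choose a staging strategy based on the reported local memory type.
const char *kernelName = ( type == CL_LOCAL )
    ? "scanWithLocalStaging"    // device has a real scratchpad
    : "scanDirectFromGlobal";   // local memory is emulated in global memory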