4

–––––––––––––––––––––––

Multicore and Many-Core Computing

Ioannis E. Venetis

Multicore and future many-core processors (which are generally considered to be processors that include over 100 cores) have brought many changes to the way that programmers and users view computer systems. Concepts previously known to a relatively small and specialized community, like parallelism and synchronization, have now forcefully entered the everyday life of many more people. A common question is why we moved to this new technology. Uniprocessor systems have served us well for the last few decades. A vast number of tools have been developed over the years, which made programming and use of these systems easy and intuitive. Improvements in execution speed have made it possible to implement and use applications that were previously out of reach. In this chapter, we will try to understand why this well-established development model of processors and applications had to be changed. We will analyze the main obstacles that appeared in the development of uniprocessor systems and how they led to the development of multicore systems. Furthermore, we will discuss current trends in the architecture of multicore systems and the obstacles that have to be overcome on the way to many-core systems. Finally, we will cover the impact that changes in the hardware have on the development of software and how this leads to the introduction of programming models that are tailored toward multicore systems.

4.1   INTRODUCTION

The first ever microprocessor was introduced by Intel in 1971. The 4004 was a simple 4-bit, in-order microprocessor clocked at 740 kHz, primarily intended for calculators and other calculator-like products such as cash registers and banking teller equipment. Its 2300 transistors seem nowadays like a very small number, but at that time, it was a huge achievement that allowed a great innovation. For the first time, a complete CPU had been put on a single chip. About a year later, Intel introduced the 8008 microprocessor, which included 3500 transistors, and some implementations were clocked at 800 kHz. The additional transistors were used to extend the microprocessor to an 8-bit design. Continuing improvements in technology allowed Intel to introduce the 8080 in 1974, another 8-bit microprocessor that consisted of 6000 transistors and was clocked up to 2 MHz. In 1977, it introduced the 8085 with 6500 transistors and a clock rate of up to 6 MHz. The first 16-bit microprocessor from Intel was introduced in 1978. The 8086 included 29,000 transistors and ran at clock speeds of up to 10 MHz. This processor was the basis for the well-known series of “x86” processors from Intel. The list continues with many more processors, up to the latest single-core processor, the Core 2 Solo SU3500, with 410 million transistors running at 1.4 GHz.

Other lines of microprocessors followed a similar trend. The Alpha line, developed by Digital Equipment Corporation (DEC), started with the EV4 in 1992. It contained 1.68 million transistors and was initially clocked at 100 MHz. The last processor of this line was produced in 2004; it contained 130 million transistors and ran at 1.3 GHz. Similarly, the first processor of the MIPS architecture was introduced in 1985. The R2000 contained 110,000 transistors and was clocked at 8 MHz. The R16000, introduced in 2002, was the last single-core processor of the series. It contained 7.2 million transistors and was clocked up to 1 GHz.

From the above, it is obvious that the main improvement that led to faster uniprocessor systems over the past decades was the increase in the number of transistors that could be integrated into a single chip and the increase in clock speed. Software development companies started to rely on this fact. They were continuously identifying application areas for which older hardware was not powerful enough, but for which current hardware could just reach acceptable execution times. Hence, more demanding applications pushed the existing hardware components to their limits. Games, financial applications, databases, and simulations are a few examples of such applications. Computer architects tried to respond to the increasing demands of applications and improved their designs, building on top of the advancements in integrated circuit technology. Naturally, software developers improved or added more functionality to their existing applications but also further expanded their application areas to take advantage of the improved performance of the new systems. This led to a cycle of hardware and software improvements that served the computing industry well for decades.

Alongside the above developments, new fields in computer science emerged. Data structures to efficiently represent data that are required in different situations and improved algorithms that perform fewer operations to complete the same calculations are typical examples. In addition, improvements in tools to create applications were also impressive. Compilers, integrated development environments (IDEs), programming languages, debuggers, and profilers helped programmers to create software faster, with fewer errors, and with better user interfaces. Naturally, all these tools targeted serial computing systems since this was the dominating architecture. Although parallel systems were available, they were targeted toward specific applications with extremely large computations. Hence, their development, and improvements in their ease of programming, proceeded much more slowly. As we will see, however, this has now started to change.

Ironically, the two main pillars on which the hardware‒software cycle rested for so many years, the increase of the number of transistors in a chip and the increase in clock speed, are the reasons this model now seems to be fading out, giving way to multicore architectures. This has profound implications at many levels of the hardware and software industry. On one hand, a larger area of the hardware design space is currently being explored in order to pinpoint the best choices for the emerging multicore architectures. On the other hand, programming these new systems is not as straightforward as programming uniprocessor systems, and programmers have to familiarize themselves with new concepts. Despite these difficulties, it seems that currently, this is a one-way path that we have to walk in order to continue to improve our computing systems.

4.1.1   Implications of Moore's Law

In 1965, Intel cofounder Gordon E. Moore described a trend in the number of transistors that can be placed on an integrated circuit [1]. The observation, now known as Moore's law, states that the number of transistors that can fit on a given area of a chip doubles approximately every 2 years. The law has since been used in the semiconductor industry to guide long-term planning and to set targets for research and development. Figure 4.1 shows the number of transistors for different architectures over time.

In the early stages of integrated circuits and single chip processors, Moore's law has been an invaluable tool. It was now possible to predict the number of transistors that a future implementation of a processor could include and to start designing it far ahead, including features that would be possible to implement with the predicted numbers. Over the years, there have been many interesting ideas that could increase the performance of a processor but could not be implemented due to the limited number of available transistors. Moore's law and the advancements in technology behind it allowed these ideas to evolve and to be included one after the other in processors.

FIGURE 4.1. Number of transistors over time for four different architectures. The graph illustrates that the number of transistors increases more or less exponentially over time. The rate of increase is not dependent on the specific architecture of the processor.

An example of such an advancement was the use of a cache memory hierarchy, which dates back to the early 1960s. Due to the scarcity and high cost of semiconductor memories, computers of the time used several types of memory to fetch data and instructions into the fastest available memory type, before the processor actually needed the data. These techniques were further developed, and in 1968, the first documented use of a data cache appeared in the IBM System/360 Model 85 [2]. Later on, the first level of cache was integrated into the same chip with the processor, and soon, the second level followed. Moore's law allowed the expansion of caches into larger and larger sizes. This improved the execution speed of applications since it was possible for them to store more of their frequently used data in a fast memory. Recently, however, the size of the cache has not increased as rapidly as in the past. The reason for this change was the realization that every application has a maximum working set, that is, data that have to be accessed over a period of time in order to complete a calculation [3]. If all data of the working set reside in the cache, then there is no point in adding larger caches to a system, since applications will not be able to exploit them. As a result, even though larger caches could be integrated into current processor architectures due to increasing numbers of available transistors, the benefit would be minimal.

A second example in this category is the exploitation of instruction-level parallelism (ILP). A multitude of techniques have been developed to execute multiple instructions of a program at the same time. The first and probably best known is the introduction of the pipeline in processors. Instructions are divided into smaller basic operations. Each operation is handled at one stage of the pipeline, and the results are handed over to the next stage. When this happens, the previous stage can start executing the corresponding operation of the next instruction in the program. As a result, multiple instructions are executing concurrently, although they are at different stages of their execution. Naturally, the introduction of the pipeline required the addition of more logic into the chip, which was made possible by the increased number of transistors. Some processors reached more than 20 stages in their pipelines. Obviously, handling the logic of such long pipelines requires a large number of transistors. Eventually, however, it became clear that increasing the size of the pipeline did not contribute to better performance of the processor. In the same way that caches reached their point of diminishing returns, pipelines eventually reached theirs.
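To make the mechanics of pipelining more concrete, the following small C++ sketch simulates, cycle by cycle, how a short stream of instructions flows through a classic five-stage pipeline (fetch, decode, execute, memory access, write-back). The stage names and the four-instruction program are illustrative assumptions only; real pipelines also have to deal with hazards and stalls, which are omitted here.

#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::vector<std::string> stages = {"IF", "ID", "EX", "MEM", "WB"};
    const int num_stages = static_cast<int>(stages.size());
    const int num_instructions = 4;   // a short, hypothetical instruction stream

    // In cycle c, instruction i occupies stage (c - i), if that index is valid.
    for (int cycle = 0; cycle < num_instructions + num_stages - 1; ++cycle) {
        std::printf("cycle %d:", cycle + 1);
        for (int instr = 0; instr < num_instructions; ++instr) {
            int stage = cycle - instr;
            if (stage >= 0 && stage < num_stages)
                std::printf("  I%d in %s", instr + 1, stages[stage].c_str());
        }
        std::printf("\n");
    }
    return 0;
}

The output shows that from the fifth cycle onward one instruction completes per cycle, even though each individual instruction still needs five cycles; this overlap is where the speedup of pipelining comes from.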

Branch prediction also fits in the same category of exposing more ILP. The idea behind branch prediction is to record the history of branches and to determine when they are taken and when not. Using this statistical analysis, processors try to predict whether the next branch will be taken or not. In order to keep the history of already executed branches, branch-prediction buffers are required. A considerable number of bits has to be used for this purpose. Again, the increase in the number of transistors made this technique applicable. But again, adding more and more entries to the buffer had, after some point, no further effect on the performance of applications.
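As a hedged illustration of how such a prediction buffer works (and not the predictor of any particular processor), the following C++ fragment implements the classic 2-bit saturating counter scheme: each entry of a small table counts how often the corresponding branch has recently been taken, and the upper half of the counter range means "predict taken." The table size of 1024 entries is an arbitrary assumption.

#include <array>
#include <cstddef>
#include <cstdint>

// Minimal 2-bit saturating counter branch predictor (illustrative only).
// Counter values 0 and 1 predict "not taken"; values 2 and 3 predict "taken".
class BranchPredictor {
public:
    bool predict(std::uint32_t branch_address) const {
        return table[index(branch_address)] >= 2;
    }

    void update(std::uint32_t branch_address, bool taken) {
        std::uint8_t &counter = table[index(branch_address)];
        if (taken && counter < 3) ++counter;    // strengthen the "taken" prediction
        if (!taken && counter > 0) --counter;   // strengthen the "not taken" prediction
    }

private:
    static constexpr std::size_t kEntries = 1024;   // assumed buffer size
    std::array<std::uint8_t, kEntries> table{};     // all counters start at "not taken"

    static std::size_t index(std::uint32_t address) { return address % kEntries; }
};

Enlarging kEntries reduces the chance that two unrelated branches share a counter, which is exactly the kind of improvement that, beyond some point, no longer pays off in application performance.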

Another important idea that is worth mentioning is multiple issue of instructions. Instead of reading instructions from memory one after another and feeding them into the pipeline of the processor for execution, it is possible to fetch several instructions at once. If these instructions are of different kinds (e.g., an instruction that operates on integer numbers, another one that operates on floating-point numbers, and another one that accesses memory), then these could potentially be executed simultaneously. In order to achieve this, there have to be separate functional units that can execute all of these instructions at the same time. Moreover, since the mix of instructions might not always be the same as in the example, there should be several replicas of the same functional units to handle this case. Not surprisingly, all this adds again to the logic required for the processor to operate correctly and to output the results in the order that is expected by the program. As you might already have guessed, there is again a limit to this method. It is not by chance that most processors issue two, three, or four instructions at most at any point in time. There are a few exceptions, like the Intel Itanium processor, which issues up to six instructions, but this is because it belongs to another category of processors (the explicitly parallel, VLIW-style designs).

The list could continue with more techniques that improve the performance of processors [4, 5]. However, the discussion of these techniques would always end with the same observation: adding transistors to enhance each one of them leads to a point where more transistors do not improve performance anymore. But this observation creates a question that is essential for the computer industry. Moore's law still holds and more transistors will be available, at least for the near future. What should we do with these transistors, if all the techniques that we have incorporated into our processors have reached their limits? From the discussion up to this point, the answer is probably already in the mind of the reader. Since we have twice as many transistors every 2 years, why do we not simply put two identical processors on the same chip? This would double the computing power of a processor and would allow applications to run faster. This is exactly the path that most, if not all, hardware vendors currently follow, and until some radical change happens, it is foreseen that more and more cores will be added to processors.

4.1.2   Implications of Frequency Scaling

The use of a signal that synchronizes the execution of operations across all functional units within a processor, commonly known as the clock, prevailed as the method to operate processors. Synchronizing all functional units makes their design much easier and also simplifies the interaction among them. Now every unit can propagate its results to the next unit only at specific points in time. However, since all units are bound through the clock in order to propagate their results, this has to be done at the speed of the slowest unit. As a result, the slowest unit determines the clock speed, which in turn determines the overall speed of a processor. It comes as no surprise that a significant effort has been invested in increasing the speed of the clock. However, this is not as straightforward as it sounds. First, the clock signal requires time to travel from one point to another and requires more time to get to the farthest parts of the chip. Although the differences in the time at which the signal arrives at different parts of the chip are small, they have important effects. Second, after the signal arrives, transistors require some time to change their state. In the meantime, no further clock signal should arrive. Despite these problems, improvements were steadily made and the clock frequency increased by orders of magnitude compared to the first processors.

Recently, the increase in clock speed has been substantially slower with each new generation of processors. One might think that we have hit some physical limit and that it is no longer possible to go faster. This is not the case, however. Several companies assure us that they could easily build processors clocked at 10 GHz. But why are they not producing them? The answer lies in the way that transistors operate. Although simplified, the main idea is as follows. Whenever a transistor has to change its state, some power has to be consumed. If we increase the rate at which these changes happen, that is, if we increase the clock rate, then we must drive the transistors more forcefully in order to make electrons move faster, that is, we need a higher voltage. The power required to change the state of a transistor depends on the square of the voltage and on the frequency. As a result, power consumption increases drastically as we move toward higher frequencies, making processors inefficient power-wise. Moreover, the power consumed is transformed into heat, which must be dissipated to the environment; otherwise, processors would melt from their own heat. Hence, cooling systems have to be attached to processors, and these also require power to operate.
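The relationship described above is usually summarized with the simplified dynamic power model P ≈ C × V² × f, where C is the switched capacitance of the circuit, V the supply voltage, and f the clock frequency. Because raising f generally also requires raising V, power grows much faster than linearly with frequency. This simple model ignores static (leakage) power, which adds further to the total.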

Let us have a look at the Teraflops Research Chip, a research prototype multicore processor from Intel. It contains 80 cores and is actually very power efficient. However, it is interesting to see how power consumption changes when it runs at different frequencies. At 3.16 GHz, the processor requires 0.95 V to operate, consumes 62 W, and achieves a performance of 1.01 Tflops. At 5.7 GHz, it requires 1.35 V, consumes 265 W, and achieves 1.81 Tflops. As we can see, a 1.8 times increase in the clock speed results in a more than four times increase in power consumption. Let us now assume that we are given the low-frequency version of the processor and we are asked to propose a method to double its performance. One way would be to make a few more improvements and actually double the frequency at which the processor operates. From the teraflops achieved at each frequency, we would expect that this would also double the performance. Although this seems to hold in our case, be aware that this is not a general conclusion, as many factors prevent performance from doubling when the clock speed doubles. As we have seen, doubling the frequency would now more than quadruple power consumption. If we were asked to double the performance again and doubled the clock speed once more, this would lead to exponential growth in power consumption over generations of processors of this architecture. This would make our processors extremely unattractive. But is there another way to double performance? Assume that, instead of doubling the clock speed, we take advantage of Moore's law and incorporate double the cores into the same chip. Now, instead of quadrupling power consumption, we would only double it. As a result, we would have less severe problems cooling our system. However, doubling the number of cores now requires applications to be written in a way that exploits all available cores in order to double their performance. The difficulty has now moved to the programmers of applications.
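Using the simplified model P ≈ C × V² × f, the reported numbers are roughly consistent: moving from 3.16 GHz at 0.95 V to 5.7 GHz at 1.35 V should scale dynamic power by about (1.35/0.95)² × (5.7/3.16) ≈ 2.0 × 1.8 ≈ 3.6, close to the measured factor of 265 W/62 W ≈ 4.3, with leakage and other effects accounting for the remainder. Doubling the number of cores instead keeps V and f unchanged and, to a first approximation, at most doubles C and therefore power.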

4.2   ARCHITECTURAL OPTIONS FOR MULTICORE SYSTEMS

Multicore systems are a relatively new technology. As with every new technology, it is not certain what the best options and design decisions are to create the fastest and most easily programmable processor of this kind. As a result, researchers and development companies are proposing and exploring a number of different options for several architectural aspects of multicore processors. In this section, we will discuss the architectural features of processors that are affected most in this respect. But first, we will try to clarify a common misunderstanding, which is the difference between a multicore processor and a processor that supports simultaneous multithreading (SMT). The misunderstanding stems from the fact that both technologies advertise that they support execution of multiple threads simultaneously. As we will see, not only are these two technologies not the same but they also differ in a fundamental way.

4.2.1   SMT and Multicore

As we have seen earlier, issuing multiple instructions and dynamically scheduling them for execution is currently a typical way to improve performance on a uniprocessor system. In this case, however, the exploitation of multiple functional units that are possibly available in the processor is subject to the dependencies among the instructions that have been issued. Furthermore, a cache miss might completely stall the execution of the application until the required data or instruction has been fetched from the main memory. The left part of Figure 4.2 illustrates the exploitation of resources for such a case. As can be seen, several issue slots are unused during each clock cycle.

In order to overcome the underutilization of the available functional units of a single processor, SMT [6] has been introduced. Under this scheme, the processor is able to issue instructions from multiple applications at the same time. This reduces the number of unused issue slots and improves the throughput of the processor. The important point is that the instructions issued from different applications share in this case the functional units of a single processor. Obviously, there is no point in issuing instructions from an arbitrary number of applications, since the number of available functional units is limited.

FIGURE 4.2. Issuing instructions in a typical superscalar architecture and an architecture that supports simultaneous multithreading. The vertical dimension represents a sequence of clock cycles. The horizontal dimension represents the instruction issue capability in each clock cycle. A white box indicates that the corresponding issue slot is unused in a clock cycle. The shades of gray correspond to four different threads in the simultaneous multithreading processor.

In contrast, in a multicore processor, each core is in itself a complete processor and does not share its functional units with the rest of the cores. As a result, each core might be issuing and executing instructions from a different application without interference from other cores (with some exceptions, as we will see next). Obviously, it is possible to apply SMT on each core of a multicore system, which is actually what most vendors of multicore processors do. The reader might be familiar with the Hyper-Threading Technology of Intel processors, which is an example of an implementation of SMT. We will see more examples later on in this chapter.

4.2.2   Shared Cache or Not?

Typically, a single-core processor contains two levels of cache memory, which are both smaller in size but faster to access than main memory. The first level of the cache, known as level 1 or L1 cache, is small but extremely fast. The second level, known as level 2 or L2 cache, is larger in size but slower than L1. The operation of the cache hierarchy is quite straightforward. Data required to perform calculations are first looked up in the L1 cache. If they are not found in L1, they are looked up in L2. If they are not found there either, the data are fetched from the main memory. In the last case, the data are also transferred into the cache memory, so that they can be accessed faster the next time they are required. Depending on the design of the cache, data residing in adjacent memory locations will also be transferred into the cache. This is done in order to exploit locality of reference, a property of most programs: after accessing a specific location in memory, it is highly likely that they will soon require data that reside in adjacent memory locations. A typical example is the calculation of values in an array. Using a loop, programs calculate elements of the array one after the other, which means that these programs access contiguous locations in memory. Having the next elements of the array in a level of the cache allows faster access and hence fewer stalls in the processor.
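The effect of spatial locality is easy to observe in practice. In the following C++ sketch, both loops perform exactly the same additions over the same matrix, but the row-wise loop walks through consecutive memory locations and therefore reuses the data that each cache line brings in, while the column-wise loop jumps across the array and wastes most of every line it fetches. The matrix size is an arbitrary assumption, chosen only to exceed typical cache sizes.

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;                 // assumed size, larger than typical caches
    std::vector<double> a(n * n, 1.0);          // matrix stored row by row (row-major)

    auto time_sum = [&](bool row_wise) {
        double sum = 0.0;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                sum += row_wise ? a[i * n + j]  // consecutive addresses: good locality
                                : a[j * n + i]; // stride of n doubles: poor locality
        auto stop = std::chrono::steady_clock::now();
        std::printf("%s traversal: %.3f s (sum = %.0f)\n",
                    row_wise ? "row-wise" : "column-wise",
                    std::chrono::duration<double>(stop - start).count(), sum);
    };

    time_sum(true);
    time_sum(false);
    return 0;
}

On most machines the column-wise traversal is several times slower, even though both loops execute exactly the same number of arithmetic operations.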

The advent of multicore processors generated an important question with respect to cache memory. Should each core have its own hierarchy of the cache, or should cores share some or all levels of the cache? A more radical question is whether there should be any cache at all. We will discuss this option in more detail in Section 4.5. The answers to these questions are not straightforward since a number of parameters have to be taken into account. Assume, for example, that an application has been parallelized and is executing alone on a multicore system using all available cores. Some could argue that if each core has its own L1 and L2 data cache, then the portion of the application that is running on a core will exploit optimally its cache and achieve the highest possible performance since it will not interfere with other cores and no data it still needs will accidentally be evicted from the cache by another core. On the other hand, if some level of the cache is shared, data (and its adjacent locations) brought into this level by one core could potentially be used by another core that needs that data. Furthermore, the shared level of the cache is now larger in size since it is the combination of the caches of all cores. As a result, if a core requires at some point in time a small amount of cache, the other cores can use a larger portion and can achieve better performance. This could also be the case when multiple, diverse applications are executing on different cores. It is highly possible that their requirements differ significantly and sharing a level of the cache could benefit some of them. But again, interference among such applications could overall hurt the performance of most of them.

Obviously, there is no easy way to determine whether and which levels of cache should be local to a core and which ones should be shared. It depends largely on the applications that will run on the system. Therefore, different processor architectures provide different solutions, starting from the number of levels in the cache, the size of each level, and whether each level is shared or not among cores. It would be ideal to know which data in the cache correspond to which application. In this case, a level of the cache that is shared in hardware could be handled dynamically by changing the size that is assigned to each core, depending on the current needs of each application. Some proposals for implementing this idea in hardware already exist [7–10]. However, no architecture implements these proposals yet. Therefore, approaches that handle the situation at the software level have also been proposed [11]. As we examine contemporary multicore architectures later in this chapter, we will highlight the different parameters of the cache hierarchy.

4.2.3   Identical Cores or Not?

Applications from one area of science obviously do not behave in the same manner as applications from another area. For example, image processing applications heavily depend on specific algorithms like the fast Fourier transform (FFT), whereas simulations of physical processes rely on algorithms that have completely different requirements. Even within a given application, some computations might have different requirements than others. General-purpose processors try to strike a balance among all these different requirements and to execute all types of computations with acceptable performance. Nowadays, special-purpose hardware accelerators are available for a large range of domains. Their purpose is to execute specific computations faster and thus to reduce the overall execution time of an application. Digital signal processing (DSP) units are designed to perform frequently used algorithms like the FFT faster than general-purpose processors. Physics simulation engines implemented in hardware are also available. Such accelerators are usually connected to the processor through the supported interconnection network of the system. If a program needs to calculate one of the functions that the accelerator supports, it off-loads the calculation to it. When the latter finishes, the result is returned to the processor. A disadvantage of this method is that applications have to be written so that they explicitly use the accelerator.

Since the number of cores in a processor is anticipated to increase drastically in the next few years, the idea to incorporate different types of special-purpose accelerators into a single chip has emerged. Instead of integrating an ever-increasing number of the same core on a chip, it would be possible to use some of the available space for a few hardware accelerators. The number of general-purpose cores in the chip would be still large enough to handle the majority of applications, but for the applications that require it, the appropriate accelerator would be available. DSP units, physics emulation engines, cryptography hardware, and others could be included in a small percentage of the area of a chip. Even the graphics unit could be integrated into the same chip. Currently, exchanging data to process it on the graphics unit requires a good fraction of the total execution time of the calculation since the data have to be sent over the interconnection network to the graphics unit and back. Integrating the graphics unit into the same chip with all other cores could eliminate this time and provide faster graphics.

4.2.4   On-Chip Interconnection Networks

The role of the interconnection network is to connect the functional units of a parallel system and allow them to communicate and exchange information. For example, all cores of a processor must be able to communicate with the main memory to store and retrieve data. The architecture of the interconnection network has always been a significant decision in the design of a parallel system. Several factors play an important role and define its properties. For example, network latency is defined as the time that elapses from the moment the sender initiates a communication until the moment the receiver receives the first bit of information. Obviously, low latency is a desired property of a network. Bandwidth refers to the rate at which data can be sent over the network, so a high value of this property is also very important. Other factors that are taken into account are the behavior under contention, scalability, the speed of the elements that are connected, and, of course, the cost of implementing the network.
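A common first-order way to combine these two properties is to model the time needed to transfer a message of n bytes as T(n) ≈ L + n/B, where L is the latency and B the bandwidth of the network. Short messages are dominated by L, while long messages are dominated by B, which is why both properties matter when connecting cores, caches, and memory.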

With respect to multicore architectures, three types of connections attract most of the attention: how a core is connected to the cache or its local storage (see Section 4.5), how a core is connected to the main memory, and how a core is connected with other cores. A crossbar switch is a possible solution to connect cores to the cache. The Oracle SPARC T3 processor uses such a network for this purpose. The Sony/IBM/Toshiba Cell BE processor (see Section 4.3.5) uses a more specialized network. A ring, the element interconnect bus (EIB), connects all cores. Through this ring, data can be sent from one core to another. In order to connect cores to the main memory, typically a memory controller is provided inside the chip, which is responsible for receiving requests for communication with the memory and serving them. The controller can be connected with the memory modules in several ways. The Intel QuickPath Interconnect (QPI) provides point-to-point links, which are also used to connect the internal cores, the I/O hub, and other processors in the system. The Oracle SPARC T3 uses serial links from the memory controller to the memory modules. In Cell, requests to the main memory have to move through the EIB to the memory interface controller (MIC), and from there they are forwarded to the memory. There is still a great deal of diversity in the field of interconnection networks for multicore systems, with many designs tailored to a specific architecture. As previously mentioned, this is a phenomenon to be expected in technologies that have not completely matured yet.

4.3   MULTICORE ARCHITECTURE EXAMPLES

In the following paragraphs, we will mention a few contemporary multicore processors and their main characteristics. It is worth mentioning that, with the exception of the Sony/IBM/Toshiba Cell BE processor, the cores contained in these processors are based on typical single-core processors that are simply put together on a chip. The main difference is that now cores share some level of the cache hierarchy. As we will see later (Section 4.5), a new trend is to use simpler and smaller cores, allowing a larger number of them to be integrated on each chip. However, such changes are accompanied by further architectural changes that have other implications.

4.3.1   The Intel Core i7 Processor

The Core i7 processor from Intel is the first processor we will examine. It is composed of four cores, each one supporting two-way SMT, or Hyper-Threading as it is called by Intel. As a result, a total of eight threads can execute concurrently on this processor. Many of the design decisions for this architecture were driven by the desire to reduce power consumption while retaining more or less the same level of performance. A notable example is the replacement of the 2-bit branch predictor used in previous Intel processors with a simpler 1-bit predictor. Halving the number of bits used in this functional unit also halved its power consumption. Of course, the use of SMT itself is also a factor that reduces power consumption since the functional units of the processor are better exploited. This is just one example of how significant power consumption has become in modern processors.

The Core i7 includes a three-level cache hierarchy. Each core of the processor has a local L1 cache of 32 kB and a local L2 cache of 256 kB. The L3 cache is shared among cores and is 8 MB in size. Each core can fetch up to four instructions per clock cycle. In order to interconnect all functional units of the processor, the crossbar switch used in previous architectures has been replaced with a ring bus. The bus provides faster transfer rates but also reduced power consumption.

Most desktop systems equipped with multicore processors from Intel take advantage of the additional cores by running multiple, different applications on separate cores. Only a small fraction of the applications used every day is parallelized to take advantage of the multiple cores and SMT. As more cores are integrated into chips, this is expected to change. A typical user does not concurrently execute enough applications to make use of the number of cores that will be available in the future. Applications will have to be parallelized in order to take full advantage of future processors.

4.3.2   The AMD Phenom II Processor

The AMD Phenom II processor consists of six cores. In contrast to most other architectures, it does not include support for SMT. Like the Intel Core i7, it includes a three-level cache memory hierarchy. Each core of the processor has a local L1 cache of 64 kB and a local L2 cache of 512 kB. The larger L1 and L2 caches provide each core with more space to keep its working set and cause fewer stalls. This enables the functional units to be used more efficiently and compensates for the absence of SMT. The L3 cache is shared among cores and is 6 MB in size. Another example of the importance of power consumption in modern processors is the fact that this processor includes three distinct technologies to consume less power. The PowerNow technology allows for enhanced power management features, which automatically adjust performance states and features based on processor performance requirements. CoolCore technology reduces processor energy consumption by turning off unused parts of the processor. Finally, Dual Dynamic Power Management enables more granular power management capabilities to reduce processor energy consumption.

4.3.3   The IBM POWER7 Processor

The IBM POWER7 processor contains eight cores, each one of them supporting four-way SMT. This leads to a total of 32 threads that can be concurrently active on this processor. Performance of each core is further improved through an aggressive and deeper out-of-order execution design, better branch prediction, and reduced latencies to various resources, such as the caches and translation look-aside buffers. During each cycle, up to eight instructions can be fetched from memory. There are a total of 12 execution units within each core: 2 fixed-point units, 2 load-store units, 4 double-precision floating-point unit pipelines, 1 vector unit, 1 branch execution unit, 1 condition register logic unit, and 1 decimal floating-point unit pipeline. This is an example where several functional units of the same type reside in each core, allowing multiple threads running on that core to execute instructions of the same type simultaneously.

The POWER7 has a three-level cache hierarchy, with the first two levels being local to each core. More specifically, the L1 cache is 32 kB and the L2 cache is 256 kB, both the same size as in the Intel i7. Finally, the shared L3 cache is 32 MB, significantly larger than in all other architectures. The large size and the fact that the L3 cache is shared among cores have a great, positive impact on the performance of each thread. The L3 cache is divided into eight 4-MB local regions, each of which serves a dual role: its primary role is to act as a victim cache for its associated L2 cache and, secondarily, as a victim cache for the other seven 4-MB regions. A set of heuristics achieves a balance between these roles by monitoring and managing capacity and bandwidth allocation according to workload dynamics.

4.3.4   The Oracle SPARC T3 Processor

The Oracle SPARC T3 processor is composed of 16 cores. Each core supports eight-way SMT, which results in a total of 128 threads that can run concurrently on this processor. The T3 is an interesting example of a multicore architecture as it focuses strongly on exploiting thread-level parallelism (TLP) rather than ILP. Combining multiple cores with SMT allows this processor to significantly increase throughput. Only two instructions are issued in each cycle. This is the same as in its immediate predecessor, the UltraSPARC T2, but the UltraSPARC T1, the first processor of the line, was a single-issue processor. The two instructions issued per cycle are fewer than what most contemporary multicore processors issue, like the Intel i7, which issues four instructions per cycle.

Each core consists of two relatively short pipelines, an 8-stage integer pipeline and a 12-stage floating-point pipeline. The T3 uses fine-grained multithreading, switching to a new thread on each clock cycle. As a result, the processor's execution pipeline remains active doing useful work, even as memory operations for stalled threads continue in parallel. Threads that are idle because they are waiting due to a pipeline delay or cache miss are bypassed during scheduling. Hence, a core is idle only when all eight threads are idle or stalled. Each core of the T3 has access to its local 16-kB instruction cache and its own 8-kB L1 data cache. All cores are connected through a crossbar switch to the 6 MB of shared L2 data cache, which is composed of 16 banks. A cache coherency protocol ensures that the latest value of data is found in the memory hierarchy. Interestingly, the T3 includes six cache coherency links that allow it to connect to three more processors on the same system without any additional logic and to retain the cache coherence protocol among all processors.

4.3.5   The Sony/IBM/Toshiba Cell BE Processor

The Cell BE processor was designed cooperatively by Sony, IBM, and Toshiba. In contrast to the previous processors, which simply integrated multiple copies of existing or slightly modified processors on a single chip, this processor is based on a radically different design. First, it is composed of two different types of cores. There is a single copy of the first type of core, which is known as the Power Processor Element (PPE). The PPE is based on the 64-bit PowerPC general-purpose processor. It supports two-way SMT and includes 32 kB of L1 cache and 512 kB of L2 cache. The PPE is used to execute code but, most importantly, to control and distribute work and data to the eight Synergistic Processing Elements (SPEs). An SPE is a self-contained vector processor clocked at 4 GHz, which can execute code independently. An SPE contains 128 registers, each one 128 bits wide. It also includes four single-precision floating-point units and four integer units. A distinct feature of SPEs is the fact that they do not include any cache memory. Instead, they are supported by a small 256-kB local store, which is a fast memory that is local to each SPE. This is the only memory that an SPE can directly access. In contrast to the cache, which is manipulated in hardware and is transparent to the programmer, the local store is handled by the application itself. The programmer has to decide when the PPE will send data to the local store. Furthermore, the code to be executed by an SPE must also be transferred to the local store. Although this makes programming more complicated (or unconventional, if you like), it allows simplification of the design of the memory subsystem and thus increases its performance. Another important difference is that the PPE and the SPEs are in-order processors without any out-of-order capabilities. This means that the optimizations performed by the compiler are extremely important to achieve good performance. Thankfully, the 128 registers of each SPE provide the compiler with the required space to apply optimizations like loop unrolling and overcome the need for out-of-order execution.
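To give a flavor of what a software-managed local store implies for the programmer, the following C++ sketch shows the usual double-buffering pattern on an SPE-like core: while the computation processes one buffer, the next block of data is already on its way into the other buffer. The functions dma_get() and dma_wait() are hypothetical stand-ins for the platform's real DMA primitives (here they are emulated with a plain memory copy), and the buffer and block sizes are arbitrary assumptions.

#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for DMA primitives of a software-managed local store
// (not the real Cell SDK API). dma_get() starts a copy from main memory into the
// local store; dma_wait() waits for the transfer with the given tag to complete.
// In this emulation the copy is synchronous and dma_wait() is a no-op.
void dma_get(float *local_dst, const float *main_mem_src, std::size_t bytes, int /*tag*/) {
    std::memcpy(local_dst, main_mem_src, bytes);
}
void dma_wait(int /*tag*/) {}

void process(float *block, std::size_t count) {      // placeholder computation
    for (std::size_t i = 0; i < count; ++i) block[i] *= 2.0f;
}

// Double buffering: overlap the computation on one buffer with the transfer of the next.
// Assumes total is a multiple of block and block <= 4096 elements.
void process_stream(const float *main_mem, std::size_t total, std::size_t block) {
    static float buffer[2][4096];                     // two small buffers in the local store
    int current = 0;

    dma_get(buffer[current], main_mem, block * sizeof(float), current);
    for (std::size_t offset = 0; offset < total; offset += block) {
        int next = 1 - current;
        if (offset + block < total)                   // start fetching the following block
            dma_get(buffer[next], main_mem + offset + block, block * sizeof(float), next);

        dma_wait(current);                            // make sure the current block is here
        process(buffer[current], block);              // compute while the next transfer runs
        current = next;
    }
}

On the real hardware the transfer proceeds asynchronously, so the cost of moving data between main memory and the local store is largely hidden behind the computation.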

Although the Cell BE is quite a different design compared to the processors that most of us are familiar with, it seems that most future multicore and many-core architectures will incorporate characteristics of the Cell BE processor. In particular, the use of a local store instead of a hardware-managed cache is an alternative that is being examined for several future designs.

4.4   PROGRAMMING MULTICORE ARCHITECTURES

In this section, we will briefly analyze programming models that can be used to program applications for multicore systems. The first three are models that have been extensively used to program more traditional shared-memory and distributed-memory architectures. Naturally, they can be used to program multicore systems too. The rest of the programming models are more or less inspired by the advent of multicore systems and are designed to better exploit them. However, they also support other types of parallel architectures. More information about these models can be found in the references that accompany them. The reader is also encouraged to look for other similar projects like CellSs, SMPSs, Fresh Breeze [12], and Concurrent Collections [13].

4.4.1   Message-Passing Interface (MPI)

MPI [14] is probably one of the most successful programming models used in parallel computing. It has been designed for distributed-memory architectures, which are composed of computing nodes connected through a network. A process is employed on every node to execute the part of the computation that corresponds to it. If a node consists of multiple processors or cores, more processes can be spawned to take advantage of the additional computing power. If data residing in one process are required by another process, a message has to be sent from the first process to the second one.

MPI can also be used on multicore systems, which are typically shared-memory systems, by spawning multiple processes that will run on different cores. In this case, performance might not be as high as with a programming model that targets shared-memory systems. Normally, MPI uses the communication primitives of the network to exchange data, even when processes reside on the same node. Naturally, this introduces some overhead. Some implementations of MPI mitigate this shortcoming by exchanging data through shared memory. Again, however, communication has to be performed between processes, which do not share the same address space. Overall, although it is possible to use MPI to program multicore systems, it is not something that is seen very frequently.
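As a minimal illustration of the message-passing style, the following program (using the standard C bindings of MPI, which are also usable from C++) is started as several independent processes; the process with rank 0 explicitly sends a value to rank 1, which explicitly receives it, and no memory is shared between them.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // identifier of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

    if (rank == 0 && size > 1) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);               // to rank 1, tag 0
    } else if (rank == 1) {
        int value = 0;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

Launched, for example, with mpirun -np 4, the same binary runs as four separate processes, one or more per core or node, each with its own address space.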

4.4.2   Threads

The most typical way to program shared-memory architectures, including current multicore systems, is to employ some threading library. Threads are used as execution vehicles that run on top of the available processors or cores of the system. Computation is partitioned among threads using a multitude of scheduling techniques, ranging from the simple static scheduling to more advanced techniques like dynamic creation of tasks and distribution to threads through a queueing system.

POSIX Threads [15] are the best-known representative of this class of programming models. Their behavior is completely described in an international standard and implementations exist for almost all available architectures. Many applications and benchmarks are implemented using POSIX Threads. However, there is a multitude of other available threading libraries. This diversity stems from the fact that these libraries focus more on other aspects, for example, supporting a specific higher-level programming model or a dataflow-like execution of threads.

In contrast to MPI, where synchronization among processes is implicit through the exchange of messages, special primitives of the threading library in use have to be employed to ensure correct synchronization among threads. Ensuring that data modified by multiple threads will always contain consistent values is typically a task of the programmer. Verifying that correct synchronization is achieved in all possible execution scenarios is already a difficult task, even more so when high-performance implementations of algorithms are employed to reduce the overhead of synchronization. As the number of cores on processors rises, the overhead of synchronization will become an even more important factor in the development of applications. Therefore, improved basic synchronization primitives have started to be incorporated into hardware. Barriers [16], context switching, and faster implementations of mutual exclusion in hardware [6] are some examples.
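The following POSIX Threads fragment illustrates the kind of explicit synchronization the previous paragraph refers to: two threads increment a shared counter, and without the mutex the final value would be unpredictable because the updates of the two threads would interleave.

#include <pthread.h>
#include <cstdio>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *) {
    for (int i = 0; i < 1000000; ++i) {
        pthread_mutex_lock(&lock);      // only one thread at a time may update the counter
        ++counter;
        pthread_mutex_unlock(&lock);
    }
    return nullptr;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, worker, nullptr);
    pthread_create(&t2, nullptr, worker, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
    std::printf("counter = %ld\n", counter);   // always 2000000 with the mutex in place
    return 0;
}

Removing the two mutex calls produces a data race: the program still compiles and runs, but the result varies from run to run, which is exactly why reasoning about all possible execution scenarios is so difficult.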

An important distinction between threading libraries is whether they implement kernel-level or user-level threads. Kernel-level threads are primitives provided and scheduled by the operating system. For every thread being created in an application, the threading library creates a kernel-level thread and assigns to it the computations that have to be performed. This is also known as a 1-to-1 model. Execution of these threads is the responsibility of the operating system. An implementation using kernel-level threads is quite straightforward. However, this type of thread is quite heavyweight, as entering and returning from the kernel of the operating system is required for several operations of the library. In contrast, user-level threads are implemented completely in user space, which makes them very lightweight. The application creates several threads that run on top of the single kernel-level thread of the application, a model known as 1-to-N. This, however, has a significant drawback. Since the operating system is not aware of the user-level threads, it cannot schedule them to all available processors and cores. In order to overcome this limitation the M-to-N model is typically employed. This model is a combination of the previous two models. It is the most difficult to implement but provides the best way to exploit parallelism on multiprocessor systems. Typically, a number of kernel-level threads are created, which equal the number of processors that the application requests. The user-level runtime system then creates user-level threads, which are scheduled from the user-level scheduler on top of the kernel-level threads. This allows the creation of hundreds or even thousands of threads with a relatively low overhead. The question that follows is which of these models is best for multicore architectures. The answer is not obvious and the fact that more cores per processor are expected in the future makes the selection even more difficult. It seems, however, that the M-to-N model is the one preferred by most implementers of threading runtime systems.

Although the following programming models do not directly expose threads to the programmer, they use threads internally to represent and schedule parallel tasks to the available processors and cores. Hence, the features of the underlying threading library are important for the performance that can be achieved and for how easily a programming model can be mapped to the primitives of the threading library. As the number of cores available on processors is expected to increase significantly in the next few years, choices made at this level are certain to have a great impact.

4.4.3   Directive Based (OpenMP)

It soon became obvious that manually creating threads and distributing work among them is a procedure that follows common patterns in most applications, especially for loop-level parallelism. In order to simplify this procedure, directive-based programming models were introduced. In these models, the programmer specifies through special directives where and how parallelism should be created. The compiler is then in charge of creating the parallelism and distributing work among threads according to the directives. Although a number of such models have been proposed, only a few of them were successful. The main reason for this was the fact that most of them were specific to a particular system. As a result, source code written in one of these models could not be easily ported to another system. This increased the cost of developing a parallel application and made these models unattractive.

OpenMP [17] is a directive-based model that was supported by a large number of organizations and companies from its design stages. The large number of implementations on a diverse range of systems has made it quite successful. One of its main features, allowing parallelization of an application in stages, made it attractive for large codes where parallelization using other models is much more difficult. Since it targets shared-memory systems, its use in current multicore architectures is straightforward. It is also quite common to combine the use of OpenMP and MPI, when the architecture under consideration is composed of multiple nodes with distributed memory, but each node is a smaller shared-memory system. In such cases, OpenMP is used to parallelize code inside a node and MPI to distribute work among nodes.
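As a small example of the directive-based style, the loop below is parallelized with a single OpenMP directive. The compiler and the runtime system create the threads, split the iterations among them, and the reduction clause combines the per-thread partial sums, so no explicit thread management appears in the code.

#include <cstdio>

int main() {
    const int n = 1000000;
    double sum = 0.0;

    // The directive asks the compiler to execute the loop in parallel and to
    // combine the per-thread partial sums into the shared variable "sum".
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (i + 1.0);

    std::printf("sum = %f\n", sum);
    return 0;
}

Compiled without OpenMP support, the directive is simply ignored and the program runs serially, which is one reason why incremental, stage-by-stage parallelization is practical with this model.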

4.4.4   Intel Array Building Blocks

Intel Array Building Blocks provides a data-parallel programming environment designed to effectively utilize the power of existing and upcoming throughput-oriented features on modern processor architectures. This includes current and future multicore and many-core systems. The programming environment provides a C++ template library that allows programmers to continue writing code using the standard C++ language with some extensions. The programming model was created by merging the data-parallel programming models proposed by RapidMind [18], a company acquired by Intel before the merger, and Ct [19], a model also proposed by Intel.

The programming model adds new types and operators to the C++ language through header files and a runtime library, using only standard C++ capabilities. This separation allows code using these features to coexist with C++ code and requires only specific portions of an application to be rewritten to utilize parallel features of the underlying hardware system. These new features give the developer an expressive and safer parallel programming environment by isolating its data objects from the rest of the application and by restricting conflicting concurrent accesses to shared objects through structured primitives. This eliminates the need for low-level constructs like locks and avoids data races.

4.4.5   Intel Threading Building Blocks

Intel Threading Building Blocks is another C++ template-based library. In contrast to Intel Array Building Blocks, it focuses on loop-level parallelism. Instead of explicitly creating threads to distribute work, the programmer defines tasks that are then scheduled and executed automatically on top of the threads that the library creates. The components of Intel Threading Building Blocks include generic parallel algorithms, concurrent containers, low-level synchronization primitives, and a task scheduler.

Programmers using Intel Threading Building Blocks can parallelize the execution of loop iterations by treating chunks of iterations as tasks and allowing the Intel Threading Building Blocks task scheduler to determine the task sizes, the number of threads to use, the assignment of tasks to those threads, and how those threads are scheduled for execution. The task scheduler gives precedence to tasks that have most recently run on a core of the processor. The purpose of this strategy is to better exploit the cache: running a task on such a core increases the likelihood of the required data being found in the core's cache. The task scheduler utilizes a task-stealing mechanism to balance the load during execution. If a thread has no tasks assigned to it, but another thread has more tasks, then the first thread will steal some of them to help the progress of execution.
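A hedged sketch of this task-based style is shown below: the programmer only describes the iteration range and the work to be done on a chunk of it, and the Threading Building Blocks scheduler decides how to split the range into tasks, how many threads to use, and where to run them. Header names vary slightly between library versions; the classic tbb/parallel_for.h and tbb/blocked_range.h headers are assumed here.

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>
#include <vector>

int main() {
    std::vector<float> data(1000000, 1.0f);

    // The range of iterations is split into chunks (tasks) by the scheduler,
    // which assigns them to its worker threads and steals work from busy
    // threads to keep all cores occupied.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, data.size()),
                      [&](const tbb::blocked_range<std::size_t> &r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              data[i] *= 2.0f;
                      });
    return 0;
}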

4.4.6   Sequoia

Sequoia is a programming paradigm that extends C++ and pays special attention to the movement and placement of data in the different levels of the memory hierarchy [20]. For this purpose, it offers first-class language mechanisms that give the programmer explicit control over data. Sequoia incorporates the notion of hierarchical memory directly into the programming model to gain both portability and performance. To accomplish this, the memory hierarchy of a system is abstracted as a tree of distinct memory modules, and the program describes how data are moved and where they reside in a machine's memory hierarchy. Since the organization of memory in a multicore system plays an important role in performance, Sequoia can be thought of as an interesting approach to programming such systems. The second abstraction used in this programming model is tasks. These are defined as self-contained units of computation that include descriptions of key information such as communication and working sets. An important property of tasks is that they isolate each computation in its own local address space. In other words, a task can only access data from the address space in which it is defined.

In order to run Sequoia programs, a description of the memory hierarchy of the system to be used has to be created. This description includes information about the size of data that can reside at each level of the memory hierarchy for the specific problem that is being solved. For example, if the system under consideration is a Symmetric MultiProcessing (SMP) system, the description would include the size of the problem that fits into the shared main memory and the size of the problem that fits into the local L2 and local L1 cache of each processor. In addition, a number of task variants have to be created. Typically, the actual computation will be performed on the data that reside in the last level of the memory hierarchy. Hence, a task that handles this case is required. For the levels above, another task has to be created. Its purpose is to divide the problem size that it receives from the previous level into smaller pieces that fit into the memory of the current level. In the previous example, the total problem size that fits into the shared main memory would be divided into smaller tasks that fit into the L2 cache of each processor. Each of these tasks would be divided further into smaller tasks that fit into the L1 cache. The actual computation would be performed by the task that corresponds to this level.
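The following C++ sketch is not Sequoia syntax, but it mimics the structure just described under assumed block sizes: an "inner" task variant recursively splits the data it receives into blocks that fit the next, smaller level of the memory hierarchy, and a "leaf" variant performs the actual computation once a block fits the innermost level.

#include <cstddef>
#include <vector>

// Assumed block sizes, standing in for the machine description of two cache levels.
const std::size_t kL2Block = 1 << 16;   // elements assumed to fit in the L2 cache
const std::size_t kL1Block = 1 << 12;   // elements assumed to fit in the L1 cache

// Leaf variant: the actual computation, applied to data that fit the innermost level.
void scale_leaf(double *x, std::size_t n, double factor) {
    for (std::size_t i = 0; i < n; ++i) x[i] *= factor;
}

// Inner variant: split the data handed down from the level above into blocks that
// fit the next level of the hierarchy and recurse until the leaf variant applies.
void scale(double *x, std::size_t n, double factor) {
    if (n <= kL1Block) {
        scale_leaf(x, n, factor);
        return;
    }
    const std::size_t block = (n > kL2Block) ? kL2Block : kL1Block;
    for (std::size_t offset = 0; offset < n; offset += block) {
        const std::size_t len = (offset + block <= n) ? block : n - offset;
        scale(x + offset, len, factor);
    }
}

int main() {
    std::vector<double> data(1 << 20, 1.0);   // problem assumed to fit in main memory
    scale(data.data(), data.size(), 3.0);
    return 0;
}

In Sequoia itself, the mapping of these variants onto the actual memory levels comes from the separate machine description, so the same source code can be retargeted to a different hierarchy without being rewritten.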

4.4.7   X10

X10 [21] is a class-based object-oriented programming language designed for high-performance, high-productivity computing on high-end computers supporting tens of thousands of hardware threads. It is a member of the partitioned global address space (PGAS) family of languages. These languages allow us to think of a computation as running across multiple processors on a common shared address space. Data are distributed and reside at some processor. Each processor can operate directly on the data it contains but must use some indirect mechanism to access or update data at other processors.

X10 is based on state-of-the-art object-oriented programming ideas primarily to take advantage of their proven flexibility and ease of use for a wide range of programming problems. However, it has been adapted to the context of high-performance numerical computing. The sequential core of X10 is a container-based object-oriented language similar to Java and C++. Additional features of X10 have been added to address concurrency. An X10 program is intended to run on a wide range of computers, from uniprocessors to large clusters of parallel processors supporting millions of concurrent operations. To support this scale, X10 introduces the central concept of a place. A place can be thought of as a virtual shared-memory multiprocessor: a computational unit with a finite (though perhaps changing) number of hardware threads and a bounded amount of shared memory, uniformly accessible by all threads. In other words, a place is a repository for data and activities, corresponding loosely to a process or a processor. An X10 computation typically runs over a large collection of places. Each place hosts some data and runs one or more activities, which are extremely lightweight threads of execution. Places induce a concept of “local.” The activities running in a place may access data items located at that place with the efficiency of on-chip access. Accesses to remote places may take orders of magnitude longer.

Instead of the traditional semaphores or mutexes provided by most threading libraries to synchronize access to shared data, X10 provides the concept of atomic blocks. Atomic blocks are a high-level construct for coordinating access to shared data. They can be used to guarantee that the invariants of shared data structures are maintained even as the structures are being accessed simultaneously by multiple activities running in the same place. Finally, several other high-level constructs facilitate the programming of concurrent applications: clocks allow computations to be ordered, distributed arrays spread data across many places, and asynchronous activities can be spawned.
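
X10 code itself is not reproduced here; instead, the following plain C++ sketch (using standard threads, with illustrative counts) mimics the flavor of these constructs: each spawned thread plays the role of an asynchronous activity, and a mutex-protected critical section stands in for an atomic block that preserves the invariant of a shared counter.

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int shared_counter = 0;     // data conceptually resident at one place
std::mutex counter_lock;    // stands in for X10's atomic-block guarantee

int main() {
    std::vector<std::thread> activities;
    for (int i = 0; i < 8; ++i) {
        // Roughly analogous to spawning an asynchronous activity.
        activities.emplace_back([] {
            for (int k = 0; k < 1000; ++k) {
                // Roughly analogous to an atomic block: the counter's invariant
                // is preserved even under simultaneous access by all activities.
                std::lock_guard<std::mutex> guard(counter_lock);
                ++shared_counter;
            }
        });
    }
    for (auto& t : activities) t.join();  // wait for all activities to finish
    std::cout << shared_counter << "\n";  // prints 8000
}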

4.4.8   Chapel

Chapel [22] is another programming language that targets improved productivity in parallel programming. The language is built around four principles. First, Chapel is designed to support general parallel programming through the use of high-level language abstractions. For this purpose, it supports a global-view programming model that pays attention to how data and control flow are abstracted. The first concept used to achieve this goal is the locale. A locale is an abstraction of a unit of uniform memory access on a target architecture; that is, within a locale, all threads exhibit similar access times to any specific memory address. For example, a locale in a commodity cluster could be defined to be a single core of a processor, a multicore processor, or an SMP node of multiple processors. On top of this, global-view data structures are defined: arrays and other data aggregates whose sizes and indices are expressed globally, even though the supporting runtime system may distribute their implementation across locales. Furthermore, Chapel supports a global view of control, which means that a user's program commences execution with a single logical thread of control and then introduces additional parallelism through certain language constructs.

The second important concept in Chapel is locality-aware programming, which allows the user to optionally and incrementally specify where data and computations are placed on the physical machine. Such control over program locality is essential for achieving scalable performance on distributed-memory architectures. The third concept is the use of object-oriented programming, which has been of great importance in raising productivity in the mainstream programming community through the encapsulation of related data and functions within a single software component, support for specialization and reuse, and clean mechanisms for defining and implementing interfaces; Chapel supports objects in order to make these benefits available to parallel programming. Finally, the last principle governing Chapel is support for generic programming and polymorphism. These features allow code to be written in a style that is generic across types: the code is independent of the types of its variables, and the compiler takes the responsibility of producing the correct instructions for the types with which the code is actually called. The main advantage of such an approach is that it allows an algorithm to be expressed once, without explicitly replicating it for each possible type.
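
As a rough analogue of these ideas in plain C++ (Chapel syntax is not shown here, and the array size and locale count are purely illustrative), the sketch below keeps one logically global array, assigns each "locale" a contiguous slice of its global index space, and places the computation on the slice owned by each locale.

#include <cstddef>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t N = 1 << 20;      // global size, expressed once
    constexpr int NUM_LOCALES = 4;          // illustrative locale count
    std::vector<double> a(N, 1.0);          // logically one global array

    std::vector<std::thread> workers;
    for (int loc = 0; loc < NUM_LOCALES; ++loc) {
        // Each "locale" owns a contiguous slice of the global index space.
        std::size_t begin = loc * N / NUM_LOCALES;
        std::size_t end   = (loc + 1) * N / NUM_LOCALES;
        workers.emplace_back([&a, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                a[i] = a[i] * 2.0 + 1.0;    // computation placed with its data
        });
    }
    for (auto& t : workers) t.join();
}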

4.4.9   Habanero

Habanero [23] is a collection of research projects that address multicore programmability issues, with research performed in numerous areas of parallel software. Overall, Habanero touches almost every aspect of parallel programming; it introduces interesting new ideas but also makes extensive use of ideas that have proven efficient and easy for most programmers to grasp. On the programming-language front, new portable constructs are introduced to support explicit parallelism for homogeneous and heterogeneous multicore systems, implicit deterministic parallelism through array views [24] and single-assignment constructs, and implicit nondeterministic parallelism through unordered iterators [25] and partially ordered statement blocks. In order to compile these constructs into efficient code, a new parallel intermediate representation has been introduced into the compiler, and several transformations and optimizations are applied to it. Several more optimizations have been implemented to handle parallelism more efficiently and to make it possible to exploit larger numbers of cores, including optimizations of high-level arrays [26], iterators, synchronization, data transfer, and transactional memory operations [27].

Certainly, an important aspect of any proposed programming model is the runtime system that implements it. Habanero has put great effort into extending work-stealing task-scheduling algorithms [28] and into supporting fine-grained producer-consumer synchronization through the idea of phasers [29]. Finally, research is performed on parallel program analysis tools that spot common parallel software errors and extract, both statically and dynamically, information about the percentage of execution time contributed by the various constructs of an application.
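
To give a feel for phaser-style phased execution, the following minimal C++20 sketch (assuming a compiler with std::barrier support; the worker and phase counts are illustrative) lets a group of threads advance through computation phases in lockstep. Real phasers are considerably more general, supporting dynamic registration and split-phase operation, which a plain barrier does not capture.

#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int WORKERS = 4;
    constexpr int PHASES  = 3;
    std::barrier<> sync_point(WORKERS);  // far simpler than a phaser, but similar in spirit

    std::vector<std::thread> pool;
    for (int id = 0; id < WORKERS; ++id) {
        pool.emplace_back([id, &sync_point] {
            for (int phase = 0; phase < PHASES; ++phase) {
                std::printf("worker %d computing phase %d\n", id, phase);
                sync_point.arrive_and_wait();  // no worker enters the next phase early
            }
        });
    }
    for (auto& t : pool) t.join();
}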

4.5   MANY-CORE ARCHITECTURES

As more and more cores are expected to be integrated into a single chip, computer architects have already started experimenting with architectural options that are quite different from what we have been accustomed to, especially in general-purpose processors. Although there is no consensus yet about which of the alternatives should be used in future processors, some of them appear to have an edge over others and are used in more designs. The large number of cores draws a lot of attention to power consumption, and most of these proposals are directly connected with this issue. In general, architectures seem to be moving toward simpler designs, discarding methods that increase single-core throughput and instead freeing space to integrate even more cores on a chip. For example, out-of-order execution, which has been perfected over many generations of processors, is being abandoned in many cases. Instead, simple in-order cores are employed, which require less logic and consume less power. In addition, lower clock speeds are used, again to save power, and good overall performance is instead achieved through the better cooperation of more cores. Special attention is also paid to programmability issues, where the architecture of the memory subsystem plays a key role. Do cores share memory or not? Is there a hardware-managed cache or only fast local storage? Are the levels of memory coherent? Design decisions here determine the programming models that can extract most of the performance of these systems.

A lot of effort is concentrated on the question of whether a hardware-managed cache or software-managed local storage should be employed in such architectures. In most current systems, hardware-managed caches are kept coherent among the cores and other processors of a multiprocessor system. In other words, the system is always able to determine where the latest value of a variable can be found, whether in the cache of the requesting core, in main memory, or in the cache of another core or processor, and this latest value is always the one used. However, the protocols used to ensure coherence are relatively slow, require a significant portion of the die area of a multicore processor, and increase exponentially in size and time with the number of cache memories that have to be kept coherent. It is expected that around half of the die area of a chip will be needed simply to implement the coherency protocol when integration levels reach around 100 cores per processor. As a result, several many-core architectures have started moving toward local storage. A second reason for this move is the fact that cache memories tend to consume more power than local storage. Although simpler to implement, local storage has an important drawback: the memory subsystem is no longer a direct extension of the memory subsystem of a single-core processor. With coherent caches, programmers are not required to do anything special in their applications in order to obtain the latest value of a variable; on a system with local storage, this is no longer true, and it becomes the explicit responsibility of the programmer to control when data are moved into and out of local storage. Again, there is no agreement yet on which programming model is best suited to this task, and many such programming models are therefore tied to specific architectures.
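
To make the contrast concrete, the fragment below is a plain C++ sketch, with an ordinary stack array standing in for on-chip local storage and heap arrays standing in for off-chip memory, of the explicit staging a programmer has to perform on a local-store architecture: each tile of data is copied in, processed, and copied back, instead of being fetched transparently by a hardware-managed cache. The buffer size and the computation are chosen only for illustration.

#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t LOCAL_STORE_ELEMS = 1024;  // illustrative local-store capacity

// Process one tile entirely out of the "local store".
void process_tile(const double* global_in, double* global_out, std::size_t n) {
    double local[LOCAL_STORE_ELEMS];                 // stands in for on-chip local storage
    std::copy(global_in, global_in + n, local);      // explicit "copy in", performed by the programmer
    for (std::size_t i = 0; i < n; ++i) local[i] *= 3.0;
    std::copy(local, local + n, global_out);         // explicit "copy out"
}

int main() {
    std::vector<double> in(1 << 16, 1.0), out(in.size());
    for (std::size_t off = 0; off < in.size(); off += LOCAL_STORE_ELEMS) {
        std::size_t len = std::min(LOCAL_STORE_ELEMS, in.size() - off);
        process_tile(in.data() + off, out.data() + off, len);
    }
}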

4.6   MANY-CORE ARCHITECTURE EXAMPLES

Although the computer industry currently seems to be preparing for the arrival of many-core architectures, the number of such processors that are currently available is quite limited. Furthermore, most of them are special-purpose processors designed for specific application domains. Nonetheless, it is worthwhile to look at a few examples and their characteristics. Some of the architectures mentioned here have fewer than 100 cores, which would typically put them in the multicore category; they are included in this section because most of their other architectural features are closer to the many-core architectures we expect to see in the future.

4.6.1   Cisco 188-Core Metro Chip

In 2004, Cisco Systems introduced its CRS-1 high-end router. An interesting feature of the system is that it uses the Metro processor, a specialized and massively parallel network processor that contains 188 cores, with four additional cores provided for redundancy. Cisco designed and implemented the processor in cooperation with IBM and Tensilica. Each core is a simple 32-bit RISC processor clocked at 250 MHz, and a traditional five-stage RISC pipeline is used to improve throughput. A small cache is available for instructions, but data are kept in a local memory. The complete chip requires only 35 W to operate, and its main purpose is to handle network traffic. A simple software programming model has been employed: every packet received from the network is processed in a run-to-completion manner, which reduces the overhead of switching among different tasks; the large number of available cores allows for such an approach. This methodology is also being explored for other architectures, since it allows easier programming, and the large number of cores expected in future processors provides the resources to handle multiple applications or threads at the same time.
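
The run-to-completion idea can be illustrated with a small C++ sketch that has nothing to do with Cisco's actual software environment: each packet taken from the input queue is processed fully before the next one is touched, so no per-packet task switching or saved state is needed.

#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

struct Packet { std::vector<std::uint8_t> bytes; };  // illustrative packet type

// Placeholder for the full per-packet work (lookup, rewrite, forward, and so on).
void process_to_completion(Packet& p) {
    for (auto& b : p.bytes) b ^= 0xFF;  // stand-in transformation
}

int main() {
    std::queue<Packet> rx;                // packets received from the network
    rx.push(Packet{{1, 2, 3}});
    rx.push(Packet{{4, 5, 6}});
    while (!rx.empty()) {
        Packet p = std::move(rx.front());
        rx.pop();
        process_to_completion(p);         // finish this packet before taking the next
    }
}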

Each core of the Metro chip, also called a packet processing element (PPE), is only 0.5 mm² in size. It is worth noting that the chip is implemented in a 130-nm technology. However, a 32-nm technology is already available for building processors, which means that more than 1000 such cores could easily be put into the same die area. Although this currently holds for a special-purpose many-core processor, it is not difficult to foresee that such numbers of cores will soon be integrated into general-purpose processors.

4.6.2   ClearSpeed 192-Core CSX700 Chip

The ClearSpeed CSX700 is built from two multithreaded array processors, each containing 96 processing elements, for a total of 192; eight more processing elements are provided for redundancy. The clock speed is 250 MHz, and only 15 W is required to operate the chip under a typical workload, which puts the CSX700 in the group of processors with excellent performance per watt. Each processing element contains a 32/64-bit floating-point multiplier, a 32/64-bit floating-point adder, an integer arithmetic logic unit, a 16-bit integer multiply-accumulate unit, and 6 kB of local SRAM. It is worth mentioning that the technology used to implement the processor is 90 nm; using the current 32-nm process would allow for a much larger number of cores on the chip.

4.6.3   Tilera TILEPro64 64-Core Chip

The TILEPro64 is a many-core processor chip comprising 64 power-efficient general-purpose cores connected by six 8 × 8 mesh networks. The clock frequency is 866 MHz and power consumption is 23 W. Cache coherence across the cores, the memory, and I/O allows for standard shared-memory programming. Each processing core comprises a 32-bit, five-stage very long instruction word (VLIW) pipeline with 64 registers, 16-kB L1 instruction and data caches, and a 64-kB combined L2 data and instruction cache. The L2 caches of all the cores together form a distributed L3 cache accessible by any core and I/O device. The short pipeline reduces power consumption and limits the penalty for a branch misprediction to two cycles. Each core is an in-order processor that can issue up to three instructions every cycle; static branch prediction and in-order execution further reduce the area and power required. Translation look-aside buffers support virtual memory and allow full memory protection. As with the CSX700, this processor is implemented in a 90-nm technology.

4.6.4   Intel 48-Core Single-Chip Cloud Computer (SCC)

The SCC is a research processor from Intel. It is composed of 48 cores arranged in a 6 × 4 grid of tiles, with two cores per tile. Each core is a slightly modified P54C Pentium processor and includes a private 16-kB L1 cache and a private 256-kB L2 cache. The clock runs at 1 GHz and power consumption is 125 W. The distinctive feature of the SCC is that it has two levels of memory-coherence management. Each core has coherent L1 and L2 caches, but these are not coherent with the caches of the other cores. Furthermore, although main memory is shared among the cores, it is not kept coherent with the caches. This means that the programmer has to pay special attention to how data are distributed among cores and must maintain coherency in software whenever necessary. For this purpose, Intel developed a special message-passing architecture.

4.6.5   Intel 80-Core Teraflops Research Processor

The Teraflops Research Chip is a research processor from Intel containing 80 cores. The cores are arranged in a 10 × 8 mesh and operate at up to 5.7 GHz. We discussed power consumption issues for this processor earlier (Section 4.1.2). Each core contains two independent, fully pipelined, single-precision floating-point multiply-accumulate units, a 3-kB single-cycle instruction memory, and 3 kB of local data memory. The purpose of the chip is to explore the possibilities of many-core architectures and to experiment with various forms of networking and communication within the next generation of processors. The introduction of this processor was one of the motivations for developing the Ct programming language [19] (Section 4.4.4).

4.7   SUMMARY

The continuing validity of Moore's law and the desire to keep the power consumption of processors from exploding exponentially have led to the design and development of multicore systems. Almost all, if not all, processor vendors are currently moving in this direction. The additional transistors that can be put onto a single chip thanks to improvements in integration technology were used in the past to improve the performance of uniprocessor systems, by extending and/or replicating the functional units of these processors. However, it has recently become clear that these well-studied and perfected techniques have all reached their point of diminishing returns. The alternative currently in use is to add more cores to each chip.

The advent of multicore processors has brought new challenges. Hardware designers are currently exploring alternatives for the architectural features of processors at several levels, such as caches, numbers of cores, hardware accelerators, and interconnection networks. For all of these subsystems, new approaches are being considered or previous approaches are being adapted to the new era. This process is still ongoing, and the impact of each of these changes will soon become clear. Furthermore, the expectation that thousands of cores will be available on a single chip within a few years is leading processor architects to introduce even more radical ideas, such as abandoning well-established features of current processors. Ideas that are gaining acceptance right now include the replacement of hardware-managed cache hierarchies with software-managed local memory hierarchies and the use of simple in-order cores that require less power to operate and fewer transistors to implement.

All these changes in hardware naturally have an impact on software. Programming models used to program the parallel architectures of the recent past are being adapted to current multicore architectures, while the expectation that many more cores with quite different architectural features will soon be available is driving the introduction of new programming models targeted at these future hardware architectures. As with the hardware, it is not yet certain which of the proposed features will be easily exploitable by the majority of programmers. In any case, it is certain that the multicore era, and soon the many-core era, will change the way programmers have been thinking about programming and will require them to become familiar with new concepts.

REFERENCES

[1] G.E. Moore, “Cramming more components onto integrated circuits,” Electronics Magazine, 38(8): 13, 1965.

[2] IBM Systems Development Division, IBM System/360 Model 85 Functional Characteristics, June 1968.

[3] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta, “The SPLASH-2 programs: Characterization and methodological considerations,” in Proceedings of the 22nd International Symposium on Computer Architecture (ISCA '95), Santa Margherita Ligure, Italy, pp. 24‒36, June 1995.

[4] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach. Amsterdam: Elsevier Science & Technology, 2011.

[5] D.A. Patterson and J.L. Hennessy, Computer Organization and Design: The Hardware/Software Interface. Amsterdam: Elsevier Science & Technology, 2008.

[6] D. Tullsen, S.J. Eggers, and H. Levy, “Simultaneous multithreading: Maximizing on-chip parallelism,” in Proceedings of the 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, pp. 392‒403, 1995.

[7] D. Chandra, F. Guo, S. Kim, and Y. Solihin, “Predicting inter-thread cache contention on a chip multi-processor architecture,” in Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA '05), San Francisco, CA, pp. 340‒351, February 2005.

[8] R. Iyer, “CQoS: A framework for enabling QoS in shared caches of CMP platforms,” in Proceedings of the 18th International Conference on Supercomputing (ICS '04), Saint-Malo, France, pp. 257‒266, June 2004.

[9] M.K. Qureshi and Y.N. Patt, “Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches,” in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '06), Orlando, FL, pp. 423‒432, December 2006.

[10] E.G. Suh, L. Rudolph, and S. Devadas, “Dynamic partitioning of shared cache memory,” Journal of Supercomputing, 28: 7‒26, 2004.

[11] D. Tam, R. Azimi, L. Soares, and M. Stumm, “Managing shared L2 caches on multicore systems in software,” in Workshop on the Interaction between Operating Systems and Computer Architecture, June 2007.

[12] J.B. Dennis, “A parallel program execution model supporting modular software construction,” in Proceedings of the 1997 Conference on Massively Parallel Programming Models, Washington, DC, pp. 50‒60, 1997.

[13] Z. Budimlic, A. Chandramowlishwaran, K. Knobe, G. Lowney, V. Sarkar, and L. Treggiari, “Multi-core implementations of the Concurrent Collections programming model,” in Proceedings of the 14th Workshop on Compilers for Parallel Computing (CPC '09), Zurich, Switzerland, January 2009.

[14] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 2.2, September 2009.

[15] IEEE, IEEE Std. 1003.1c-1995 Thread Extensions, 1995.

[16] J. del Cuvillo, W. Zhu, Z. Hu, and G.R. Gao, “TiNy threads: A thread virtual machine for the Cyclops64 cellular architecture,” in Proceedings of the 5th Workshop on Massively Parallel Processing, Denver, CO, April 2005.

[17] OpenMP Architecture Review Board, OpenMP Application Program Interface, Version 3.1, June 2011.

[18] M.D. McCool, “Data-parallel programming on the cell BE and the GPU using the RapidMind development platform,” in Proceedings of the GSPx Multicore Applications Conference, Santa Clara, CA, October 2006.

[19] A. Ghuloum, E. Sprangle, J. Fang, G. Wu, and X. Zhou, Ct: A Flexible Parallel Programming Model for Tera-scale Architectures, 2007.

[20] K. Fatahalian, D.R. Horn, T.J. Knight, L. Leem, M. Houston, J.Y. Park, M. Erez, M. Ren, A. Aiken, W.J. Dally, and P. Hanrahan, “Sequoia: Programming the memory hierarchy,” in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), Tampa, FL, November 2006.

[21] K. Ebcioglu, V. Saraswat, and V. Sarkar, “X10: Programming for hierarchical parallelism and non-uniform data access,” in Proceedings of the 3rd International Workshop on Language Runtimes, Vancouver, British Columbia, Canada, October 2004.

[22] B.L. Chamberlain, D. Callahan, and H.P. Zima, “Parallel programmability and the Chapel language,” International Journal of High Performance Computing Applications, 21: 291‒312, 2007.

[23] Habanero multicore software research project web page, 2008. Available at http://habanero.rice.edu.

[24] J. Shirako, H. Kasahara, and V. Sarkar, “Language extensions in support of compiler parallelization,” in Languages and Compilers for Parallel Computing, pp. 78‒94. Berlin, Heidelberg: Springer-Verlag, 2007.

[25] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L.P. Chew, “Optimistic parallelism requires abstractions,” in Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '07), San Diego, CA, pp. 211‒222, 2007.

[26] K.E. Iverson, A Programming Language. New York: Wiley, 1962.

[27] M. Herlihy and J.E.B. Moss, “Transactional memory: Architectural support for lock-free data structures,” in Proceedings of the 20th International Symposium on Computer Architecture (ISCA '93), San Diego, CA, pp. 289‒300, 1993.

[28] F.W. Burton and M.R. Sleep, “Executing functional programs on a virtual tree of processors,” in Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture (FPCA '81), Portsmouth, NH, pp. 187‒194, 1981.

[29] J. Shirako, D.M. Peixotto, V. Sarkar, and W.N. Scherer, “Phasers: A unified deadlock-free construct for collective and point-to-point synchronization,” in Proceedings of the 2008 International Conference on Supercomputing (ICS '08), Island of Kos, Greece, June 2008.
