Foreword

I am delighted to see this new book on architectural design for soft errors by Dr. Shubu Mukherjee. The metrics used by architects for processor and chipset design are changing to include reliability as a first-class consideration during design. Dr. Mukherjee brings his extensive first-hand knowledge of this field to make this book an enlightening source for understanding the cause of this change, interpreting its impact, and understanding the techniques that can be used to ameliorate the impact.

For decades, the principal metric used by microprocessor and chipset architects has been performance. As dictated by Moore’s law, the base technology has provided an exponentially increasing number of transistors. Architects have been constantly seeking the best organizations to use this increasing number of transistors to improve performance.

Moore’s law is, however, not without its dark side. For example, as we have moved from generation to generation, the power consumed by each transistor has not fallen in direct proportion to its size, so both the total power consumed by each chip and the power density have been increasing rapidly. A few years ago, it became fashionable to observe that, given current trends, in a few generations the temperature on a chip would be hotter than that on the surface of the sun. Thus, over the last few years, in addition to their concerns about improving performance, architects have had to deal with using and managing power effectively.

Even more recently, another complicating consequence of Moore’s law has risen in significance: reliability. The transistors in a microprocessor are, of course, used to create logic circuits, where one or more transistors represent a logic bit with a binary value of either 0 or 1. Unfortunately, a variety of phenomena, such as radioactive decay or cosmic rays, can cause the binary value held by a transistor to change. Chapters 1 and 2 contain an excellent treatment of these device- and circuit-level effects.

Since a change in a bit, which is often called a bit flip, can result in an erroneous calculation, the increasing number of transistors provided by Moore’s law has a direct impact on the reliability of a chip. For example, if we assume (as is roughly projected over the next few process generations) that the reliability of each individual transistor is approximately unchanged across generations, then a doubling of the number of transistors might naively be expected to double the error rates of the chips. The situation is, however, not nearly so simple, as a single erroneous bit value may not result in a user-visible error.

The fact that not every bit flip results in a user-visible error is an interesting phenomenon. For example, a bit flip in a prediction structure, such as a branch predictor, can never affect the correctness of the computation, while a bit flip in the current program counter will almost certainly result in an erroneous calculation. Many other structures fall between these extremes, where a bit flip sometimes results in an error and other times does not. Since every structure can behave differently, the questions arise of how each structure is affected by bit flips and, overall, of how significant a problem these bit flips are. Since the late 1990s, answering these questions has been a focus of Dr. Mukherjee’s research.

By late 2001 or early 2002, Dr. Mukherjee had already convinced himself that the reliability of microprocessors was about to become a critical issue for microarchitects to consider in their designs. Along with Professor Steve Reinhardt from the University of Michigan, he had already researched and published techniques for coping with reliability issues, such as performing duplicate computations and comparing the results in a multithreaded processor. It was around that time, however, that he came into my office discouraged because he was unable to convince the developers of a future microprocessor that they needed to treat reliability as a first-class design metric along with performance and power.

At that time, techniques existed and were used to analyze the reliability of a design. These techniques were applied late in the design process to validate that a design had achieved its reliability goals. Unfortunately, they required essentially the entire logic of the design to exist. Therefore, they could not be used either to guide designers on the reliability consequences of a design decision or to make early projections of the ultimate reliability of the design. The consequence was that, while opinions were rife, there was little quantitative evidence on which to base reliability decisions early in the design process.

The lack of a quantitative approach to the analysis of a potentially important architectural design metric reminded me of an analogous situation from my early days at Digital Equipment Corporation (DEC). In the early 1980s when I was starting my career at DEC, performance was the principal design metric. Yet, most performance analysis was done by benchmarking the system after it was fully designed and operational. Performance considerations during the design process were largely a matter of opinion.

One of my most vivid recollections of the range of opinions (and their accuracy) concerned the matter of caches. At that time, the benefits of (or even the necessity for) caches were being hotly debated. I recall attending two design meetings. At the first meeting, a highly respected senior engineer proposed for a next-generation machine that if the team would just let him design a cache that was twice the size of the cache of the VAX-11/780, he would promise a machine with twice the performance of the 11/780. At another meeting, a comparably senior and highly respected engineer stated that we needed to eliminate all caches, since “bad” reference patterns to a cache would result in performance worse than having no cache at all. Neither had any data to support his opinion.

My advisor, Professor Ed Davidson at the University of Illinois, had instilled in me the need to analyze systems quantitatively in order to make good design decisions. Thus, much of the early part of my career was spent developing techniques and tools for quantitatively analyzing and predicting the performance of design ideas (both mine and others’) early in the design process. It was then that I had the good fortune to work with people like Professor Doug Clark, who also helped me promulgate what he called the “Iron Law of Performance,” which relates the number of instructions in a program, the cycles used by the average instruction, and the processor’s frequency to the performance of the system. It was during this time that I generated measurements and analyses demonstrating that both senior engineers’ opinions were wrong: no amount of reduction in memory reference time could have doubled the performance, and no “bad” reference patterns arose that negated all the benefits of the cache.
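For readers unfamiliar with it, the Iron Law is usually written as the identity below (stated here in its standard textbook form, not as any particular formulation from that era); the last factor is simply the reciprocal of the processor’s frequency:

\[
\frac{\text{Time}}{\text{Program}} \;=\; \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}
\]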

Thus, in the early 2000s, we seemed to be in the same position with respect to reliability as we had been with respect to performance in the early 1980s. There was an abundance of divergent qualitative opinions, and it was difficult to get the level of design commitment that would be necessary to address the issue. So, in what seemed a recapitulation of the earlier days of my career, I worked with Dr. Mukherjee and the team he built to develop a quantitative approach to reliability. The result was, in part, a methodology for estimating reliability early in the design process, which is described in Chapters 3 and 4 of this book.

With this methodology in hand, Dr. Mukherjee started to have success at convincing people, at all levels, of the exact extent of the problem and of how effective the proposed design alternatives were at remediating it. In one meeting in particular, after Dr. Mukherjee presented the case for concern about reliability, an executive noted that although people had been coming to him for years predicting reliability problems, this was the first time he had heard a compelling analysis of the magnitude of the situation.

The lack of good analysis methodologies resulting in less-than-optimal engineering is ironically illustrated by an anecdote about Dr. Mukherjee himself. Prior to the development of an adequate analysis methodology, an opinion had formed that a particular structure in a design contributed significantly to the processor’s error rate and needed to be protected. Then, Dr. Mukherjee and other members of the design team invented a very clever technique to protect the structure. Later, after we developed the applicable analysis methodology, we found that the structure was actually intrinsically very reliable and the protection was overkill.

Now that we have good analysis methodologies that can be used early in the design cycle, including in particular those developed by Dr. Mukherjee, one can practice good engineering by focusing remediation efforts on those parts of the design where the cost-benefit ratio is the best. An especially important aspect of this is that one can also consider techniques that strive to meet a reliability goal rather than strive to simply achieve perfect (or near-perfect) reliability. Chapters 5, 6, and 7 present a comprehensive overview of many hardware-based techniques for improving processor reliability, and Chapter 8 does the same for software-based techniques. Many of these error protection schemes have existed for decades, but what makes this book particularly attractive is that Dr. Mukherjee describes these techniques in the light of the new quantitative analysis outlined in Chapters 3 and 4.

Processor architects are now coming to appreciate the issues and opportunities associated with the architectural reliability of microprocessors and chipsets. For example, not long ago Dr. Mukherjee made a presentation of a portion of our quantitative analysis methodology at an internal conference. After the presentation, an attendee of the conference came up to me and said that he had really expected to hate the presentation but had in fact found it to be particularly compelling and enlightening. I trust that you will find reading this book equally compelling and enlightening and a great guide to the architectural ramifications of soft errors.

Dr. Joel S. Emer, Intel Fellow, Director of Microarchitecture Research, Intel Corporation
