C
Cache-coherent shared-memory multiprocessors,
288
input replication of,
234
Checking store buffer (CSB),
310
Checkpoint-based backward error recovery:
compile- and run-time methods in,
320
for shared-memory programs,
319
Chip-external fault detection,
281
Chip-level redundantly threaded processor with recovery (CRTR),
269–270
Chip-level redundant threading (CRT) processors,
240–241
Circuit-level SERs, modeling of,
44
Clock circuits, vulnerability of,
59–60
Combinatorial logic gates: masking effects in,
52
Compiler-assisted fault tolerance (CRAFT),
310
Complementary metal oxide semiconductor transistors:
field funneling effect in,
49
radiation-induced transient
switching speed of,
17,
68
Configurable transient fault detection,
306
Content-addressable memory arrays:
of data translation buffer,
143
false-negative matches in,
137
Hamming-distance-one match in,
137
lifetime analysis of,
134
of write-through and write-back cache,
144
Critical charge (Qcrit),
29
to FIT, semiempirical mapping of,
46
encoding and decoding process,
179–180
generator polynomials,
181
D
Data translation buffer,
128,
139
Deadlocks, for synchronization primitives,
321
Dependability models,
11–14
Dependence-based checking elision (DBCE),
245
Detected unrecoverable error,
process-kill versus system-kill events,
89
tolerance in application servers,
35
Double-error correct triple-error detect code,
176–178
parity check matrix for,
177
Dual-in-line packages,
63
Dual-interlocked cell (DICE),
71–72
Dual in-line memory module (DIMM),
37
Dual modular redundancy (DMR) system,
208,
259–260
Dynamically dead instructions,
95,
196,
246
Dynamically scheduled superscalar pipeline,
152–153
masking effects of injected faults in,
152
Dynamic implementation verification architecture (DIVA),
241–242
evaluating NAND function,
57–58
evaluating NOR function,
59
masking effects in,
57–59
Dynamic random access memory:
E
Electrical masking,
37,
53
Electromigration (EM),
15–16
Emitter-coupled logic (ECL),
220
recording information about,
203
Error correction codes,
Error detection by duplicated instructions (EDDI),
303
Error information, propagation of,
197
Error recovery mechanism,
254
Exponential failure law,
12
F
on conditional branches,
196
on dynamically dead instructions,
196,
199
on neutral instruction types,
198
and true errors, difference between,
197–198
on uncommitted instructions,
198
before memory commit,
277
before register commit,
263
in SRT-Memory sphere,
286
using binary translation,
306
using cycle-by-cycle lockstepping,
212
using redundant execution,
208
Fault-free checkpoint,
278,
281
natural versus induced perturbations,
274–276
Fault screening, with pipeline squash and re-execution,
173
Field data collection,
62
Field-replaceable units (FRUs),
203
chip-external fault detection using,
281
First-level dynamically dead (FDD) instructions,
95,
107,
196
Forward error recovery,
255
pair-and-spare systems,
262
triple modular redundancy system,
260–262
using triplication and arithmetic codes,
315
Fujitsu SPARC64 V processor:
Full adder, logic diagram of,
54
Full-state comparison bandwidths,
281–282
H
Hardware error recovery schemes,
254
Hazucha and Svensson model,
46
Hewlett-Packard NonStop Himalaya architecture, lockstepping in,
218–219
High-performance microprocessor,
70,
102
freeing up entries in,
279
Hot carrier injection (HCI),
18
Hybrid RMT implementation,
310
“Hydrogen-release” model,
19
I
IBM G5’s lockstepped processor architecture,
220–222
lockstepping with retry,
265
In-line error detection,
187
Instruction fetch buffer,
312
pipeline squash for, benefits of,
272–273
Instruction reuse buffer,
246
Integer register file,
312
Itanium® 2 execution unit,
108
Itanium® 2 instruction queue,
108–109
ACE and un-ACE breakdown of,
109–110
Itanium® architecture,
195
Itanium® processor,
66,
159
Itanium® 2 performance model:
evaluation methodology,
107
program-level decomposition,
108
L
addition of capacitors to,
70
in performance simulator,
148
Latch-window masking,
54–56
of ACE and un-ACE components,
124
Linear particle accelerators,
76
AVF breakdown for instruction queue with,
112
Load/store queue (LSQ),
228
Lockstepped checkers,
87–89
in HP NonStop Himalaya architecture,
218–219
in IBM Z-series processors,
220,
265
Log-based error recovery,
283
in piecewise deterministic system,
283
logic-level simulation for,
57
Logical synchronization unit (LSU),
226
Logic derating factor,
80,
118
technology scaling on,
57
Log sequence numbers (LSNs),
318
Los Alamos Neutron Science Center (LANSCE),
47
M
Machine check architecture,
202–203
Marathon InterConnect (MIC) card,
223
Mean instructions to failure (MITF),
11,
271–272
Mean time between failures (MTBF),
10
Mean time to failure (MTTF),
103,
271
of temporal double-bit error,
191
Mean time to repair (MTTR),
10
Median time to failure (MeTTF),
Metal lines, voids in,
15
Metal stress voiding (MSV),
16
Microarchitectural ACE bits,
90
Microarchitectural un-ACE bits:
idle or invalid state,
93
predictor structures of,
93
circuit enhancements,
68–74
device enhancements,
67–68
Monoenergetic neutron beam,
64
Multicore architecture, RMT in,
240
P
Pair-and-spare systems,
262
Parity prediction circuits:
for addition operation,
185
on caches and memory,
200
Perceptual vulnerability factor,
81
Pin dynamic instrumentation framework,
306
Point-of-strike fault model,
106
versus propagated fault model,
91–92
potentialCheckpoint() call,
319
Predicated false instructions,
95
Predicate register file,
312
Process-kill DUE events,
89
Program’s execution, fault-free and faulty flow of,
105
Pseudo-device driver (PDD) software layer,
323
R
Radiation exposure reduction:
with pipeline squash,
270
triggers and actions,
271
Radiation-hardened cells:
Radiation-induced transient faults,
Radioactive contamination,
Random access memory (RAM) arrays:
of data translation buffer,
142
of write-through and write-back cache,
141
Random access memory arrays, lifetime analysis of:
effect of cooldown in,
125
structural differences in,
125
working set size for,
129
Recovery mode, handling faults during,
287
Redundant execution schemes,
207
Redundantly multithreaded (RMT),
219,
222
in Marathon Endurance server,
223–225
in multicore architecture,
240
performance degradation reduction,
244
relaxed input replication,
244
relaxed output comparison,
245
in single-processor core,
227
using specialized checker processor,
241
Register name authentication (RNA),
201
Register transfer language (RTL),
102,
148
Register update unit (RUU),
228
Register value queue (RVQ),
268
Reliability and Security Engine (RSE),
201
for integer operations,
183
global checkpoint creation,
290
S
checkpoint coordination in,
291
local checkpoint creation,
290
Secondary cosmic rays,
23
Shared-memory parallel program:
deadlock scenarios for barrier and locks,
321–322
with potentialCheckpoint() call,
319
Signatured instruction streams (SIS),
299–300
tolerance in application servers,
35
Silicon-on-insulator (SOI) technology,
67–68
Simultaneous and redundantly threaded processor with recovery (SRTR) processor,
266–268
active list and shadow active list,
268
prediction queue (predQ),
268
register value queue,
268
Simultaneous and redundantly threaded (SRT)-memory:
input replication in,
232
Simultaneous and redundantly threaded (SRT) processor:
asynchronous interrupts in,
288
input replication in,
232
instruction replication in,
232
load value queue (LVQ)-based recovery in,
236,
284
output comparison in,
230
performance evaluation of,
236,
238
redundant threads in,
229
Simultaneous and redundantly threaded (SRT)-register:
input replication in,
233
Simultaneous multithreaded (SMT)
Slack fetch mechanism,
237
Soft error rates,
11,
30
of combinatorial logic gates,
56
accelerated measurements of,
62–63
cost-effective solutions to,
4–6
due to alpha particles,
63
Software error recovery,
299,
315
Software fault detection,
299
using signatured instruction streams,
299–300
Software fault-tolerance,
297
implementation options for,
298
Software-implemented fault tolerance (SWIFT),
305–306
Software RMT implementation,
298,
303
fault detection using,
301
sphere of replication of,
302
using binary translation,
306
SPEC CPU 2000 benchmarks,
95,
281
SPEC CPU 2000 floating-point (SPEC CFP),
282
SPEC CPU 2000 integer (SPEC CINT),
282
Sphere of replication,
208,
223
in Endurance machine,
223
in G5 microprocessor,
220
output comparison and input replication,
211
performance-reliability trade-off,
308–309
Static random access memory,
addition of capacitance to,
69–70
alpha particle impact on,
45
Statistical fault injection (SFI),
102,
148
architectural and microarchitectural state comparison in,
151
AVF computation using,
146
Statistical fault injection (SFI) study, at Illinois:
Store value prediction,
246
fault detection and isolation,
216–217
SWIFT-R triplication and validation,
316
Symmetric multiprocessors (SMP),
216
Symptomatic fault detection,
273
System-kill DUE events,
89
System-wide checkpoints,
283