Chapter 21. Post-Silicon Debug and Characterization (“Bring-up”) and Product Qualification

21.1 Systematic Test Fails

During wafer testing, the set of failing dies may indicate that a systematic defect is present. As described in Section 19.7, test diagnostic procedures are pursued in an attempt to localize and determine the root cause of a defect. The ATPG tool flow is exercised in diagnostic mode, using the failing pattern syndrome to identify candidate faults. Physical locations on the die (and masks) are cross-referenced to the candidate faults for further investigation.

After wafer testing, good dies will be packaged and (likely) subjected to burn-in stress to screen infant fails. After (static or dynamic) burn-in, parts are retested. If there is significant population fallout at package testing, diagnostics are again pursued; in this case, however, a greater number of potential defect sources needs to be considered. Stuck-at, transition, or bridging faults identified at this stage could be due to defects accelerated by the burn-in stress conditions, including the following:

  • “Weak” device gate oxide breakdown

  • Device parameter drift (e.g., due to the presence of material contaminants, whose diffusivity is increased at elevated temperature and voltage)

  • Material interface delamination, with structural cracks and/or permeability to humidity exposure (most often occurring at the interface between the die and the underfill material that seals the die to the package substrate)

  • Die attach metallurgy fails (e.g., solder bump cracks or lifted bond wires)

Unique diagnostic techniques are applied to identify the root causes of these defect mechanisms. Subsequent sections of this chapter discuss methods of monitoring electrical parameters. For investigation of material defects, the package is removed, and high-resolution electron microscopy of the surface and/or cross-section of the die is performed.

The final stage of SoC bring-up testing occurs during end product evaluation. Fails that arise at this level are not typically related to die fabrication or assembly defects—hopefully burn-in screening has identified those issues—but rather to functional validation escapes or test escapes in the SoC methodology.

A validation escape implies that a feature of the SoC functional specification is incompletely or incorrectly designed, and SoC validation testbenches were insufficient. System software compiled to that specification likely uncovered the error. Product bring-up may be able to proceed if a software workaround is available or if it is possible to disable a feature; indeed, micro-architects may explicitly add disable signals to configuration registers in the SoC design for new features that are especially high risk. (These disable signals are commonly referred to as chicken bits.) The SoC bring-up team is tasked with assessing the impact of product failure on the logic design, validation testbench development, physical power/performance/area resources, and schedule. The product marketing group needs to evaluate this assessment against the value of the associated feature. Assuming that the validation escape has been detected as part of first-pass product bring-up and that a subsequent tapeout has been included in the SoC project plan, the overall impact should be manageable. However, if the SoC specification has experienced feature creep during development—that is, if new requirements have been added to the original specification late in the project design schedule—the risk of a validation escape on the last planned tapeout is higher. Assessing the trade-offs between the cost of another tapeout and the value of a differentiating feature requires challenging product management decisions.

Whereas a validation escape is pervasively evident at product bring-up (i.e., an SoC feature does not behave as expected), a test escape relates to a small population of bring-up parts with a fabrication or assembly defect that was not detected by the wafer/package test pattern set. The impact to the overall bring-up activity is usually small. (Ideally, the SoC on the bring-up board is socketed, and another prototype part is substituted for the defective part with the test escape.) The SoC test engineering team is responsible for fault diagnosis—not using the failing syndromes at the tester but with product runtime information on how the part with the test defect behaves differently than the population of good parts. The capability to breakpoint system functional operation and enter an SoC scan-shift test mode to observe runtime register data is extremely valuable in this regard; this is commonly referred to as a functional scan dump of the SoC internal state. The SoC test team is tasked with augmenting the existing test patterns to demonstrate that the test escape is covered; this may involve adding functional patterns to the patterns provided by the ATPG tool for the SoC DFT architecture.

21.2 “Shmoo” of Performance Dropout Versus Frequency

In addition to functional product testing, another bring-up task is evaluation of the performance of the prototype part distribution. The SoC design will have been subjected to static timing analysis at several “slow” process corners related to performance-limiting device and interconnect variations during fabrication. Path timing will have been closed at those corners to a target sigma of the process statistical distribution.

A sweep of the SoC clock frequency is applied to the parts during bring-up to evaluate correct functionality versus frequency. The cumulative distribution of pass/fail versus frequency in silicon is informative, as it should confirm the statistical variation design models. This distribution also offers insights into the percentage of parts that may be available in various performance bins, operating at a frequency higher than the baseline spec, as illustrated in Figure 21.1.

[Figure: percent of parts versus the frequency f]

Figure 21.1 Illustration of the performance of pass/fail parts as a function of operating clock frequency.
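The binning percentages implied by Figure 21.1 can be tabulated directly from the measured maximum passing frequencies of the sample. The following sketch uses purely illustrative frequency values and bin thresholds (none of these numbers come from the text):

```python
# Hypothetical maximum passing frequencies (GHz) measured for a small
# sample of bring-up parts; the values are illustrative only.
fmax_ghz = [2.91, 3.02, 3.10, 3.15, 3.22, 3.28, 3.35, 3.41, 3.50, 3.62]

# Hypothetical performance-bin thresholds: baseline spec plus two boost bins.
bin_thresholds = {"base (>= 3.0 GHz)": 3.0,
                  "boost1 (>= 3.2 GHz)": 3.2,
                  "boost2 (>= 3.4 GHz)": 3.4}

def bin_percentages(fmax, thresholds):
    """Percentage of parts whose measured fmax meets each bin threshold."""
    n = len(fmax)
    return {name: 100.0 * sum(f >= t for f in fmax) / n
            for name, t in thresholds.items()}

for name, pct in bin_percentages(fmax_ghz, bin_thresholds).items():
    print(f"{name}: {pct:.0f}% of parts")
```

With a statistically significant sample, these percentages would feed directly into the bin pricing and supply planning discussed above.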

Bring-up may also involve a two-dimensional sweep to evaluate the functionality of a part population over ranges of applied voltage and frequency. An example of a two-dimensional pass/fail graph is provided in Figure 21.2; this plot format is familiarly known as a shmoo plot. Note that a shmoo plot is applicable to testing a single part, although it is most commonly used to represent a larger sample of the part population. It is also common to normalize the scale of the voltage and frequency axes, where 1.0 is the nominal value. Note that the voltage range on the shmoo extends beyond the published (VDD + n%) tolerance range specification.

[Figure: V plotted against f, with pass and fail regions marked]

Figure 21.2 Illustration of a two-dimensional shmoo of a sample of the part population versus applied voltage and operating clock frequency.
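The shmoo sweep itself is conceptually a nested loop over the two normalized axes. The sketch below substitutes a hypothetical `part_passes()` hook (a simple voltage-dependent frequency limit) for the real tester measurement:

```python
# A minimal sketch of a two-dimensional shmoo sweep. part_passes() is a
# hypothetical stand-in for applying the functional pattern set at one
# (VDD, frequency) operating point on the tester.
def part_passes(vdd_norm, freq_norm):
    # Placeholder model: the part passes while frequency stays below a
    # voltage-dependent limit; real data comes from tester hardware.
    return freq_norm <= 0.7 + 0.4 * vdd_norm

def shmoo(vdd_points, freq_points):
    """Return a text shmoo grid: '*' = pass, '.' = fail."""
    rows = []
    for v in reversed(vdd_points):          # highest VDD on the top row
        row = "".join("*" if part_passes(v, f) else "." for f in freq_points)
        rows.append(f"V={v:.2f} {row}")
    return "\n".join(rows)

vdd = [0.90 + 0.05 * i for i in range(5)]   # normalized 0.90 .. 1.10
freq = [0.90 + 0.05 * i for i in range(5)]  # normalized 0.90 .. 1.10
print(shmoo(vdd, freq))
```

The text grid mirrors the classic tester shmoo display: the pass region shrinks as the applied voltage is lowered, tracing the boundary shown in Figure 21.2.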

The bring-up data may be used as part of a dynamic voltage and frequency scaling (DVFS) product strategy, in which the SoC may support performance boost or (power-saving) throttle operating modes, with VDD changes coming from a voltage regulator.

There are some limitations with the traditional bring-up performance characterization methods. A statistically significant volume of parts needs to be measured to provide confidence in the binning and DVFS product strategies. The number of distinct power and clock domains on current SoC designs is large (and growing). For example, bring-up evaluation of large SRAM arrays may focus on a specific operating VDD_min for the arrays, using a distinct voltage domain from other IP. As a result, quickly and concisely capturing shmoo data for multiple domains during bring-up is becoming more difficult.

The specific circuit methods used to measure shmoo data also present an engineering challenge. Traditionally, the statistical fabrication parameters for an individual bring-up part would be adequately represented by a single, distinct measurement circuit added to the SoC design. A performance-sense ring oscillator (PSRO) circuit would be integrated, and the oscillator frequency would be measured as part of bring-up testing. (Reference [1] provides a description of methods to efficiently implement a PSRO frequency counter/decoder.) The tracking of fabrication variation across the die implied that the oscillator frequency would be representative of the overall SoC performance. For advanced process nodes, the tracking of circuit and interconnect parameter variations is more localized, and the correlation of PSRO data to overall performance diminishes. Designs have therefore begun to incorporate multiple PSRO macros placed throughout the die. Still, these stand-alone circuit measurements are difficult to interpret for bring-up performance characterization. Instead, specific test programs are developed by the product bring-up team to exercise critical timing paths for performance shmoo data generation.
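A common PSRO measurement scheme amounts to counting oscillator edges within a gate window derived from a known reference clock; the count then decodes back to a frequency estimate with a quantization error of one count. A minimal sketch, with assumed (illustrative) reference and oscillator frequencies:

```python
# Sketch of PSRO frequency measurement by gated counting. The reference
# clock, gate length, and oscillator frequency below are assumptions
# chosen for illustration, not values from a real design.
def psro_count(osc_freq_hz, ref_freq_hz, gate_cycles):
    """Oscillator periods captured in a gate window of
    gate_cycles reference-clock periods."""
    gate_s = gate_cycles / ref_freq_hz
    return int(osc_freq_hz * gate_s)

def psro_freq_estimate(count, ref_freq_hz, gate_cycles):
    """Invert the captured count back to an estimated oscillator frequency."""
    return count * ref_freq_hz / gate_cycles

ref = 100e6     # 100 MHz reference clock (assumed)
gate = 1024     # gate window length, in reference cycles
osc = 2.37e9    # "true" PSRO frequency at this die location (assumed)

c = psro_count(osc, ref, gate)
est = psro_freq_estimate(c, ref, gate)
# Quantization error is bounded by one count: ref/gate ~ 97.7 kHz here.
print(f"count={c}, estimated f={est/1e9:.4f} GHz")
```

A longer gate window improves resolution at the cost of measurement time, which matters when many PSRO macros across the die must be read during bring-up.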

The bring-up team may be asked to delve further into the diagnosis of the shmoo data to determine specific paths that are the root causes of performance failures; hopefully, there is a strong correlation between the measured performance paths and the static timing analysis path reports. The SoC project plan may allocate design resources to focus on optimizing these performance-limiting paths in subsequent tapeouts. This optimization phase may be prior to initial production ramp, or it may be scheduled for a subsequent product release. Another tapeout release after initial volume sales may be appropriate to sustain the revenue and market share; this performance specification improvement is commonly known as a mid-life kicker.

The bring-up results from the initial tapeout identify functional errors and electrical issues to be addressed as part of preparation for the next tapeout iteration. The tapeout review team determines which proposed fixes should be accepted, and it merges the design change requirements with the previously deferred items. A key consideration is the ability to identify whether the necessary design ECO updates can be limited to BEOL mask layer revisions and existing spare cells or whether an all-layer tapeout release is required (along with the corresponding cost and schedule impacts).

21.3 Product Qualification

21.3.1 High-Temperature Operating Life (HTOL) Stress Testing

A separate activity that occurs concurrently with the prototype bring-up effort is the initiation of product qualification testing. A statistically significant sample of good parts from multiple wafer lots is subjected to high-temperature operating life (HTOL) stress testing; essentially, HTOL qualification is an extension to burn-in infant defect screening. The goal is to measure data related to the failure rate of the part. To accelerate the lifetime aging and parameter drift mechanisms, an applied voltage greater than the specification for VDDmax (= VDDnom + n% tolerance) may also be part of the HTOL qualification procedure. In addition, the qualification test fixture could be developed in support of:

  • Static inputs—Fixed chip input voltage signals

  • Dynamic inputs—Patterns applied during the qualification stress testing, where the pattern set provides high internal net switching activity

  • Dynamic inputs with functional monitoring—Patterns applied throughout the stress hours, where the chip output response is measured and compared to expected values

The qualification engineering cost is certainly higher for dynamic stress testing than for application of static inputs, because more sophisticated test chambers are required. For dynamic stress testing, the pattern application rate is greatly reduced compared to the functional part specification, as circuit timing paths are slower at the elevated temperatures.
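The lifetime acceleration achieved by elevated temperature and voltage is commonly modeled with an Arrhenius temperature term and an exponential voltage term. The sketch below uses illustrative activation energy (Ea) and voltage-exponent (gamma) values; in practice these are mechanism-dependent and come from the foundry's reliability models:

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev):
    """Temperature acceleration factor (Arrhenius model)."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

def voltage_af(v_use, v_stress, gamma):
    """Exponential voltage acceleration factor, exp(gamma * dV)."""
    return math.exp(gamma * (v_stress - v_use))

# Illustrative values only: 55 C use vs. 125 C stress, 0.90 V use vs.
# 1.00 V stress, with assumed Ea = 0.7 eV and gamma = 8 per volt.
af = arrhenius_af(55.0, 125.0, ea_ev=0.7) * voltage_af(0.90, 1.00, gamma=8.0)
print(f"combined acceleration factor ~ {af:.0f}")
```

Under such a model, each stress hour in the chamber emulates roughly `af` hours of field operation, which is what allows a 1,000-hour HTOL run to project lifetime behavior.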

There are specific HTOL qualification targets for different applications markets, such as consumer, medical, automotive (engine compartment and interior), and aerospace. Several industry standards organizations participate in establishing the qualification test criteria for integrated circuits, by application area, including:

  • Joint Electron Device Engineering Council (JEDEC), specifically JESD47I

  • Automotive Electronics Council (AEC), specifically AEC-Q100

  • United States Department of Defense, specifically MIL-STD-883

These standards include definitions of the test sample size, the total stress duration, and the intermediate intervals within the total at which parts are withdrawn from the chamber and retested. The acceptable number of failing parts for each intermediate and final test is also specified. As might be expected, that number is explicitly set to zero for any high-reliability application.
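Even when a stress run completes with zero failures, a chi-square method (e.g., JEDEC JESD85) yields an upper bound on the failure rate; for zero fails at confidence level CL, the chi-square term reduces to −ln(1 − CL). A sketch with an illustrative sample size and an assumed acceleration factor:

```python
import math

def fit_upper_bound_zero_fails(n_parts, stress_hours, accel_factor, cl=0.60):
    """Upper-bound failure rate in FITs (fails per 1e9 device-hours) for a
    zero-failure stress result, using the chi-square method; for zero
    fails, chi2(cl, 2 dof) / 2 reduces to -ln(1 - cl)."""
    equiv_device_hours = n_parts * stress_hours * accel_factor
    return -math.log(1.0 - cl) / equiv_device_hours * 1e9

# Illustrative: 231 parts stressed for 1,000 hours with an assumed
# acceleration factor of 100 (values chosen for the example only).
fit = fit_upper_bound_zero_fails(231, 1000.0, 100.0)
print(f"60% UCL failure rate ~ {fit:.1f} FIT")
```

This is why the standards fix both the sample size and the stress duration: together with the acceleration factor, they determine how tight a failure-rate claim a zero-fail result can support.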

The HTOL qualification test involves two temperatures: the ambient temperature of the chamber and the local (maximum) device junction temperature. The industry standard specifications refer to the operating junction temperature, necessitating a calculation of the appropriate external environment (Tambient, VDDqual) to achieve the target junction temperature, Tj, with estimates for the internal power dissipation for the qualification patterns. This calculation is more straightforward for static inputs. For dynamic patterns, the potentially wide variation in switching activity for different functional blocks may result in a significant thermal gradient across the die and some variation in Tj. Ideally, the pattern set will have high switching activity across all SoC blocks to provide a comprehensive stress test.
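For static inputs, the ambient-to-junction calculation reduces to a single steady-state thermal resistance relationship, Tj = Ta + θJA · P. A sketch with illustrative θJA and power values (in practice these come from the package thermal model and the switching activity of the qualification patterns):

```python
def junction_temp(t_ambient_c, theta_ja_c_per_w, power_w):
    """Tj = Ta + theta_JA * P (steady-state, single thermal resistance)."""
    return t_ambient_c + theta_ja_c_per_w * power_w

def required_ambient(t_j_target_c, theta_ja_c_per_w, power_w):
    """Chamber ambient needed to reach a target Tj at a given dissipation."""
    return t_j_target_c - theta_ja_c_per_w * power_w

# Illustrative values: theta_JA = 12 C/W and 2.5 W dissipation are
# assumptions for the example, not data from a real package.
ta = required_ambient(t_j_target_c=125.0, theta_ja_c_per_w=12.0, power_w=2.5)
print(f"required chamber ambient ~ {ta:.0f} C")  # 125 - 30 = 95 C
```

For dynamic patterns, a single θJA is no longer adequate, which is exactly the thermal-gradient complication noted above.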

Although less frequently employed than HTOL, a low-temperature operating life (LTOL) qualification stress test is also defined.

In addition to HTOL stress testing on devices and interconnects, product qualification includes tests focused on investigating other packaged part failure mechanisms.

21.3.2 Thermal Cycling

Parts may demonstrate structural reliability issues related to the mismatches in the coefficient of thermal expansion (CTE) between the die, die attach, and encapsulation materials. The chamber environment is cycled over time between hot and cold temperatures, potentially with an applied voltage to the parts (e.g., multiple thermal excursions between extremes of −55°C and 125°C at ~10°C per minute; JEDEC JESD22-A104/A105; MIL-STD-883, Methods 1007/1010). In addition to electrical retesting of the parts after the thermal cycling stress, a detailed visual inspection of the packages is required to search for external cracks.

A fatigue issue arising from CTE differences may result in internal cracking of a die-to-package pin connection and a corresponding I/O (logic or parametric) test failure. In addition, the CTE stress during cycling may propagate away from the die/package material interfaces, resulting in delamination of the top metal interconnect and (low-κ, structurally weak) dielectric layers below the die surface. Electrical retesting may then present pervasive failures that would be difficult to diagnose without detailed electron microscopy analysis.
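The dependence of CTE-driven fatigue life on the temperature swing per cycle is often modeled with the Coffin-Manson relationship, in which the cycles-to-failure scale as a power of the temperature delta. The sketch below uses an assumed field temperature profile and fatigue exponent; both are material- and mechanism-specific in practice:

```python
def coffin_manson_af(dt_use, dt_stress, exponent):
    """Thermal-cycling acceleration factor, (dT_stress / dT_use) ** m."""
    return (dt_stress / dt_use) ** exponent

# Illustrative: -55 C to 125 C chamber cycling (dT = 180 C) versus an
# assumed 0 C to 60 C field profile (dT = 60 C), with exponent m = 2.5.
af = coffin_manson_af(60.0, 180.0, 2.5)
print(f"per-cycle acceleration ~ {af:.1f}x")
```

This is why a few hundred chamber cycles can represent many years of field temperature excursions.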

21.3.3 Highly Accelerated Temperature/Humidity Stress Test (HAST)

To investigate the susceptibility to corrosion-related failures, parts are placed in a chamber with both elevated temperature and high relative humidity (e.g., Tambient = 105°C to 145°C, RH = 85%; JEDEC JESD22-A110/A118). To further accelerate the water vapor permeability, the chamber would typically be at elevated air pressure as well (>> 1 atm). HAST is also known as the pressure cooker test (PCT). The parts may or may not receive an applied voltage; HAST is unique in that die power dissipation would typically reduce the moisture influx rate. However, an applied voltage that is cycled could provide additional acceleration of corrosion fails, by allowing moisture condensation during the periods when die power dissipation is zero.

21.3.4 Part Sampling for Qualification

The qualification effort includes electrostatic discharge (ESD) and latchup integrity testing of I/O circuits assembled in the specific package, as described in Sections 16.2 and 16.3.

Recall from earlier in this section that qualification uses “good” parts; that is, parts that have passed the production test pattern set are sampled, stressed, and retested. However, the SoC project manager may be faced with a dilemma. Qualification testing requires hundreds of hours and considerable cost. Ideally, if a prototype tapeout version is planned for volume production ramp, qualification is done concurrently with bring-up tasks. However, a critical decision point is reached if bring-up identifies a bug that necessitates a tapeout respin, with several key questions to address:

  • Should the qualification activity be halted to wait for the updated revision prototype hardware? The costs incurred to that point would be lost, and additional budgeting (and time) for a new full qualification would be required.

  • Should the qualification activity continue to completion? Would the qualification results be a gate to the subsequent tapeout revision release?

  • Most significantly, if the full qualification passes on the initial version, what results are still applicable to a tapeout revision? Is a “partial” qualification investment (in time and cost) suitable for the silicon update, or is a full requalification required?

The SoC project manager may present a justification to the sustaining product engineering (PE) team that a logic ECO implemented in a tapeout revision does not invalidate the qualification results from the existing silicon die and package technology. Conservatively, the PE team (and end customers) may require full qualification on the final production version.

The SoC project manager and PE team must also address the question of qualification for silicon dies sourced from multiple fabs. Typically, the foundry uses different fabrication lines for (high-throughput, low-volume) prototype wafer lots than for production volume. Although the prototype qualification involves selecting a sample of parts from multiple wafer lots to incorporate lot-to-lot variation, the wafers are likely to be from a single fab. Also, once in production, the foundry may wish to move/re-balance orders between fabs supporting the same process node. The foundry provides data demonstrating that the fabs are all equivalently qualified to the foundry’s criteria; the PE team needs to review that data to decide what (full or partial) requalification is required for multiple fab sourcing.

21.4 Summary

The bring-up and qualification phase of an SoC design project is of critical priority. The fact that these activities are pursued concurrently with the set of planned tapeout revision tasks amplifies their importance. Any issues identified during bring-up or qualification need to be urgently diagnosed, decisions need to be made promptly about the best corrective actions, and ECO updates need to be developed quickly to avoid impacting the (final) tapeout schedule.

Note that qualification failures are typically much more far-reaching and disruptive than functional validation escapes or test escapes. Any issues with package substrates, encapsulation materials, die-to-package pin connection metallurgy, and so on require engineering decisions involving the entire SoC product management organization, working with foundry and OSAT customer support. Referring again to Figure 20.5, qualification issues could result in significant product release delays and financial impact.

Although not described in this text, mechanical CAD (MCAD) software tool vendors provide algorithms to support analysis of material stress and elastic flow for complex three-dimensional geometries when subjected to mechanical forces and thermal gradients. The input to these algorithms is the geometric model description, along with material strength and surface properties. These tools are indispensable for early investigation of potential structural issues between die, underfill, and package substrate. The MCAD structural analysis, the foundry’s and OSAT’s manufacturing qualification efforts, and the verification of ESD and latchup layouts by the SoC team will hopefully result in zero qualification failures, and thus no additional product delays or lost revenue opportunities.

Reference

[1] Zick, K., and Hayes, J., “Low-Cost Sensing with Ring Oscillator Arrays for Healthier Reconfigurable Systems,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 5, Issue 1, March 2012, pp. 1–26.

Further Research

Burn-in

Describe the capabilities (and capacities) of commercial burn-in systems.

HTOL Qualification

Describe and contrast the procedures for HTOL qualification for the different standards described in Section 21.3.1 (JEDEC, AEC, and MIL-STD).

SEM and TEM (Advanced)

Research and describe the features of scanning electron microscopy (SEM) and transmission electron microscopy (TEM), as applied to semiconductor failure analysis. Describe the characteristics of sample preparation.
