6

Philosophy 101—Design Assurance Through Design Practice

As noted previously, DO-254 contains little or no technical information that an engineer can use to create safe and reliable electronics. It focuses instead on the processes and methodologies that will, first, ensure that minimal errors are inadvertently inserted into a design, and second, ensure that the errors that are inserted can be revealed and fixed. The absence of technical guidance is appropriate: codifying the technical basis or standards for a field as rapidly evolving as electronics would seriously cripple the avionics industry in a very short time. Thus the technical aspects of a design are predominantly left to the discretion of the design engineers and their company design standards, subject to the technical review and approval of FAA regulators and their designated individuals or organizations. So while the technical content of a design may not be guided by DO-254, there are other checks in place to ostensibly ensure that the technical content complies with the technical goals of the FARs.

Some people may find this aspect of design assurance rather disconcerting, but it is worth noting that if the processes used by the designers comply with the objectives in DO-254 and are executed in a conscientious manner, much of the uncertainty in the design will be taken out of the equation. Eliminating this uncertainty is an unstated objective of DO-254.

DO-254 defines design assurance as a methodology for identifying and correcting errors as a way to satisfy the regulatory requirements. In many respects the DO-254 definition summarizes much of the content of DO-254, including the lack of specific technical guidance.

The DO-254 definition of design assurance, when considered against the entire scope of high reliability system design, addresses what could be considered the “back end” of design assurance. It is essentially reactive in that it focuses mainly on the quantitative process of detecting and eliminating errors (as opposed to the mostly subjective art of identifying error-free but unsafe design features) after they have occurred or been otherwise introduced into the design. It does little to directly address the “front end” of design assurance, which is to prevent unsafe design features that are either intentionally or unintentionally designed into the hardware by engineers or design tools. DO-254 does provide for design reviews that can be used to detect unsafe design features, but it relies heavily upon reviewers who are competent to identify and understand unsafe features, which may not be possible if a company’s engineering culture is permeated by imperfect design methods and philosophies. In other words, if a company’s engineering culture promotes and standardizes the use of unsafe design features and methods, the reviewer’s perspective will reflect that culture and legitimize those features and methods, so peer reviews will only confirm that those unsafe design features and methods are “properly” in the design.

Design assurance as presented in DO-254 is implemented through multiple means at multiple points in the design process, as well as at various levels of project execution. In addition to the obvious aspect of design assurance that focuses on a well-planned and systematically executed design process, there are supporting processes that work in the background to continuously confirm that the design process is producing the correct output at the correct time and with the correct integrity. These supporting processes consist of configuration management, process assurance, validation, verification, and certification liaison, all of which are peripheral to the design process but are essential if it is going to generate designs that can satisfy the FARs.

The interaction between the design process and the supporting processes can be considered analogous to the architectural mitigation techniques described in ARP4754 and in Appendix B of DO-254. As noted in the Introduction to DO-254 chapter of this book, architectural mitigation is the means by which electronic circuits can, through high-level architectural design methods, realize a system reliability that is orders of magnitude higher than the inherent reliability of the electronic components themselves. Likewise, the use of a similar philosophy in a process system can result in a level of design assurance that exceeds what is possible just through the application of a structured design process. In this case the design process, like electronic circuits, can only realize a limited level of excellence on its own. It is the supporting processes—analogous to architectural mitigation—that enable the process-driven methods to identify, capture, and eliminate design errors and thereby increase the effective reliability of the process by many orders of magnitude, similar to how architectural mitigation captures, isolates, and nullifies failures and errors generated by electronic circuits.

The design process that is presented in Section 5 of DO-254 is an example of the “classic” design process. Classic design is the process that is often taught in basic engineering courses to introduce students to the concept of following a structured process to ensure that their designs are executed in a logical, systematic, and efficient manner that maximizes control over the design and minimizes the chance of errors. All projects, regardless of their size, complexity, or goal, will cycle through the phases of the design process even if the designers do not realize it and are not consciously trying to follow it. This is because the design process mirrors the thought processes that engineers will naturally employ when solving a problem: figure out what is wrong, decide what the fix or solution has to do if it is going to fix the problem, think of a way to implement the fix, create the fix, and then test the fix to make sure it actually fixes the original problem. The process is logical, and in the long run it is the shortest path to the goal.

The design process in Section 5 of DO-254 consists of five “phases” of design activity:

1.  Requirements Capture Phase, where the hardware item’s requirements are conceived, written, captured in documents and/or a requirements management system, and version controlled in a configuration management system as appropriate for the hardware’s DAL.

2.  Conceptual Design Phase, where the high level strategy for implementing the functionality expressed in the requirements is conceptualized and documented.

3.  Detailed Design Phase, where the conceptual design is elaborated and refined into the design (usually HDL code for PLDs and schematic diagrams for non-PLD electronic hardware) that will be implemented in hardware.

4.  Implementation Phase, where the detailed design is converted to its hardware implementation and then development tested to ensure that the hardware works as designed.

5.  Production Transition Phase, where the final version of the hardware design is readied for series production.

In theory these phases are intended to be executed in order, but in reality the practical considerations of project schedule and resources will often require that some of the phases overlap or even run concurrently. While this is not considered strictly “correct,” the process is flexible enough to accommodate such variations and still produce adequate design assurance. If an organization’s normal design practices include such overlaps and concurrency, the design process description in the project HDP should make a note of this along with a justification that substantiates the claim that such variations can be tolerated without sacrificing or impacting the design assurance of the process and its resulting hardware item.

The design process described in the HDP should also account for the continual interaction with the supporting processes, in particular the configuration management, validation, and verification processes.

The design process that is used (and documented in the HDP) does not have to be the same as the design process in DO-254, but it does have to satisfy the objectives that are stated in paragraphs 5.1.1, 5.2.1, 5.3.1, 5.4.1, and 5.5.1. Virtually any design process can be acceptable as long as the process can be shown to fully satisfy those objectives and generate the artifacts that will substantiate that those objectives were met. The processes also need not have the same phases as long as the phases that do exist support the DO-254 objectives. This “mapping” of a different design process to the process objectives in DO-254 is generally straightforward because of the way that the DO-254 process captures the natural flow of a design project from conception to implementation: as the adage goes, a rose by any other name is still a rose, so attaching different labels and phases to a design process will not change the essential flow of development as long as the process has reasonable integrity and does not contain too many unusual or excessively arcane practices.

Note that the design process does not have to satisfy the activities that are documented in paragraphs 5.1.2, 5.2.2, 5.3.2, 5.4.2, and 5.5.2 of DO-254. The activities listed in those paragraphs are suggestions for how the objectives can be satisfied, but are not mandatory aspects of the design process, and are not required for a design process to be DO-254 compliant. However, most design processes with the requisite integrity for AEH design will incorporate most if not all of the guidance contained in those paragraphs.

DO-254 Section 5 also contains subsections on acceptance testing (5.6) and series production (5.7), but neither of these topics is a significant factor in the design and development processes.

A DO-254 compliant design process differs from the classic design process in that it has supporting processes to boost the integrity of the basic design process and make it more effective. The design process has a requirements capture phase in which the solution’s requirements are developed and recorded, and the validation supporting process scrutinizes those requirements to make sure they are correct and complete, ensuring that errors introduced during that phase are minimized or eliminated. The design is created during the conceptual and detailed design phases, and the verification supporting process scrutinizes the design to make sure it is correct and complete, ensuring that errors introduced during those phases are minimized or eliminated. When the hardware is created during the implementation phase, the verification supporting process steps in again to scrutinize the hardware and conduct detailed tests on it, ensuring that the hardware is correct and complete, and that errors introduced in that phase are minimized or eliminated.

Throughout these phases, the configuration management supporting process creates an environment where each version of the requirements, design, and verification data is documented, identified, and managed, ensuring that errors will not be introduced through mishandled data. At the same time, the process assurance supporting process scrutinizes the data generated by the design and supporting processes to ensure that everything was performed when and how it was supposed to be performed, and the certification liaison supporting process scrutinizes the entire project to make sure the design and supporting processes are being implemented and executed properly.

The end result is that the supporting processes extract every bit of effectiveness from the design process to ensure that any hardware coming out of it has as much integrity as it possibly can.

However, as mentioned previously, while the design and supporting processes can effectively minimize the errors that are introduced into the design and maximize the errors that are uncovered and fixed, they are not designed to identify and fix design features that are inherently unsafe or that can adversely affect the integrity of the hardware. Identifying and fixing weak or inherently unreliable design features requires qualitative judgment backed by experience, knowledge, and technical competence—things that cannot be acquired from a process or its governing document. On the other hand, those technical skills can be fostered through adopting and adhering to a good set of design standards and philosophies that can be used within a process to minimize or eliminate the introduction of unsafe or unreliable design features.

DATDP

Design Assurance Through Design Practice (DATDP) is an engineering approach that emphasizes the use of a good design philosophy to engender solid engineering practices and methodologies to promote safe and reliable designs. It is applied at the circuit design level and is, by and large, the result of lessons learned from decades of experience in designing (and auditing) high reliability electronic systems. Fundamentally its approach is to use robust design techniques and philosophies to design reliability in, rather than rely solely on testing errors out. It is the process of proactively designing in reliability instead of reactively weeding out errors.

Like DO-254, DATDP does not prescribe specific design techniques or circuit architectures, although the design standards that may come out of DATDP might. Instead its focus is on developing a mentality that enables engineers to create designs that are safe, and then to objectively examine those designs from all angles to find potential weaknesses that could compromise the integrity of the system.

DATDP addresses three aspects of front-end design assurance: device selection, design philosophy, and design execution.

DEVICE SELECTION

Device selection focuses on the components that are used in a design. While most components will be adequate from the reliability perspective if they meet the environmental requirements of the system, there are still non-environmental considerations that can affect a component’s ability to support the necessary reliability for airborne systems.

Some recommended device selection guidance follows. The guidance in this book focuses on programmable logic devices, since that is where the current application of DO-254 is focused, but it can be applied to other components as well.

•  Let the system select the devices. In other words, let the system’s functional, reliability, safety, cost, and related requirements, not personal preference, guide the component selection process. It is acceptable to have favorite components, but their use should be dictated by the system requirements.

•  Power requirements, sequencing, and consumption versus temperature. Beyond the basic parameter of power consumption are the larger scale considerations for the range of power supplies, how the power supplies must be managed, and how the power consumption may vary with temperature. Keep a more global view and consider how a device may affect the complexity and cost of the rest of the system. For example, an FPGA that requires two power supplies that must be sequenced in a specific way may not be a good choice if its power supply and management circuits add appreciably to the complexity of the system. In addition, some devices that consume very small amounts of power at room temperature can consume considerably more power at higher temperatures. If a system’s operating temperature range overlaps the temperature at which the device consumes more power, the device may not operate at as low a power level as originally intended, and in some cases may be susceptible to thermal runaway. Research power considerations thoroughly before selecting any device.

•  Radiation tolerance. Although radiation is not normally a consideration for commercial aircraft, radiation tolerance includes susceptibility to single event upsets (SEUs), which are relatively common at the normal cruising altitude of commercial aircraft. Different semiconductor device technologies have different levels of sensitivity and susceptibility to SEUs, so keep SEUs in mind when selecting devices.

•  Service life. Do not sabotage a system by selecting a part that may go out of production during the life of the system. Look for parts that are mature (have been in large scale use long enough to establish their trustworthiness) but not likely to be discontinued anytime soon. Also look for manufacturers that ensure availability of suitable parts for medical and automotive markets.

•  Lead time. Designing a device into a circuit and then finding out that the real devices will not be available in time for your project can cause more than schedule delays; it can also result in design errors when a new device type is designed in as a replacement. Be aware of a device’s availability and lead time so that the designer will not be surprised.

•  Technical support. Not all device vendors are equal in this regard. Working with a company that has fast and reliable telephone technical support can be a real time (and cost) saver when schedules get tight and problems arise.

•  Product support. Implementing a design in a given device is the other half of creating hardware. Are the vendor’s design tools easy to use and understand? Is the user interface logical, and do the tools have features that can introduce errors into the design?

•  Packages. Does the selected device come in the type of package that best serves the needs of the system? Some package types are better suited to harsh environmental conditions than other types. Think about the environmental and electrical conditions of the system when selecting a device and its package. Device storage and handling aspects can also affect package choices. Also consider access to device pins for in-circuit testing and whether suitable sockets are available to facilitate testing.

•  Device features. Integrated circuits have been evolving for decades, and with each decade the devices get considerably more capable, with an attendant increase in complexity. FPGAs in particular come packed with more and more features ranging from mathematical resources to built-in processors. This adds enormously to the capabilities of the device, but with the capabilities comes complexity, and with complexity comes an increase in the downstream burden of verification. In addition, if the design does not use all of those features, some certification authorities may express concern over these unused functions and how the design can ensure that they will not present a safety problem.

•  Service history. Using new devices, no matter how capable and ideal they are for a design, is not always the best way to go. Devices that have no commercial or industry track record can be problematic with the certification authorities. Before selecting such a device for a design, discuss it with the certification authorities to get their concurrence.

•  Semiconductor technology. Different semiconductor materials and fabrication technologies will have different robustness characteristics, and some types of semiconductors that perform flawlessly in a consumer product may have weaknesses that could preclude their use in aircraft systems. Before selecting a device, conduct an analysis to determine whether its semiconductor material, architecture, feature size, gate type, and programming methods can affect the safety of a system when subjected to the conditions aboard an aircraft in all of its potential operating environments.

•  SEUs. Single event upset (SEU) considerations are related to semiconductor technology because susceptibility to SEUs can be highly influenced by it. Some types of devices are simply more susceptible to this error source. Soft errors will affect all technologies more or less equally and can be mitigated through system level design features such as architectural mitigation or simply refreshing data on a periodic basis. Hard errors, which can change the programming of a PLD, are a more serious issue and are dependent on the semiconductor technology.

•  PLD size. The size of a PLD can affect more than just how many circuits can be designed into it. In some cases, a device that is significantly over-sized for its application can raise questions and concerns from the certification authorities about the disposition of the unused resources. Size a PLD to fit its application while allowing room for the design to grow, but keep it within reason.

•  Speed. Faster, like bigger, is not always better. Faster devices can mean higher susceptibility to noise and errant signals, more cross-talk and reflections at the circuit card level, and more radiated noise.

•  Data retention in flash devices. Flash-based PLDs have a limited data retention period, and that period may depend on temperature. Do not rely solely on the data retention banner on the first page of a data sheet; study the data retention tables in the data sheet to get a more realistic value. It may be that the advertised 20 or more year data retention period could fall considerably if the device is operated at high temperatures.

•  Power-on performance. Some devices have to load their configuration from external memory each time they power up. If the configuration time is greater than the specified start-up time or required availability for the system, then it may not be wise to select that part.

DESIGN PHILOSOPHY

Design philosophy employs mental attitude, thought processes, and design rules that guide the design process and provide a sound basis for creating safe and reliable designs. Design philosophies are used to create a mindset and attitude that are conducive to high integrity designs.

The heart of DATDP is a set of rules that together comprise a design philosophy that has served well as the basis for creating safe and reliable designs. These rules, facetiously called “Roy’s Rules,” are at times frivolous, but in their essence they embody an engineering mindset which, when applied to a design, can make the difference between a reliable, high integrity system and a weak or marginal one.

Roy’s Rule #1: Passing the Buck Is Expensive (or Carry Your Own Burden)

Experience has demonstrated that the later a problem is fixed, the harder and more expensive the fix becomes. Passing a task to the downstream user of a product has a similar effect, and increases the likelihood of errors as well. For example, if a requirement author does not want to go through the effort of writing down the fine details of a specification and elects to let the reader of the document “figure it out themselves,” the requirement has just become considerably more expensive, and the chance of errors has gone up as well. A document (or in this case a requirement) will typically have one author and many readers. The author knows the topic best and can provide the details more efficiently than anyone else, whereas the readers will have less expertise and will take considerably more time to generate the same information. Letting the readers derive the details for themselves is therefore a bad business decision at best, and could be a problematic one as well, since the reader is more likely to make a mistake when deriving that information. Suppose that a requirement author could write down the details of one requirement in five minutes, whereas a reader might take ten minutes to figure it out on their own. If ten readers will use that requirement, then the author’s failure to provide the detailed information has increased the cost of that requirement by a factor of 20. Multiply that figure by the number of incomplete requirements in a document (which is likely to be significant given the attitude of the author), and it quickly becomes clear that the simple act of delegating the details to the reader can be a very costly proposition, and that does not even include the cost and time associated with fixing errors that may occur.
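
The arithmetic behind that factor of 20 can be made explicit with a trivial sketch (all figures are the chapter's assumed values, not measured data):

```python
# Cost arithmetic from the scenario above. The minute figures are the
# chapter's illustrative assumptions, not measurements.
author_minutes = 5    # author writes the detail down once
reader_minutes = 10   # each reader derives it independently
readers = 10

cost_if_documented = author_minutes           # 5 minutes, total
cost_if_delegated = readers * reader_minutes  # 100 minutes, total

print(cost_if_delegated // cost_if_documented)  # factor of 20
```

And this ratio applies per incomplete requirement, before counting the cost of any errors the readers introduce while deriving the details.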

It is tempting for a requirements author to leave some details to the readers to figure out, normally as a way to avoid spending time on something that the author may feel the readers can figure out on their own. In other cases, the author does not realize that their knowledge or experience is not common or universal. As described earlier, however, delegating the details in this way can result in wasted time and effort, delays due to mistakes in figuring out the requirement details, delays and errors when the requirements are used for verification (even more if the verification engineer misinterprets the requirement and creates erroneous test cases), and expensive hardware fixes if the mistaken interpretations are not caught early and find their way into hardware. DO-254 lists completeness as one of the objectives of validation, which means that a derived requirement that does not provide all relevant details is insufficient and should be corrected.

Another area where there is a strong temptation to delegate work is writing comments in HDL source code. Again, the reason is typically that a designer feels (with good intention) that some level of passing effort to the reader is harmless and will save time and effort on their part, not realizing that it will multiply the time and effort expended by anyone who reads the code, such as by a reviewer during a code review, when making changes later in the project, or when reusing the code at a later date in a new application. And as with requirements, poorly commented code will require that the reader interpret and figure out the functionality, which means there are now opportunities for misinterpretations and outright mistakes, which can result in more serious delays and costs if the mistakes find their way into the final hardware. Well-commented code will often have more lines of comments than lines of code. While this seems unnecessary to some people, anyone who has had to review or work on code that was commented less than this can testify that delegating the details to the reader is not the way to save time and effort.

Roy’s Rule #2: Predictability May Be Boring in People But It Is Goodness in Electronics

Deterministic operation, where there are no variations or uncertainty in how a system operates each time it operates, is one of those behavioral traits that can be immensely boring when exhibited by a person, but highly attractive in an electronic system. A deterministic system will behave predictably under all operating conditions, and is essentially incapable of changing its behavior regardless of the inputs it receives from the rest of the system. When a system or circuit is designed to operate in this manner, it is inherently immune to outside influences, including abnormal inputs. However, it is not the deterministic behavior that creates this immunity; instead, the determinism and immunity are both characteristics of a type of system or circuit that controls or is independent of, rather than reacts to, its environment. Or alternatively, a system will exhibit deterministic behavior if its inputs define its data output but not its operation and behavior.

A very simple example of this concept is shown in Figure 6.1 through Figure 6.3. Figure 6.1 shows a very simple finite state machine (FSM) that implements a control interface for an analog to digital converter (ADC) device. It is typical of how control circuits are implemented, particularly for devices that provide feedback. In this case, the state machine holds in an idle state until it receives a convert trigger, at which point it increments to state 1 to start a conversion cycle in the ADC, and then increments to state 2 to wait for the ADC’s end of conversion (EOC) signal. When the EOC signal is received, the state machine switches to state 3 to assert the ADC read control signals, and then to state 4 to read the ADC’s output data before going back to state 0 to await the next conversion trigger.

FIGURE 6.1 Example One—Typical FSM Control Circuit

FIGURE 6.2 Example Two—Deterministic Circuit

FIGURE 6.3 Example Three—More Deterministic Circuit

An examination of this implementation to identify potential failure modes will immediately zero in on state 2 because the state machine could latch in that state if the end of conversion signal does not arrive as expected. There are also three unused states (states five, six, and seven) that could become undefined states if not properly managed in both the source code and the synthesis tool.
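
The hang risk in state 2 can be demonstrated with a simple behavioral model. The following Python sketch is an illustrative simulation, not the HDL implementation itself; the state numbering follows the description above, and the trigger and EOC timing are invented for the demonstration. Once the EOC signal fails to arrive, the machine latches in state 2 indefinitely:

```python
# Behavioral model of the Figure 6.1 state machine (illustrative only).
# States: 0 = idle, 1 = start conversion, 2 = wait for EOC,
# 3 = assert read strobes, 4 = read data, then back to 0.

def fsm_step(state, convert, eoc):
    """Advance the ADC-control FSM by one clock."""
    if state == 0:
        return 1 if convert else 0
    if state == 1:
        return 2                  # conversion started; go wait for EOC
    if state == 2:
        return 3 if eoc else 2    # hangs here if EOC never arrives
    if state == 3:
        return 4
    return 0                      # state 4: data read, back to idle

def run(cycles, eoc_arrives):
    """Clock the FSM; trigger on clock 0, EOC (if any) on clock 5."""
    state, trace = 0, []
    for clk in range(cycles):
        convert = (clk == 0)
        eoc = eoc_arrives and (clk == 5)
        state = fsm_step(state, convert, eoc)
        trace.append(state)
    return trace

print(run(10, eoc_arrives=True))   # completes the cycle, returns to idle
print(run(10, eoc_arrives=False))  # latches in state 2 indefinitely
```

The model also makes the unused-state concern concrete: a 3-bit state register has states 5 through 7, which this code never returns, so their behavior in real hardware depends entirely on how the source code and synthesis tool handle them.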

An improvement on this circuit is shown in Figure 6.2. The circuit in Figure 6.2 employs a counter in place of a state machine. The counter is reset to zero by the conversion start trigger, and then increments from zero to its terminal count, where it then latches and awaits the next conversion trigger. The control signals for the ADC are generated by decoding the counts. This circuit is semi-deterministic in that it does not have an equivalent to state 2 in the previous example, nor does it contain unused states, but it does include a trigger input that resets the counter to zero (its starting state).

One significant difference between these two examples is that the second circuit, if using the same type of asynchronous ADC as the first circuit, will count through the ADC’s maximum or worst-case conversion period (taken from the data sheet) and then read the output data rather than wait for the EOC signal from the ADC, and thus operate the ADC in open-loop fashion. However, for this type of circuit a better selection would be a synchronous ADC in which the conversion and its output are controlled explicitly by the control signals generated by this circuit.

For an asynchronous ADC the use of count states to mark time ensures that the circuit will not hang up if the EOC fails to appear for whatever reason. Thus the effect of a failed ADC or interrupted EOC signal will at most result in null data, a missed sample, or invalid data, which can be managed at the system level, and will not affect the circuit’s operation in any way. This approach creates a distinct division between the circuit’s (and system’s) operational and data characteristics, in particular ensuring that the circuit will operate normally and deterministically regardless of whether the data or its source becomes corrupted or otherwise untenable.
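
Under the same modeling assumptions, the counter-based approach of Figure 6.2 can be sketched as follows. The terminal count and decode points here are invented for illustration; in a real design the terminal count would be sized from the ADC's worst-case conversion time:

```python
# Behavioral model of the Figure 6.2 counter-based controller
# (illustrative only; terminal count and decode points are invented).

TERMINAL = 8  # hypothetical terminal count, sized to worst-case conversion

def counter_step(count, trigger):
    """One clock of the control counter: reset on trigger, count up,
    then latch at the terminal count until the next trigger."""
    if trigger:
        return 0
    return count if count == TERMINAL else count + 1

def decode(count):
    """ADC control strobes decoded directly from the count; there is
    no wait state that can hang, and no unused states."""
    return {
        "start_conversion": count == 1,
        "read_strobe": count == TERMINAL - 1,
    }

# Usage: one conversion cycle triggered at clock 0.
count, trace = TERMINAL, []
for clk in range(12):
    count = counter_step(count, trigger=(clk == 0))
    trace.append(count)
print(trace)  # counts up once, then latches at the terminal count
```

Because the strobes are pure decodes of the count, a missing EOC can at worst corrupt the sampled data; it cannot alter the controller's sequencing.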

The most deterministic, and thereby robust, of these example circuits is shown in Figure 6.3. This circuit improves on example two by removing the trigger input and turning the circuit into a free running counter that operates independently of all external signals (except the clock). In all other respects this circuit operates identically to example two.

This circuit approach works best in a master role where the counter establishes the periodicity or frame rate for the system, so it would most commonly be used at a higher level than shown in the example. In this type of system every function and circuit is operated and timed by decoding the counts generated by the central counter, ensuring that every aspect of the system is precisely timed with respect to overall operation and to each other. Systems designed with a central free running counter of this type will exhibit truly deterministic behavior where every node, register, and even data value can be precisely predicted and modeled at any point after the system starts operating.
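
The fully deterministic variant of Figure 6.3 reduces to a free-running count from which everything else is decoded. A minimal sketch, assuming a hypothetical 16-clock frame and an invented decode point:

```python
# Sketch of the Figure 6.3 free-running frame counter (illustrative only;
# the frame length and decode point are hypothetical values).

FRAME = 16  # hypothetical frame length in clocks

def frame_count(clk):
    """Free-running counter: state is a pure function of elapsed clocks,
    independent of every external signal except the clock itself."""
    return clk % FRAME

def adc_start(clk):
    """Example decoded strobe: start a conversion at count 2 of each frame."""
    return frame_count(clk) == 2

# Determinism: the strobe at any future cycle is known in advance,
# with no inputs required.
print([clk for clk in range(64) if adc_start(clk)])  # one strobe per frame
```

Since no input can perturb the count, every node and register timed from it can be predicted exactly for any cycle after start-up, which is precisely the deterministic behavior described above.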

Roy’s Rule #3: Use HDL to Design Logic, Not Behavior

Of the three levels of HDL design (behavioral, RTL, and structural), register-transfer level (RTL) is most compatible with the objectives and processes in DO-254. RTL designs are compatible with a pin level description of requirements and the respective pin level testing. Structural HDL, because of its very low level of design expression, will increase the cost of a program by requiring that the processes and methodologies in DO-254 be applied at a sub-functional level. Behavioral HDL, which has the advantage of allowing a PLD design to be expressed at a functional level, might save time and effort during the design phase of a program, but because the code describes the behavior of the design rather than the hardware itself, the logic design is left to the design tools rather than the designer. Since the processes and algorithms in the design tools are not visible to the designer, there is no way other than exhaustive testing to determine conclusively whether the circuits created by the tools are in fact the circuits that the designer intended to put into the design. In addition, the high level at which the design is coded results in larger elements, which can affect the elemental analysis of the design. These loose ends represent a large unknown in design assurance.

RTL coding provides a good compromise between the other two coding methods. It allows the designer to have a high level of control over the implementation of the design while keeping the size and complexity of the elements at a level equivalent to that of a circuit card, which is the level for which DO-254 was written. The added advantage of RTL coding is that its lower level of design expression allows the design to avoid many of the pitfalls that can be encountered during synthesis when the tools' problematic reduction features are engaged. An example of this phenomenon is discussed in the second example circuit in the Design Execution portion of this chapter.
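To make the distinction concrete, the following sketch (all names invented for illustration) codes a simple two-bit count sequence at the RTL level, where the designer writes out the next-state logic of every register; a behavioral version of the same function would reduce to a single arithmetic statement and leave the gate-level realization entirely to the tool:

```vhdl
-- Illustrative sketch only (entity and signal names are assumptions):
-- a two-bit mod-3 sequence coded at RTL.  A behavioral version might
-- simply write
--   count <= (count + 1) mod 3;
-- and leave the logic design entirely to the synthesis tool.
library ieee;
use ieee.std_logic_1164.all;

entity mod3_rtl is
  port (clk : in  std_logic;
        q   : out std_logic_vector(1 downto 0));
end entity;

architecture rtl of mod3_rtl is
  signal q0, q1 : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Next-state equations written out by the designer:
      q0 <= (not q0) and (not q1);  -- sequence 00 -> 01 -> 10 -> 00
      q1 <= q0;
      -- The unused state 11 falls back to 10 and then rejoins the
      -- normal sequence, so the circuit recovers by construction.
    end if;
  end process;
  q <= q1 & q0;
end architecture;
```

Because every register and gate is visible in the source, the designer, not the tool, decides what happens in the unused state.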

Roy’s Rule #4: Make Your Circuit Bulletproof Even If No One Is Shooting at It

This should really be obvious for engineers who appreciate the seriousness of level A design. It is not enough to design a circuit that works; a conscientious designer who focuses on safety and reliability will do their utmost to make sure that their circuits are bulletproof and will not malfunction or misbehave for any foreseeable operating condition (this phrase from the FARs should be familiar by now). An example of this rule is discussed in the first example circuit in the Design Execution portion of this chapter.

Roy’s Rule #5: Use Top-Down Design or Prepare to Go Bottom-Up

Top-down design, where a system or function is first defined at the topmost level and then decomposed downward to the lowest levels of the design, is the only recommended methodology for defining a system's functionality and design. Bottom-up design, which defines a system or design from the bottommost elements and then attempts to work upward to define the system, is never recommended and normally results in curious designs and programmatic disasters.

However, the combination of top-down design and bottom-up implementation, where a hardware item is defined and designed from the top down and then assembled and tested from the bottom up, results in the best of both worlds.

Roy’s Rule #6: Find Your Own Failure Modes

There are two places where this rule should be applied: first, finding all possible and potential failure modes of a design; and second, finding and knowing our own failure modes, or in other words, understanding ourselves and our issues so that we can prevent ourselves from sabotaging our own work. The first is actually the easier of the two since designs are generally straightforward and can be analyzed to consider all possible sources of failure. The second, on the other hand, can be complex and difficult due to the unpredictability of human nature and experience. However, understanding ourselves to the point where we can recognize our limitations, biases, and eccentricities and how they affect our work will allow us to mitigate their negative effects.

Roy’s Rule #7: Never Assume

An assumption is a decision that is based on a lack of knowledge—being fully informed means that assumptions are unnecessary. Some assumptions can be based on incomplete rather than no knowledge, but even then it is still a lack of knowledge that leads to an assumption instead of an informed decision. Making an assumption can jeopardize all downstream decisions and actions, so the best course of action is to avoid assumptions and rely as much as possible on informed decisions.

Roy’s Rule #8: Do Not Ask for Trouble (Avoidance Is Better Than Mitigation)

This means that it is a better idea to avoid any kind of risky or questionable design feature than to put one in and then mitigate its effects. For example, one-hot state machines can be unreliable, so rather than design a one-hot state machine into a PLD and then find ways to mitigate its weaknesses, it is better to not use one at all and instead rely on an alternative circuit type that is more reliable. Intentionally introducing a weakness into a design is not compatible with the goals of safety critical design no matter how well the weakness is mitigated.
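As an illustration of the avoidance principle (the entity, port, and state names below are hypothetical, not from the text), a binary-encoded state machine can be written so that every illegal state code recovers on the next clock, removing the weakness rather than mitigating it afterward:

```vhdl
-- Hypothetical sketch: a binary-encoded state machine in which any
-- corrupted or unused state code recovers to IDLE on the next clock.
-- All names here are invented for illustration.
library ieee;
use ieee.std_logic_1164.all;

entity safe_fsm is
  port (clk, start, done : in  std_logic;
        busy             : out std_logic);
end entity;

architecture rtl of safe_fsm is
  -- Explicit binary encoding: with two bits and three legal states,
  -- only one illegal code ("11") exists, and it is handled below.
  signal state : std_logic_vector(1 downto 0) := "00";
  constant IDLE   : std_logic_vector(1 downto 0) := "00";
  constant RUN    : std_logic_vector(1 downto 0) := "01";
  constant FINISH : std_logic_vector(1 downto 0) := "10";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if state = IDLE then
        if start = '1' then state <= RUN; end if;
      elsif state = RUN then
        if done = '1' then state <= FINISH; end if;
      elsif state = FINISH then
        state <= IDLE;
      else
        state <= IDLE;  -- any illegal code recovers in one clock
      end if;
    end if;
  end process;
  busy <= '1' when state = RUN else '0';
end architecture;
```

A one-hot version of the same machine would have thirteen illegal codes for its four flip-flops instead of one, each of which would need its own mitigation.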

Roy’s Rule #9: DO-254 Is Our Friend

DO-254 is a collection of industry best practices. They can improve the reliability of a system and even reduce development costs if applied properly, so the processes in DO-254 should be embraced, not shunned or avoided. In fact, considering that DO-254 contains best practices, anyone creating level A hardware should already be complying with DO-254.

Roy’s Rule #10: Review Now or Pay Later

It is tempting to save time and money by skipping or skimping on peer reviews, or even by conducting low quality reviews. However, this will only cause errors to be overlooked, which allows them to propagate downstream and become more expensive to fix. It is far easier and less expensive to spend a little extra time and effort on a peer review to make sure the review is complete and thorough, than it is to fix any problems that get through because of an inadequate review.

Roy’s Rule #11: Just Deal with It

Level A design is often difficult and expensive. Rather than fight it, it is usually a lot easier in the long run to just accept it and deal with it. The same applies to complying with DO-254: the most expensive approach to DO-254 is to try to avoid it, and the more you try to avoid it the more expensive it gets.

Roy’s Rule #12: Ignore the Trees

Sometimes, especially when a project is behind and time is critical, the need to focus on the immediate task can make it hard to see the big picture. When making decisions, always remember to look at the impact of every decision on the long-term conduct of a project and not just on the short-term cost or benefit. What might work well, look attractive, or solve an immediate problem may not work well or could even cause problems in the long run.

Roy’s Rule #13: Require Requirements

As described elsewhere in this book, DO-254’s processes focus on functions and functionality, and the way functionality is defined is through requirements. And as described in the Requirements, Validation, and Verification chapters of this book, the number and quality of the requirements will have an enormous influence on the cost and effort of complying with DO-254. With requirements being as important as they are, it makes sense to invest a great deal of effort in producing high quality requirements. Experience in DO-254 has shown that requirements are consistently the single most influential data item when it comes to affecting the course of a development project, so to even consider shortchanging the requirements is inviting potential increases in development time, cost, and effort.

Requirements are not an appendage to a project, nor are they a documentation burden. They are a critical part of the design and verification processes, and it is not an exaggeration to state that the quality of the requirements will dictate the conduct of the entire project. So it behooves us to treat requirements with respect and to never try to save time and money by reducing the effort and time that is put into them.

Roy’s Rule #14: Have No Faith

It can be easy to have faith in our designs, but often the reality is that our faith is misplaced or is not realistic. In other words, believing something does not make it so. Faith- based certification is not an accepted approach to compliance with the FARs; the only acceptable approach is to support all decisions and claims with hard data, so we need to know, not believe.

Roy’s Rule #15: There Is No Hope

Hope is one of those terms or concepts that should never find its way into the business of Level A design. We should never hope that our designs are safe, we should only know it for a fact. If we find ourselves hoping for the best, then we probably have not done our job with diligence. In level A design there should be no hope, only certainty.

DESIGN EXECUTION

Design execution combines the device selection and design philosophy aspects of DATDP and applies them to the creation of individual circuits. Design execution is best presented through simple circuit examples as opposed to descriptions and concepts. The following FPGA design examples describe actual circuit issues to illustrate how design philosophy can be used to create more reliable designs and to mitigate potential error sources.

Example 6.1: Shift Registers

Shift registers are one of the simplest and most useful of the basic digital logic circuits. However, because of their simplicity and utility they often escape scrutiny and may be overlooked as a source of potential errors.

Figure 6.4 illustrates a simple three-stage shift register that shifts on the rising edge of a common clock in an FPGA. Upon examination of this circuit and application of “Roy’s Rules” number 6 (find your own failure modes) to it, there is a potential error mode or weakness that is based on the fundamental criteria of how shift registers work: the hold time of each register’s output must equal or exceed the following register’s input setup time with respect to the rising edge of its clock input. In other words, the shift register will no longer work properly if the timing between a register’s clock edge and input setup time, and the previous register’s output hold time, deviate from normal specifications. These deviations can result from a number of factors, including a reduction in the output hold time of the preceding register, an increase in the input setup time, or an unequal delay in the active edge of the clock. The most common of these is a delay in the clock edge due to routing delays, which causes the clock edge to reach each register at a different time (clock skew).

FIGURE 6.4 Simple Shift Register

FIGURE 6.5 Waveforms for a Properly Functioning Shift Register

FIGURE 6.6 Shift Register with Clock Skew in Second Stage

Figure 6.5 shows typical waveforms for a properly functioning shift register, where the rising edge of the clock arrives simultaneously at each of the three registers. As can be seen in the waveforms, each bit of the input data stream is shifted through the three register stages and appears at the output of the shift register.

In contrast to the normally functioning shift register, Figure 6.6 shows the waveforms for a shift register where the rising edge of the clock arrives late at the second of the three registers. Because the rising edge of the clock arrives later than the hold time of the first shift register stage, the second register will immediately shift through the data going into the first stage, causing each bit of data to shift into both the first and second stages at the same time.

Clock skew can be caused by a number of factors, such as any of the following:

1.  A failure to use a dedicated low-skew clock net in the device.

2.  Failure of the dedicated low-skew clock net to perform to its expectations (in actual devices these nets have been shown to occasionally still have enough skew to cause shift registers to malfunction).

3.  The need to use normal routing resources because there are too many clocks for the available clock nets.

4.  The place and route tool automatically removed the clock from the clock net because its assignment there conflicted with the tool’s placement rules.

5.  The clock could not be placed on the clock net because access to the net was determined by pin assignment, and the input clock signal was assigned to the wrong pin on the device.

Since the introduction of clock skew can have a number of causes, and some of those causes may be inadvertent or even unknown to the designer, the application of Roy's Rule number 4 (make your circuit bulletproof even if no one is shooting at it) is a good way to prevent potential clock skew problems.

FIGURE 6.7 Shift Register with Controlled Delays

How does one make a shift register reliable? Since the problem stems from clock skew exceeding the hold time of the previous shift register stage, and assuming there is little that can be done about this skew, the logical alternative is to design a shift register that is impervious to clock skew (within realistic limits of course), and one way to do that is to force the hold time of the previous stage to exceed any expected clock skew.

Figure 6.7 illustrates one way to do this using synchronous logic. In this solution, extra registers that use the alternate phase of the clock are inserted between the shift register stages to implement controlled delays (as opposed to the uncontrolled delays that would result if asynchronous logic were added between stages). While this may almost double the number of registers in the shift register, the cost of the additional registers is insignificant in a register-rich FPGA.

Figure 6.8 shows the waveforms for this shift register. Because the newly inserted shift register stages operate off the alternate (falling) edge of the clock, the shift register will now operate correctly for any amount of clock skew up to the period from the rising to falling edge of the clock, which for a symmetrical clock will be about half the clock period.
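A VHDL sketch of this technique, reconstructed from the description of Figure 6.7 (the entity and signal names are assumptions), might look like this:

```vhdl
-- Sketch of a skew-tolerant shift register per the description of
-- Figure 6.7.  Signal names are assumptions, not taken from the figure.
library ieee;
use ieee.std_logic_1164.all;

entity skew_tolerant_shift is
  port (clk  : in  std_logic;
        din  : in  std_logic;
        dout : out std_logic);
end entity;

architecture rtl of skew_tolerant_shift is
  signal s1, s2, s3 : std_logic := '0';  -- rising-edge shift stages
  signal d1, d2     : std_logic := '0';  -- falling-edge delay registers
begin
  -- Main stages clock on the rising edge, as in Figure 6.4.
  rising_stages : process (clk)
  begin
    if rising_edge(clk) then
      s1 <= din;
      s2 <= d1;
      s3 <= d2;
    end if;
  end process;

  -- Inserted delay registers clock on the falling edge, so each main
  -- stage's input is held stable for half a clock period after the
  -- rising edge, making hold time immune to realistic clock skew.
  falling_stages : process (clk)
  begin
    if falling_edge(clk) then
      d1 <= s1;
      d2 <= s2;
    end if;
  end process;

  dout <= s3;
end architecture;
```

Data still advances one stage per clock; the falling-edge registers simply guarantee that each stage's input cannot change until half a clock period after the (possibly skewed) rising edge.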

As noted previously, the lesson from this example is not how to design a robust shift register, but rather that applying a high integrity design philosophy can predict and identify potential failure modes in even the most simple and unassuming circuits that normally escape scrutiny. Identifying and correcting these weaknesses can significantly reduce the potential for latent failures.

FIGURE 6.8 Waveforms for Shift Register with Controlled Delays

Example 6.2: Synthesis Tools

Circuit designers, particularly the less experienced ones, have a tendency to put too much trust in their design tools. The problem with this is that the tools may not entirely deserve that trust.

Figure 6.9 shows the VHDL code for a very simple clock divider that divides a source clock by three. It is simply a modulo-three counter that counts from zero to two and then back to zero, with count three being an unused count. There is no reset in this circuit because the synchronous device in which it is used needs its clock during reset to operate and initialize. The circuit diagram for the code is shown in Figure 6.10. Because the most significant bit of the counter is fed back to its synchronous clear input, both the two and three count values (binary values 10 and 11) will cause the counter to synchronously reset and restart its count sequence. Thus, if for any reason the counter reached count three (11 binary) it would immediately recover and start over.
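A minimal VHDL sketch of such a divider, consistent with the description above (the entity and signal names are assumptions and may differ from the code in Figure 6.9):

```vhdl
-- Reconstruction from the description: a two-bit counter whose MSB is
-- fed back to a synchronous clear, giving the count sequence 00, 01, 10.
-- Entity and signal names are assumptions.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity div3 is
  port (clk : in  std_logic;
        q   : out unsigned(1 downto 0));
end entity;

architecture rtl of div3 is
  signal count : unsigned(1 downto 0) := "00";
begin
  -- No reset: the device needs this clock during initialization.  The
  -- MSB feedback synchronously clears the counter from count 2 *or*
  -- count 3, so the unused count 11 recovers on the next clock.
  process (clk)
  begin
    if rising_edge(clk) then
      if count(1) = '1' then
        count <= "00";
      else
        count <= count + 1;
      end if;
    end if;
  end process;
  q <= count;
end architecture;
```

The key property is that the clear term depends only on the MSB, so both binary 10 and binary 11 restart the sequence, which is the fail-safe behavior the synthesis tool later optimized away.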

Figure 6.11 is the logic diagram for the expected output from a synthesis tool. Its truth table shows that it should behave in the same manner as the original design.

Figure 6.12 is the logic diagram for the actual output of a synthesis tool that processed the VHDL for the clock divider, taken directly from the synthesized netlist output of the tool. A quick analysis of the synthesized output reveals that it not only looks different from the original circuit, it also behaves differently in that it will latch in count three. This is a potential failure condition which, because the circuit generates a clock, could cause a system to lock up and fail. Hardware testing showed that the PLD would in fact lock up regularly on power up or during a power transient because the two registers in the circuit would randomly initialize to count three.

FIGURE 6.9 VHDL for a Simple Clock Divider

FIGURE 6.10 Logic Diagram for a Simple Clock Divider

FIGURE 6.11 Expected Output of Synthesis Tool

FIGURE 6.12 Actual Output of Synthesis Tool

A call to the synthesis tool manufacturer revealed that the tool was doing what it was designed to do—it recognized the circuit as a modulo three counter and intentionally changed the circuit’s topology to be as efficient as possible (eliminating the unused state reduced the size of the circuit by a logic cell or two). Unfortunately it did so at the expense of changing a fail-safe unused count into a weakness that could cause a system failure. The tool manufacturer also revealed that the problematic reduction algorithm could not be turned off or defeated in that model of the tool. Since there was no software “switch” in the design tool user interface to turn off that feature, there was no way for the designer to know that that algorithm even existed, let alone that it could cause problems.

The synthesis tools used for PLD logic reduction are another example of the capability/complexity dichotomy that was brought up in the discussion of device selection, where an increase in capability will often bring with it a corresponding increase in complexity that will normally increase the uncertainty associated with using it. Since the goal of design assurance is essentially to reduce uncertainty to manageable levels, any device, tool, design feature, or design method that can increase the potential for uncertainty is at odds with this goal and needs to be dealt with in a deliberate manner. In this case, since there is no alternative to using a synthesis tool when designing with PLDs, and since the synthesis tool is an integral part of the design process, some other means of mitigating this type of effect (this is just one example of who knows how many such features exist in these highly complex but capable tools) must be brought to bear to maintain some degree of control (and therefore certainty) over the implementation of the design.

Synthesis tools are quite adept at recognizing certain code constructs and design techniques and then applying their technical magic to improve on them, often without the knowledge (or consent) of the designer. In the vast majority of cases this assistance is useful, but when considered within the context of high reliability design, the rare instances of fault introduction like this example are still frequent enough to be of concern, and should be anticipated where possible and then mitigated through defensive design techniques. Thus, if a circuit has any potential to be selected by a tool to be “improved,” or if the application of Roy’s Rule number 6 indicates that the circuit topology has any potential for inherent failure modes, it should be designed in a way to be bulletproof or to minimize the chance that the tool will recognize it.

Common circuit topologies, such as counters and finite state machines, are the most likely to be modified by synthesis tools. The essence of this problem is that the synthesis tool recognizes the circuit and acts on it in an attempt to optimize it, but does so at the expense of function and robustness. Rather than relying solely on tool switches to preemptively disable optimization features whose behavior cannot be fully predicted, the approach contained in Roy's Rule number 3 can be used to keep the tools from recognizing and then modifying our circuits: design the logic without behavioral HDL, which leaves the actual logic design to the designer rather than to the synthesis tool. This approach can work well with functions that are normally designed with behavioral HDL, such as finite state machines and complex math functions such as multipliers. However, for circuits such as this example, where the logic was already designed at the RTL level, another option is to design the circuit in a way that minimizes the chance that the tool will recognize it for what it is, or as an alternative, to modify the circuit topology so that the circuit no longer behaves like its standard form.

Figure 6.13 is the logic diagram for a modified version of the divide by three counter that breaks the terminal count feedback path with a register, which can reduce the probability that a synthesis tool will recognize the circuit as being a modulo counter. The addition of the register also alters the behavior of the counter such that, while it still counts in the normal manner, it does not rely on asynchronously feeding back the terminal count.

FIGURE 6.13 Modulo Counter with Broken Feedback Path
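A sketch consistent with this modification follows; the exact decode registered in Figure 6.13 may differ, so this reconstruction registers an early decode of the count so that the modulo-three sequence is preserved (all names are assumptions):

```vhdl
-- Reconstruction of the broken-feedback modification: the clear is no
-- longer fed back combinatorially from the terminal count but comes
-- from a register that decodes the count one clock early.  The actual
-- decode in Figure 6.13 may differ; names are assumptions.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity div3_reg_fb is
  port (clk : in  std_logic;
        q   : out unsigned(1 downto 0));
end entity;

architecture rtl of div3_reg_fb is
  signal count    : unsigned(1 downto 0) := "00";
  signal clear_ff : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Register the decode one count early, breaking the direct
      -- terminal-count feedback path.
      if count = "01" then
        clear_ff <= '1';
      else
        clear_ff <= '0';
      end if;

      if clear_ff = '1' then
        count <= "00";     -- sequence remains 00, 01, 10, 00, ...
      else
        count <= count + 1;
      end if;
    end if;
  end process;
  q <= count;
end architecture;
```

The count sequence is unchanged (00, 01, 10), and the unused count 11 still rolls over to 00 on the next clock, but the topology no longer matches the textbook modulo counter pattern the tool was keying on.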

When this circuit is processed by the synthesis tool using the same settings as the original design, the result is the netlist shown in Figure 6.14. It is immediately apparent that this circuit is identical to the expected synthesis result shown in Figure 6.11 except for the addition of the delay register in the feedback path. Analysis of its operation results in the truth table shown in Figure 6.14, showing that the circuit’s operation is also identical to the original design. Thus, with the addition of the register this counter was able to avoid the problematic feature of the synthesis tool and retain its original (and safe) functionality.

Note that some synthesis tool manufacturers have come to recognize this concern and have introduced tools with a “safe” option that bypasses the optimization features that can introduce fault modes. This is a very welcome feature for those who work with high reliability designs. However, even with this nice addition it still behooves designers to continue to practice safe and defensive design techniques.

This example illustrates the problems that can occur if designers are too willing to trust their tools. Trusting their tools is, in fact, the first impulse of most designers. One insidious reality of tool-induced failure modes is that most, if not all, of them are latent and will not be revealed through normal requirements-based verification. The designs will function as intended and still meet their requirements, and so will pass all requirements-based tests. Finding these tool-induced failure modes requires additional robustness verification that targets the circuit types that can harbor potential failure modes, such as finite state machines and counters.
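Such robustness verification might force the unused count directly and check for recovery. The sketch below assumes a divide-by-three entity named div3 with an internal two-bit signal count (both names are assumptions), and uses VHDL-2008 external names with force/release; the hierarchical path and tool support will vary:

```vhdl
-- Hypothetical robustness testbench: drive the counter into its unused
-- state and verify that it recovers.  Assumes an entity "div3" with an
-- internal signal "count"; uses VHDL-2008 force/release, which not all
-- simulators support identically.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tb_div3_robust is end entity;

architecture sim of tb_div3_robust is
  signal clk : std_logic := '0';
  signal q   : unsigned(1 downto 0);
begin
  clk <= not clk after 5 ns;

  dut : entity work.div3 port map (clk => clk, q => q);

  stimulus : process
  begin
    wait for 40 ns;
    -- Force the unused count via a VHDL-2008 external name.
    << signal .tb_div3_robust.dut.count : unsigned(1 downto 0) >>
      <= force "11";
    wait until rising_edge(clk);
    << signal .tb_div3_robust.dut.count : unsigned(1 downto 0) >>
      <= release;
    wait until rising_edge(clk);
    wait for 1 ns;
    assert q /= "11"
      report "counter failed to recover from unused count"
      severity error;
    wait;
  end process;
end architecture;
```

A requirements-based test would never exercise count three because the requirements only describe the legal sequence; this is exactly the kind of targeted check that reveals tool-induced latent failures.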

FIGURE 6.14 Actual Synthesis Result for Modulo-3 Counter with Broken Feedback Path

When dealing with complex tools that cannot be completely understood it is important that the designer know as much as possible about the tools, but never to trust them. In addition, designers should disable as many of the advanced features as practical, and never assume that the tools are going to correctly implement the design’s intended functionality. Designers should also be aware of any inferred functionality in their designs, such as unused states in counters and state machines, and create verification standards that account for all conceivable tool-induced errors.

The material presented here is not a comprehensive treatment of DATDP. DATDP is, in its purest or fullest incarnation, a lifestyle as well as a paradigm, and as such a single chapter cannot adequately express all of its aspects to their full extent. However, the introduction presented here can provide a starting point for a first step into that lifestyle, whether it is based on the information provided here or on an equivalent set of ideas that are customized for the organization or individual. So whether it is called Roy’s Rules, Larry’s Laws, Mark’s Mandates, or any other name, DATDP should, in some form, be an integral part of any high integrity design environment.
