As you have already learned, a key objective of deploying a prognostic‐enabled system is to monitor prognostic targets and provide advance warning of failures in support of condition‐based maintenance (CBM). There are criteria associated with, for example, equipment availability and other metrics, test coverage, and confidence levels. To meet those criteria, the various sensing, signal‐processing, and computational (algorithm) routines in a prognostics and health management/monitoring (PHM) system must be factored into the entire design. CBM methods and approaches – especially those in which condition‐based data (CBD) signatures are transformed into functional failure signature (FFS) data and processed by an accurate prognostic information program – provide significant advantages over (i) systems based on statistical or other methods that apply to populations rather than to a specific instantiation of a population and (ii) systems that use CBD to detect damage without prognosing when such damage will cause the system to stop operating within specifications.
Chapter 6 presented the design of an exemplary prototype PHM system that prognostic‐enabled multiple instantiations of systems and prognostic targets, with excellent results (see Table 7.1).
Table 7.1 Performance measurements and metrics.
Prognostic target | PITTFF0 | FFS NM | BD @ time (estimated) | EOL @ time (estimated) | EOL FOM | PD maximum | PH to 25% SoH: time, χ [%]–pts [#] | PH to 10% SoH: time, χ [%]–pts [#] | Initial PH error = PITTFF0/PDMAX
SMPS | 4800 h | 3% | 1368 h (1261 h) | 4200 h (4176 h) | 96.4% | 2939 h | 1560 h, 93.5%–9 pts | 2760 h, 52.6%–23 pts | 63%
EMA load | 4800 h | 2% | 504 h (441 h) | 3168 h (3164 h) | 97.7% | 2727 h | 576 h, 97.3%–3 pts | 1368 h, 68.3%–19 pts | 76%
EMA winding | 4800 h | 2% | 1104 h (1067 h) | 2760 h (2745 h) | 97.8% | 1693 h | 1320 h, 87.2%–10 pts | 1680 h, 66.0%–25 pts | 184%
EMA power transistor | 4800 h | 1% | 960 h (958 h) | 4440 h (4434 h) | 99.9% | 3476 h | 984 h, 99.3%–2 pts | 1512 h, 84.1% | 38%
The design supported two subsystems, each comprising prognostic targets: a power supply and two electro‐mechanical actuators (EMAs). The design included monitoring each of the two power supplies for a single failure mode and monitoring each of the four EMAs for three failure modes. The monitoring, conditioning, and processing were all based on CBD signatures, and the prognostic approaches and methods produced excellent, if not superior, results.
Electronic health solutions, such as those described in this book, become part of a PHM system, sometimes referred to as a prognostic ecosystem (see Figure 7.1), within which such solutions can be categorized at levels as shown in Figure 7.2: die, component, board, module, and system (Ridgetop Group 2018). Other levels could be added, such as an assembly of boards or a collection of modules into a replaceable unit.
An ecosystem can be described as prognostic models within a system that includes descriptions of data, quantification of uncertainty, justification and validation of model selection, and limitations of application (Astfalck et al. 2016). The locations in the broader view of an ecosystem shown in Figure 7.1 are the following: location 1 is a system or subsystem comprising one or more line‐replaceable units (LRUs) that are prognostic enabled (monitored for damage and/or degradation); location 2 is a PHM system that acquires, manipulates, manages, and processes data to produce prognostic information used to initiate service and maintenance actions; location 3 is where failures are analyzed and products are improved by a supplier; location 4 comprises repositories of LRUs, assemblies, components, and devices used for service and repair; and location 5 comprises the maintenance personnel who perform service and maintenance.
A complex PHM system contains devices, components, boards, subassemblies, and so on. A sensor is attachable to any node within a system, and therefore health solutions that process sensor data can be categorized in accordance with the node to which the sensor is attached, as exemplified by the five‐level model of health solutions shown in Figure 7.2 (Ridgetop 2018).
Critical systems are vital to the ongoing operation of everyday life, and criticality is a key consideration when evaluating and selecting a node for prognostic enabling (a prognostic target). For example, a power system in an aircraft, or a gearbox in a wind turbine, would be considered critical, since its operation is essential to meeting the design objectives of the overall system. Another dimension of criticality is the safety of life and health: preventing loss of life and injury is a primary objective of a system. Fault severity and fault propagation also play a role in the definition of systemwide criticality.
Advance warning, such as an alert, of any impending failure of a mission‐critical or safety‐critical prognostic target is vital: a properly designed PHM system will provide detection of anomalies that affect the ability of the system to operate within specifications and issue appropriate alerts. For example, Chapter 6 included examples of messages and alerts issued by an exemplary prototype PHM system. A PHM system will issue alerts (health monitoring) and/or initiate appropriate actions (health management) such as soft shutdowns, load shedding, and scheduling maintenance. There might also be various levels of alerting where threshold levels and fault models can be used to prioritize what information is available to an operator of an aircraft, or a seagoing ship, or a machine tool on the manufacturing floor. This brings in the notion of fault severity, access to the information, and what is done to mitigate an issue that results in an alert.
To save money and resources, maintenance intervals can be optimized based on actual evidence of degradation. For example, a system might have components that fail after about 250 hours of operation, others that fail after about 500 hours, and still others that fail after about 600 hours. A usage‐based PHM system might be designed to do one of the following:
In a first design of a PHM system, an objective might be to avoid all unexpected failures, but at increased maintenance costs: for example, an average 300 hours of lost usage for each instance of avoidance, and cost increases due to increased maintenance actions (more frequent replacement). A second design might focus on reducing sustainment costs by increasing the time between maintenance actions – but unexpected failures would increase. Typically, even disregarding mission and safety issues, the cost of an unexpected failure is higher than an early repair‐and‐replace action.
So, instead of a usage‐based PHM system, we advocate CBM using a PHM system that is CBD‐based; uses signature‐based detection and prognostic approaches and methods; and employs a fast, highly accurate set of data‐conditioning, prediction, and computational routines. The advocated approach of using CBD signatures is an effective method for handling variability introduced by the operational environment: operating equipment in the desert of Arizona is very different from operating it in a rainy, cold environment such as Puget Sound, Washington.
This book is focused on prognostic enabling to monitor the health of a system of nodes. The prognostic targets are chosen because those nodes have signals that change in response to degradation of devices, components, and so on whose failure has a critical effect on the operation of the system: they may cause mission‐critical functions to cease or otherwise operate out of specifications, or they may create a hazardous threat to the safety of the system or of life.
But monitoring, per se, does not avoid unexpected outages, does not repair anything, and does not prevent loss of life. Refer back to Figure 7.1: a PHM system needs to provide services for health management, maintenance, and logistic support to schedule maintenance, locate and deliver parts and equipment, and dispatch a maintenance team.
Given the accuracy of the prognostic information produced by the PHM system in Chapter 6, it would not be unreasonable to defer maintenance until a detected state of health (SoH) value falls to or below a specified level, such as 25%. PHM management support might be designed to act on alerts, such as those shown in Figure 7.3, which are excerpted from Chapter 6. Alternatively, PHM management support might be designed to act on damage‐detected alerts, such as those shown in Figure 7.4.
A PHM system needs to alert users when maintenance is required – based on physical evidence of degradation, not on an arbitrary number of elapsed hours. In addition to alerts, a PHM system needs to provide for and support maintenance‐related services to avoid unnecessary replacements, increase usage of systems, decrease downtime, and reduce sustainment costs. The approaches and methods used for maintenance are application specific, need to be integrated with health management and logistic services, and are beyond the scope of this book.
A critical function of a PHM system is logistics support. Parts and equipment must be located and delivered to the service and repair site, and a service and maintenance team needs to be dispatched to arrive on or after, but not before, the arrival of needed parts and equipment. Additionally, maintenance and inventory records need to be updated; and suppliers, vendors, and manufacturers must be notified per contractual obligations. The latter is especially true when dealing with a government agency such as the Department of Defense. Logistic support might also be required to arrange for and record the outcome of ancillary activities such as cause‐and‐effect review of repairs.
The previous chapters focused on PHM aspects deemed critical to the design of a PHM system, including approaches and methods not suitable for CBD‐based prognostics. The overall objective of this book is to provide you with the knowledge to understand, evaluate, design (at least at a high level), and verify health monitoring. This chapter's introduction has briefly covered ecosystems, critical systems and warnings, reduction in maintenance, and health management. The remainder of this chapter is devoted to the evaluation, selection, and specification of prognostic targets: nodes to be monitored to detect damage and provide prognostic information, so as to avoid unscheduled outages of critical functions and loss of safety in a system.
The remainder of this chapter is organized to present and discuss topics related to prognostic enabling:
This section includes descriptions of the meaning and relationship of TTF, time before/between failure (TBF), prognostic distance (PD), and prognostic horizon (PH); distributions of the onset of degradation and functional failure; mean time to failure (MTTF); and mean time before/between failure (MTBF).
This section is devoted to the cost‐benefit analysis of prognostic approaches and includes example comparisons of no PHM, two usage‐based approaches, CBD‐based detection, and CBD‐based prognostics.
This section is devoted to the bathtub curve, prognostic triggers, and the relationship of the bathtub curve to failure rate and MTBF.
This section summarizes and ends both the chapter and the book.
Selecting a target to be prognostic enabled probably seems pretty straightforward: collect and analyze historical records pertaining to maintenance and repair to identify those targets having high rates of failure and/or failure of mission‐critical and/or safety‐critical parts regardless of failure rate. Prepare a cost‐benefits business case: cost to replace or repair, cost associated with unplanned failure, savings due to prolonged time in use, savings due to reduction in sustainment costs and unplanned downtime, and so on. But you also need to factor in a hard‐to‐quantify cost related to criticality of mission and/or safety (refer back to Section 1.8).
Because you are designing a PHM system to prognostic enable an operational system, your team will not perform any traditional failure mode and effect analysis (FMEA) or failure mode effect and criticality analysis (FMECA) (DAU 2018). Instead, your team will review the existing FMEA and FMECA data; the historical failure, service, and repair data; and any other data related to failure. The focus is on identifying, selecting, and winnowing prognostic targets. To select and winnow a list of prognostic candidates, including those identified as candidates by FMEA/FMECA, you need to know the following:
PHM systems are typically referenced to either MTBF or MTTF, and therefore you should know and understand the difference between them (Speaks 2005):
As you can see, the meaning of (and therefore the use of) these terms is dependent upon the definition of failure and the definition of repairable. To achieve an understanding of those terms, recollect that Chapter 2 introduced failure in time (FIT) in Eq. (2.15):
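The equation itself does not survive in this excerpt. A standard form of the FIT computation from accelerated-test data – a sketch consistent with the surrounding text, not necessarily the book's exact Eq. (2.15) – is:

```latex
\mathrm{FIT} = \frac{n_f \times 10^{9}}{N \, t_{\mathrm{test}} \, AF}
```

Here n_f, N, and t_test (the failure count, the number of devices on test, and the test duration in hours) are hypothetical symbol names introduced for this sketch,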
where AF is the value of an acceleration factor for a specified test. Refer to Tables 2.3 and 2.4 for examples.
But you are not given a failure rate: instead, you are told that the FIT number is 50. Now you need to know the following to relate that FIT number to a failure rate (Ellerman 2012; NIST 2018):
where 1 FIT = 1 failure per 10⁹ device‐hours.
Even though your research confirms that your calculation is correct, because the calculated rate of failure is so large, you decide to calculate an MTBF (mean time before failure) value using Eq. (7.1) (Abernethy 2006; RAC 2005; Speaks 2005; Weibull 2008):
But this does not help, because you do not know the total time, which you know is calculated using Eq. (7.3):
You also don't know the number of tested units or the test time. So, you find another expression for MTBF
which, for FIT = 50, equals the reciprocal of the failure rate λ calculated in Example 7.4: MTBF = 10⁹/50 = 20 000 000 hours.
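The chain of reasoning above can be sketched in a few lines of Python. This is a sketch under stated assumptions: Eq. (7.1) is paraphrased as MTBF = total operating time / number of failures, Eq. (7.3) as total time = units × test hours, and the FIT-based expression as MTBF = 10⁹/FIT; the function names are mine, not from the book.

```python
FIT_SCALE = 1e9  # 1 FIT = 1 failure per 10^9 device-hours

def failure_rate_from_fit(fit):
    """Failure rate lambda, in failures per hour, for a given FIT number."""
    return fit / FIT_SCALE

def mtbf_from_fit(fit):
    """MTBF (hours) as the reciprocal of the FIT-based failure rate."""
    return FIT_SCALE / fit

def mtbf_from_test(n_units, test_hours, n_failures):
    """Paraphrase of Eqs. (7.1)/(7.3): total operating time / failures."""
    return (n_units * test_hours) / n_failures

print(failure_rate_from_fit(50))  # 5e-08 failures/hour
print(mtbf_from_fit(50))          # 20000000.0 hours
```

For FIT = 50 this reproduces the 20 000 000-hour figure in the text, which is exactly why such numbers are of little use for prognostics.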
Literature research reveals the following:
where MTTR is defined as “mean time to repair.”
Regardless of meaning and/or definition, neither MTTF nor either of the two definitions of MTBF is useful for prognostic enabling. MTTF and MTBF should be limited to a classical definition of reliability (Section 1.6): without intervention, there is a 63% probability the system will fail before the time given by the MTTF or MTBF value.
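The 63% figure follows from the exponential life distribution classically assumed with MTTF/MTBF: the probability of failure by time t is 1 − exp(−t/MTBF), which at t = MTBF is 1 − e⁻¹ ≈ 0.632. A minimal Python check (the function name is mine):

```python
import math

def prob_failed_by(t, mtbf):
    """P(failure by time t) under an exponential life distribution."""
    return 1.0 - math.exp(-t / mtbf)

# At t = MTBF, about 63% of a population has already failed:
print(round(prob_failed_by(3592.0, 3592.0), 3))  # 0.632
```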
Figure 7.5 illustrates the relationship of a failure distribution (density of failures/time), the MTBF (failure rate between failures), the MTBF (mean time before a first failure), and MTTF: MTTF and MTBF were originally defined for an exponential distribution having a constant, low failure rate: for example, solid‐state (integrated circuit) devices. Those devices are subjected to one or more accelerated tests, such as a HALT, with test results extrapolated to normal life using an AF (Ellerman 2012; O'Connor and Kleyner 2012; NIST 2018; RAC 2005; Speaks 2005; Wilkins 2002).
Typical commercial FIT values for solid‐state devices are in the 50–1000 range, with FIT values in the range of 1–10 for space applications (Johnston 2010). There are simulators that calculate values called MTTF and MTBF using simulated failure times (Weibull 2008), which adds even more uncertainty to the meaning and calculation of a particular MTTF and/or MTBF value.
Even worse, different failure distributions, different CBD signatures, and so on can result in identical (or nearly identical) reliability metrics such as MTTF: compare Figures 7.5 and 7.6. Finally, this book asserts that attempting to use a TTF value of hundreds of thousands of hours (or more) for CBD‐based prognostics is nonsensical: an MTTF value of 100 000 hours is equivalent to more than 11 years.
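The point that very different failure distributions can share an identical MTTF is easy to demonstrate numerically. The sketch below is my construction, not the book's Figures 7.5 and 7.6: it tunes a sharply bunched Weibull distribution (shape = 4) to the same mean life as a wide-spread exponential one (Weibull shape = 1).

```python
import math

def weibull_mean(scale, shape):
    """Mean life (MTTF) of a two-parameter Weibull distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)

target_mttf = 3592.0
# Exponential life (shape = 1): the scale parameter IS the MTTF.
wide_spread = weibull_mean(target_mttf, 1.0)
# Tightly bunched life (shape = 4) tuned to the same mean:
bunched = weibull_mean(target_mttf / math.gamma(1.25), 4.0)
print(round(wide_spread), round(bunched))  # 3592 3592
```

Both populations report MTTF = 3592 h even though their failure times are spread very differently, which is why MTTF alone cannot drive a prognostic.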
We need to know how to determine, calculate, and/or estimate a TTF value that begins when degradation begins and ends when functional failure occurs (see TTF1 and TTF2 in Figure 7.5). But at the time when degradation is first detected, there is no a priori knowledge of that future time of failure; yet our prediction program needs to converge from an initial estimate of that time to a very accurate estimate of the time of failure. Research into reliability metrics such as MTTF, MTBF, and FIT indicates that they are not close in value to what we need for TTF.
The prediction program we are using, ARULEAV, provides a parameter called PITTFF0 (introduced in Chapter 6) as a means to specify an initial value for TTF. TTF is not a value that is examined or specified by manufacturers and/or vendors of products; failures in the field are usually due either to an anomalous event, such as a lightning strike, or to degradation. Degradation typically is not caused by a part entering what is referred to as the wear‐out region of a bathtub curve; rather, it is typically due to an accumulation of fatigue damage caused by cyclic stresses and strains (such as thermal and mechanical) during operation (Hofmeister et al. 2006).
We can estimate an initial value for TTF using a number of methods: a service‐life determination, an end‐use test method, or an MTTF‐based method. Be aware, though, that the supplier of the prediction program advises that, in general, the program converges to within 25% accuracy in less time when the initial estimate is higher, rather than lower, than the true time of functional failure.
Instead of using MTTF or MTBF values for an extremely low failure rate, you might use end‐use values based on service‐life values from vendors. For example:
Set the PITTFF0 parameter to twice the service‐life‐determined value for TTF, for three reasons:
You can calculate TBF and TTF values in the same way as MTBF and MTTF values, respectively, using the following (Weibull 2008):
Referring to Figure 7.5, a simplistic MTTF‐based method is to set TTF equal to MTTF; that method works well when the spread of the majority of the failures is wide compared to the value of MTTF. In such cases, simply setting PITTFF0 to twice the value of MTTF suffices. But if you are fairly confident that the situation is more like that illustrated in Figure 7.6, you need to specify a lower value for PITTFF0. However, it might be the case that your PHM system supports the same type of LRUs in two distinct operating environments: one that induces earlier‐than‐expected failures (akin to the situation shown in Figure 7.5) and a second that is less variable and causes failures to be more closely bunched together (akin to the situation shown in Figure 7.6). Further, to avoid misunderstanding and/or for procedural reasons, suppose you must always set PITTFF0 to a value (such as MTTF) specified by a manufacturer, vendor, or governmental agency. In such situations, you need a method to cause the prediction program to adjust the specified PITTFF0 value. The supplier of your prediction program agrees, and changes are made to provide a node‐definition parameter, PITTFADJ, that allows you to adjust how the value of PITTFF0 is handled:
We will use the SMPS‐EMA examples from Chapter 6 as a base platform to construct an example situation to illustrate the cost‐benefit for various approaches. You and your customer agree that cost‐benefit analyses are not to include unavoidable catastrophic failures (such as, for example, being hit by another vehicle) and that, because of criticality considerations, repair‐and‐removal activity due to unexpected functional failures will be held to less than 5% of the total number of repairs and removals.
The situations are the following: (i) none; (ii) usage‐based, MTTF; (iii) usage‐based, 2/3 MTTF; (iv) CBD‐based, replace within 720 hours after damage is detected; and (v) CBD‐based, replace within 120 hours after estimated SoH becomes 75% or less. Although the first situation fails to meet the requirement that unexpected failures will be less than 5%, it establishes a baseline estimate of cost.
Your customer arranged for special delivery of 6 power supplies and 12 EMAs. Those 18 units were subjected to end‐use tests similar to HALTs (refer back to Section 2.2). Your team assisted in the design of the experiment and building of the test beds; the tested units (power supplies and EMAs) needed to be prognostic enabled, as described in Chapter 6. For analysis purposes, test failures are to be evaluated as though all of the tested units were installed at the same time and failed in the sequences and times indicated by the test.
The numbers to be used in a cost‐benefit analysis are provided by the customer and listed in Tables 7.1 and 7.5. After examining the test results (see Figure 7.15 and Table 7.6), your customer concludes that the cost‐benefit analysis for the power supply scenario is sufficient for evaluation of the five approaches.
Table 7.5 Cost estimates for benefits evaluation of prognostic enabling.
LRU name | Acquisition | Scheduled R&R | Unplanned failure | Expected life (h)/LRU | Sustainment period (h)/LRU
Power supply | $10 000 | $2 000 | $4 000 | 3 500 | 14 400
EMA | $25 000 | $3 000 | $6 000 | 3 500 | 14 400
Table 7.6 Summarized list of test results.
Power supply | Degradation detected (h) | Failure detected (h) | EMA | Degradation detected (h) | Failure detected (h)
Supply #1 | 1368 | 4200 | EMA #1 | 540 | 3168 |
Supply #2 | 1320 | 3624 | EMA #2 | 516 | 3024 |
Supply #3 | 1296 | 3240 | EMA #3 | 588 | 3600 |
Supply #4 | 1248 | 3048 | EMA #4 | 636 | 4032 |
Supply #5 | 1224 | 4584 | EMA #5 | 1104 | 2760 |
Supply #6 | 1416 | 2856 | EMA #6 | 1080 | 2472 |
EMA #7 | 1152 | 3192 | |||
EMA #8 | 1200 | 3624 | |||
EMA #9 | 960 | 4440 | |||
EMA #10 | 936 | 4008 | |||
EMA #11 | 1008 | 4872 | |||
EMA #12 | 1056 | 5304 |
In this scenario, the system runs until a unit fails – a “do nothing until failure occurs” approach. When failure occurs, the failed unit is removed and replaced. The system is restarted and runs until the next unit fails, and so on, until the end of a sustainment period of 14 400 hours (600 days of operation):
Repeating this for the remaining five power supplies results in 24 removals and replacements because of unexpected outages due to degradation failure, at a baseline cost of $336 000.
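Using the costs in Table 7.5, the baseline figure can be checked directly (assuming, as the text implies, that each unexpected removal consumes one new unit plus the unplanned-failure cost):

```python
# Table 7.5 costs for one power supply
ACQUISITION = 10_000       # new unit
UNPLANNED_FAILURE = 4_000  # cost of an unplanned removal event

n_unexpected = 24  # removals over the 14,400 h sustainment period
baseline_cost = n_unexpected * (ACQUISITION + UNPLANNED_FAILURE)
print(baseline_cost)  # 336000
```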
In this scenario, all power supplies are replaced when usage equals 3592 hours (MTTF). Examination of the data in Table 7.6 shows that power supplies #3, #4, and #6 would functionally fail before they are replaced. In the sustainment period, a total of 27 power supplies would be removed and replaced, at a total cost of $354 000 – an increase of $18 000 per system during the sustainment period to reduce the number of unexpected outages from 24 to 14. However, 52% of all repair and removal actions would be attributable to degradation failures. This approach fails to meet requirements.
In this scenario, all power supplies are replaced when usage equals 2395 hours. There would be no unexpected outages, but 36 power supplies would be replaced at a cost of $432 000 – an increase in cost of $96 000 and a reduction of unexpected failures to zero. That $96 000 becomes $96.0 million for a population of 1000 systems.
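The replacement intervals used in these two usage-based scenarios follow directly from the power supply failure times in Table 7.6:

```python
# Table 7.6: functional-failure times (h) for power supplies #1-#6
failure_times = [4200, 3624, 3240, 3048, 4584, 2856]

mttf = sum(failure_times) / len(failure_times)
print(mttf)                  # 3592.0 -> replacement interval, MTTF scenario
print(round(2 * mttf / 3))   # 2395   -> replacement interval, 2/3 MTTF scenario
```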
In this scenario, whenever damage is detected in a power supply, it is removed and replaced within 720 hours, which, for these supplies, also results in zero unexpected outages. This approach seems attractive because there would be no unplanned outages due to degradation leading to failure, but there would be a large increase in removal‐and‐replacement activity: a total of 43 power supplies in the 14 400‐hour sustainment period, at an estimated cost of $516 000 per system – $180 000 more than the baseline cost per system, which becomes $180.0 million for a population of 1000 systems.
In this scenario, whenever a prognostic SoH estimate is 75% or less for a power supply, it is removed and replaced within 120 hours: again, for this approach, there would be zero unexpected outages. This approach would result in the removal and replacement of 28 power supplies during the sustainment period at an estimated cost of $336 000 per system – neither a savings nor an increase in cost compared to the baseline cost of a “do nothing until failure” approach.
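The scenario costs quoted above can be reproduced from the replacement counts and the Table 7.5 costs. This is a sketch: `scenario_cost` is my name, not the book's, and the MTTF scenario's mixed scheduled/unplanned split is omitted because the text gives only an approximate percentage for it.

```python
COSTS = {"acquisition": 10_000, "scheduled_rr": 2_000, "unplanned": 4_000}

def scenario_cost(n_scheduled, n_unplanned, c=COSTS):
    """Every removal consumes a new unit; each event also incurs either
    the scheduled R&R cost or the unplanned-failure cost (Table 7.5)."""
    n_total = n_scheduled + n_unplanned
    return (n_total * c["acquisition"]
            + n_scheduled * c["scheduled_rr"]
            + n_unplanned * c["unplanned"])

print(scenario_cost(0, 24))   # 336000  (i)   run to failure, baseline
print(scenario_cost(36, 0))   # 432000  (iii) usage-based, 2/3 MTTF
print(scenario_cost(43, 0))   # 516000  (iv)  damage-detection
print(scenario_cost(28, 0))   # 336000  (v)   SoH at 75% or less
```

Note that the SoH approach matches the baseline cost exactly while eliminating all unexpected outages, which is the crux of the argument in this section.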
Yes, there is the cost of the sensors and PHM systems, but the SoH approach has the following advantages:
Sustainment costs can be further reduced when your PHM system is sufficiently accurate and reliable to let your customer defer maintenance until SoH estimates fall below 50% or even lower.
Of the five approaches in the cost analyses, two are based on CBD: the damage‐detection approach and the SoH approach. The damage‐detection approach is diagnostic in nature: it processes CBD and detects damage, and maintenance is scheduled. The SoH approach is prognostic in nature: it processes CBD, detects damage, and provides estimates of SoH that are used to trigger scheduling of maintenance. The prediction program, ARULEAV, also provides RUL and PH estimates for use in health management.
A bathtub curve, shown in Figure 7.16, is a statistical depiction of the failure rate over the lifetime of a population of electronic products. The curve depicts failure rate versus time and has three distinct regions. Beginning on the left and moving to the right:
So, there is really nothing about a bathtub curve that can be used to enable or to support CBD‐based prognostics.
As you can see in Figure 7.16, MTBF (between failures) is not a time‐axis value: it is a failure‐rate value. Neither MTTF nor MTBF is seen in a typical view of a bathtub curve, perhaps because of the relationship suggested in Figure 7.17 (Seastrunk 2016).
Also shown in Figure 7.16 is a conceptual diagram intended to convey the notion that it is possible to employ a prognostic trigger to provide advance warning of a probable failure within time PD while, at the same time, conveying a notion that useful life does not extend into the wear‐out region. Figure 7.18 conveys a more practical view of prognostic trigger points:
This is the final chapter in this book. We discussed topics related to the selection, evaluation, and other considerations of prognostic enabling. The introduction briefly touched on critical systems, advance warning, and health management. The bulk of this chapter presented a rationale for not using reliability metrics such as MTTF and MTBF; instead, a rationale was presented for using the time between the onset of degradation and the time when such degradation results in functional failure: TTF. Methods to determine or calculate a value for TTF include service life, end‐use testing, and MTTF‐based methods. A section on cost‐benefit analysis of prognostic approaches included example comparisons of the following approaches: (i) no PHM; (ii) usage‐based MTTF; (iii) usage‐based 2/3 MTTF; (iv) damage detection; and (v) SoH at 75% or less. The section after that focused on the bathtub curve and how it relates to failure distributions, MTBF, MTTF, trigger points, and CBD signatures.
By no means does this book cover the entirety of information related to PHM and CBD – conditioning, modeling, and processing for CBM. On the other hand, this book contains a wealth of information dealing with basic approaches and, importantly, CBD signatures and how to process and linearize those signatures, which lessens the burden on prediction programs and improves the accuracy of prognostic information. Chapter 6 presented the design of a hypothetical prototype PHM system to illustrate the challenges a designer might face and to demonstrate the application of the approaches discussed in this book.