
Conducting a Failure Analysis, Part 2: Nine Steps to Failure Resolution

December 1, 2020

This summary focuses on the failure analysis process, but also on its pitfalls. Every engineer can lay out a reasonable process, but those processes invariably leave out steps needed to avoid pitfalls the engineer may not be familiar with. This framework is intended to address these common pitfalls.

By Dan Harrison

When confronted with a test failure, design engineers will often think in terms of a pre-defined Failure Analysis Tree or Fault Tree, but usually with a bias rooted in their specific discipline and experience. As design engineers with specific areas of expertise, that’s their job. But this bias encourages them to leap to a root cause, skipping the first critical step in the anomaly resolution process: clearly defining the anomaly with firsthand data. Without doing this first, the team will often split onto multiple false paths. Once you let this happen, recovery can be difficult and possible only after a large expenditure of time and energy. You must start by clearly defining the anomaly[1].

1. Define the Anomaly

Design engineers invariably feel that they “know” what happened and immediately want to jump to a root cause that they “know” is the reason for the failure. But their initial presumptions about both what happened and the root cause of the anomaly usually differ from one another and are mostly wrong. This is not criticism, just experienced observation.

In the first 20 years or so of my engineering career, as I either engaged in or led these failure analyses, I observed firm agreement among several of the design engineers this early in the process only once. In that isolated case, two design engineers were sitting next to each other at adjacent test consoles and had seen the exact same data on the same or adjacent console screens. They had physically observed the data anomaly and would subsequently provide screen captures as firsthand data to prove what had occurred.

My first task is always to blunt the initial enthusiasm of a room full (or multiple sites full) of design engineers, all of whom “know” what happened and insist on skipping the first boring step that I always require. I insist on first clearly defining what happened in sufficient engineering detail that everyone will agree. This means identifying the individual(s) who physically observed the “anomaly” and having them describe it to the team with supporting data such as screen captures and other relevant firsthand evidence.

It takes 10 to 20 minutes just to quiet the dissension in the room long enough to get this started. Once the anomaly has been described with supporting data, many of the design engineers’ pet root causes will have fallen off the table. It’s a painful process, but it must be done to get the failure analysis headed in the right direction.

Scaling this process to address more complex System(s) of Systems[2] anomalies does not change what has to be done, and it must still be done using firsthand information. If it turns out that you don’t have the right people in the room, they will need to be called in or connected on a conference call. In part, you are instilling in your team a discipline that must be observed.
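For teams that like to capture this first step in a structured form, the sketch below (in Python) shows one hypothetical way to record a defined anomaly; the field names and placeholder values are illustrative assumptions, not part of any formal process or program described here.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class AnomalyRecord:
        """Hypothetical record of an anomaly as first defined by the team."""
        title: str                      # short engineering description of what was observed
        observed_at: datetime           # when the anomaly presented itself
        observers: list[str]            # individual(s) who physically observed the anomaly
        firsthand_evidence: list[str]   # screen captures, logs, and other firsthand data
        agreed_description: str = ""    # wording the whole team has agreed to
        team_concurrence: bool = False  # set True only after every discipline concurs

    # Usage sketch with placeholder values: the anomaly is not "defined" until the
    # agreed description and team concurrence are recorded.
    record = AnomalyRecord(
        title="Placeholder: signal dropout observed at a test console",
        observed_at=datetime(2020, 12, 1, 14, 30),
        observers=["Console operator A", "Console operator B"],
        firsthand_evidence=["screen_capture_001.png"],
    )
    record.agreed_description = "Placeholder wording agreed to by the full team."
    record.team_concurrence = True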

2. Review or Create and Expand the Fault Tree

Once the anomaly resolution team has a consistent view of what occurred and how the anomaly actually presented itself, a review of an existing generic fault tree for this UUT (together with a description of the test set up) may help get the creative juices flowing. There are many engineering disciplines represented in this conference, and all must be brought up to speed. Do not allow this discussion to be curtailed by engineers still wanting to jump to their favored root cause.

As the Senior Launch Controller for a major aerospace developmental flight test program begun in the 1980s, I was the first to exercise the newly created anomaly resolution team[3]. Our anomaly resolution team’s audio conference was connected to two offsite laboratories and multiple support activities. We had a six-hour launch window for this mission and had rehearsed this anomaly resolution process several times in preparation for the next launch. In spite of this, a handful of lead engineers objected to the process and disrupted the start of the meeting for at least 15 minutes until the last two were pulled out of the conference room by their respective managers. The last one was escorted off site, never to return.

We successfully launched within the launch window and continued to successfully launch the remaining 12 missions within their prescribed launch windows in spite of multiple anomalies on several launches in this highly successful development program.

During review of the fault tree, many potential causes of the test anomaly will be identified and either left open as assignable actions or closed with rationale. All substantial possible causes, whether closed by the team or not, should be captured by a designated engineer for update of the fault tree and for documentation of this troubleshooting process.

Postulated potential failure modes that have been shown not to contribute to the root cause will be retained in the revised fault tree if substantial, because they may still spark new ideas should this process extend over several weeks or months. The revised fault tree should be presented periodically for this purpose. Who will be capturing these data?
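For teams that maintain the fault tree electronically, the sketch below shows one hypothetical way to represent a branch so that closed items are retained with their rationale rather than deleted; the structure and names are assumptions for illustration, not a prescribed format.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class FaultTreeNode:
        """Hypothetical fault tree branch: a postulated cause, its status, and sub-causes."""
        description: str
        status: str = "open"             # "open", "assigned", or "closed"
        rationale: Optional[str] = None  # documented rationale, required when closing
        owner: Optional[str] = None      # engineer holding an assigned action, if any
        children: list["FaultTreeNode"] = field(default_factory=list)

        def close(self, rationale: str) -> None:
            # Closed branches are retained, not deleted, so they can still spark ideas later.
            self.status = "closed"
            self.rationale = rationale

        def open_branches(self) -> list["FaultTreeNode"]:
            # Collect the branches still under investigation; closed branches stay in the tree.
            if self.status == "closed":
                return []
            found = [self]
            for child in self.children:
                found.extend(child.open_branches())
            return found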

3. Review the Ground Rules for Developing the Test Plan

The Test Director or designee will present a short discussion on the process for developing the test plan. Ideally, these ground rules will already have been ingrained in the test team. In addition, an update should be provided periodically addressing:

  1. What testing has been authorized for the test cabling and test set up and what is completed or still ongoing,
  2. A status of any new developments from the test team, and
  3. An overview of the steps in the anomaly resolution process.

This briefing should be focused and short.

4. Begin Development of the Test Plan

After a few promising possible causes (usually two to five) are identified, the Test Director will make the decision to refocus the team on fleshing out and prioritizing these first few possible causes. The intent is to clarify the failure mode and to lay a foundation for development of a test plan for each of these possibilities. Once the test plan developer(s) have an understanding of what’s desired, the Test Director will focus the team on the questions those developers are asking. This process will be repeated for each of the identified possible causes within this first group.

Make sure that everyone knows that the anomaly resolution team will continue to investigate other possible causes of the failure once this first high priority group is turned over to the test team to finalize a test plan, obtain approvals and execute. Both the test team and the anomaly resolution team will continue to work in parallel and in concert.

5. Define Risks and Prioritize According to the Ground Rules

This one will be difficult, although it does not have to take much time! Both the risk of damage to the hardware and the risk of compromising the cause of the test failure by disturbing the configuration prematurely must be considered. These risks will be the primary determinants of the sequence of execution of the test plan.

Note that for this small group of possible failure modes, prioritization of the order of testing will not be based only on an estimation of the most likely cause of the failure. This is why the Test Director will determine the size of this first group. The order of the test plan steps will be based on the ground rules including:

  1. Starting the testing for each possible failure mode with the test cabling and test set up,
  2. The internal UUT cabling to the extent practical, and only then,
  3. The full UUT.

In addition, the UUT testing sequence will be based on:

  1. The desire not to alter the test configuration sooner than necessary, to reduce the chance of confounding the cause of an anomaly or failure, and
  2. The need to exonerate all accessible test and UUT cabling and the full test set up first, to reduce the potential for damage to the UUT.

Some of this work may have already been done for the test cabling and test equipment. Whether or not that is true, many design engineers may try to insist on proceeding directly to the full UUT, ostensibly to save time. This violates all of these ground rules: it increases the risk to all hardware, including the UUT; it may compromise identification of the root cause by changing the test configuration earlier than necessary; and it risks propagating the anomaly and producing an unidentified change in the test results.

There will likely be some test requirement commonalities within this group, such as common test cabling and test equipment. Combining some or all of the lists of cabling and/or test equipment for the failure modes being considered may save some time. The Test Director will consider this for efficiency of the overall test process.

Starting with a small group of possible failure modes is also preferred for other practical reasons: it gets the testing started in parallel with the anomaly resolution team’s efforts, and it leaves open the possibility of completing this troubleshooting process early once the cause of the failure is identified.
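To make the ordering logic concrete, the sketch below encodes the ground rules as a simple sort: test cabling and test set up first, then internal UUT cabling, then the full UUT, with ties broken by whether a step disturbs the as-failed configuration and by its potential for hardware damage. The ordering comes from the ground rules above; the scoring fields and the damage-risk scale are assumptions for illustration.

    from dataclasses import dataclass

    # Lower rank means tested earlier, per the ground rules above.
    TARGET_RANK = {"test_cabling_and_setup": 0, "uut_internal_cabling": 1, "full_uut": 2}

    @dataclass
    class CandidateTest:
        failure_mode: str
        target: str                 # one of the TARGET_RANK keys
        alters_configuration: bool  # would this step disturb the as-failed configuration?
        damage_risk: int            # 0 (benign) to 3 (could damage the UUT); hypothetical scale

    def test_order(candidates: list[CandidateTest]) -> list[CandidateTest]:
        # Ground-rule target first, then configuration-preserving steps, then lowest damage risk.
        return sorted(
            candidates,
            key=lambda c: (TARGET_RANK[c.target], c.alters_configuration, c.damage_risk),
        )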

6. Develop, Approve and Execute the First Increment of the Test Plan

The test team can now work in parallel with the anomaly resolution team on this first set of test objectives. But the Test Director can expect the test plan developers to have suggestions and questions for the anomaly resolution team. The point of contact will normally be a test support representative on the anomaly resolution team, a designated system engineer, and/or a design engineer specific to the failure mode being addressed.

The Test Manager, Quality, the Test Director, and others such as Safety, as internal procedures specify, will be required to approve this first increment of the troubleshooting plan before implementation. Remind everyone that the anomaly resolution team will continue to investigate other possible causes of the failure, with additional test plan development to follow, once the first high priority group of possible failure modes is turned over to the test team.

7. Continue with the Fault Tree Analysis

While the test team is executing the first increment of the failure analysis plan, the anomaly resolution team will continue its investigation of other possible causes of the failure. The test team will feed constant updates on the results of troubleshooting back to the anomaly resolution team. With these updates, the fault tree will be revised, and Steps 5 and 6 will be repeated until the root cause has been identified and verified. The next step is to develop corrective actions.
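Schematically, Steps 5 through 7 form a loop. The sketch below assumes the hypothetical FaultTreeNode from the earlier sketch, and the helper callables stand in for the real engineering work of developing, approving and executing each test increment; none of this is a prescribed implementation.

    def resolve_anomaly(fault_tree, develop_increment, execute_increment, root_cause_verified):
        # Repeat prioritization (Step 5) and incremental testing (Step 6) until the
        # root cause is identified and verified (Step 7).
        while not root_cause_verified(fault_tree):
            branches = fault_tree.open_branches()
            if not branches:
                break  # nothing left open; revisit the fault tree with the full team
            increment = develop_increment(branches)  # small, prioritized group of tests (Steps 4-5)
            results = execute_increment(increment)   # test team executes in parallel (Step 6)
            for node, outcome in results:
                if outcome.exonerated:
                    node.close(outcome.rationale)    # retain the closed branch with its rationale
        return fault_tree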

8. Develop and Verify Corrective Actions

Corrective actions, even for a design error, may involve much more than a design change. The design may be found to be acceptable with software changes. Or the problem may be found in manufacturing processes without requiring any design changes. Finally, the concept of operations (CONOPS) or the way the test itself was performed may be found to require revisions.

As a consultant I was recently asked to lead a client’s failure analysis efforts for a prototype Unmanned Air Vehicle (UAV) that had crashed and been largely destroyed in a flight test. This proved to be a difficult failure analysis, driving us right up to the planned Critical Design Review (CDR) five months later. But in addition to identifying and resolving the contributing causes of this failure, our team’s efforts identified several improvements in the selection of components, in hardware and software design, and in the CONOPS for the conduct of flight testing for this and future UAV developments.

Corrective actions will become the responsibility of the affected functional areas, but identifying those functional areas and getting them involved should be initiated by the anomaly resolution team. At this stage, the cause of the anomaly or failure will be best understood by those on the anomaly resolution team. And with their broader range of experience and skills, this team may be able to contribute out-of-the-box solutions beyond the functional areas most immediately affected.

In addition, beyond resolving the immediate anomaly or failure, the solutions developed may present opportunities for other ongoing programs and future developmental programs. This needs to be captured in the documentation, executive briefings, and design reviews.

9. Document Results

Documentation is not an easy task for a complex failure analysis. Maintaining good documentation throughout the process is the only way to build a complete picture describing how the root cause was determined and corrected for future applications on this and other programs. Who will be responsible for pulling this together? Often, it’s the leader of the troubleshooting effort by default. Plan ahead for this responsibility.

Conclusions

Leading a troubleshooting effort is difficult for a lot of reasons. One individual cannot possess detailed knowledge of all the disciplines that may need to be brought to bear to resolve a complex effort such as this UAV failure or any System(s) of Systems test failure. The leader of this effort must rely on the team for the detailed work while always keeping in mind multiple objectives.

The disciplines involved in a troubleshooting effort are not just the many detailed design disciplines, but also software development, test discipline, subcontractor and supplier forensic analysis, models and simulations, failure modes and effects analysis, manufacturing capabilities and methods, quality engineering and quality assurance, safety, and many others. This places the lead for any significant troubleshooting effort in a supervisory role even if he or she is not a supervisor. This should be a consideration for management when selecting the lead.

The leader’s responsibilities do not end with identification of the root cause. Corrective actions are very much a part of the lead’s responsibilities, even if only to kick them off with the assistance of other functions. Documentation must also be maintained throughout the process, often by default by this lead.

The leader should always look for synergy of effort. Finding the root cause of an anomaly and developing its corrective actions are related efforts and can often be addressed together; the same is true for team morale. Avoid allowing anyone to assign fault during this process.

Beyond this program, other programs may be impacted by the corrective actions identified by this team. Think ahead to how what has been learned can be transferred to other efforts within the organization.

Now you’re ready to prepare an executive briefing, address the results of the failure analysis with customers and suppliers, complete formal design reviews and improve morale by focusing on what’s been learned, not what failed.

 

Dan Harrison is an SMA Principal Associate in our Technical Management & Engineering Services Practice, and has over 35 years of experience in aerospace engineering. You can read Part 1 of this article here.

If you’re building a team and you have positions you can’t fill, you need to use SMA Talent on Demand (TOD®)! With TOD®, you can find experienced talent, such as Dan, matched to your exact needs.

 

[1] The meaning of the word “anomaly” as used by the Anomaly Resolution Team is broadened to include all anomalies and failures.

[2] The INCOSE System(s) of Systems Primer is available for free download from the INCOSE Store at their website: https://www.incose.org/products-and-publications/sos-primer

[3] The Peacekeeper Anomaly Team (PAT) was established with training of the launch team by injection of anomalies for resolution during mock launch scenarios. This was the LGM-118 Peacekeeper (MX) Combined Developmental Test and Evaluation (DT&E)/Operational Test & Evaluation (OT&E) Flight Test Program. The last Peacekeeper was taken out of operational service on 19 September 2005.
