Conducting a Failure Analysis: Troubleshooting Ground Rules

For over 35 years I have participated in or led engineering troubleshooting efforts ranging in scope from laboratory test failures to highly complex distributed multi-site System(s) of Systems[1] pre-launch and launch anomalies. Over the years I have often sought and failed to find good references that lay out an overall strategy for these troubleshooting efforts.

by Dan Harrison

This summary is based on many early lessons from more senior engineers. I did not originate these disciplines. I only learned, applied, and confirmed what follows through my experiences as a System Test Engineer, Launch Controller, Launch Operations Manager, Test Director, Chief Test Engineer and Flight Test Manager.

For this discussion, a Test Director is named as leading the anomaly resolution team. Of course, any capable individual designated by management may be tasked to lead these efforts. A few anecdotes have been included to provide context for this discussion.

Define Test Disciplines for Laboratory and Bench Top Testing

Informal engineering testing on a bench top or in a small test lab, when there is no risk to personnel or facilities and where the risk to the Unit Under Test (UUT) and equipment is not significant, may fall outside of some of these formal disciplines. However, many of these test disciplines are still appropriate.

After starting a new position as a program manager, I was touring the facility’s laboratories when, on entering a small test lab, I was accosted by a test technician who explained that they had just experienced a test failure. I had already noted that test cabling was lying helter-skelter all over the floor without any cable trays. The engineers and technicians were stepping on this cabling, some of which was supporting the test.

After introducing myself I quietly suggested that stepping on the cables could damage the connectors as this would push or pull and twist the cabling. It soon became apparent that the UUT was receiving a lot of attention without the cabling or test equipment having been exonerated. So, I quietly suggested that this might be worth doing before attempting to find an assumed fault in the UUT. “This can prevent damage to the UUT when the test set up or a procedural error is at fault.” Several days later the same test technician found me at my desk and told me that they had finally looked at the test set up and found the cause of the failure in a damaged cable connector.

My program was not using this laboratory and these personnel did not work for me, so I could only make suggestions. But two weeks later I stopped by the laboratory again. It had been cleaned up with all cabling now installed in protective cable trays, and I also saw some procedures lying on the tables now. I had not needed to direct anything.

We begin with a test anomaly or failure, and we will assume that this test has been set up and conducted with approved written test procedures. Nothing beyond this should be assumed, not even that the test set up or the procedures are correct.

The first consideration in the event of an anomaly or failure is for the safety of personnel, the UUT, and any associated facilities and test support equipment. An anomaly may be defined as an unexpected result that may (or may not) require some form of corrective action that will allow the test to be continued. An unexpected result may be due to just a procedural or operator error that can easily be corrected without interrupting the test.

Alternatively, an unexpected result may be due to the failure of an electrical or other component or even a design error that may place the hardware into an unknown condition, a condition that could cascade into a more serious hardware failure if the test is continued. The immediate question to be answered is, does this situation pose a risk to the safety of personnel and/or to the test set up and UUT?

If the test team determines that the answer may be “Yes,” then the test is immediately halted, and the test set up and UUT are safed and secured while appropriate actions are being taken to ensure the safety of all personnel. But when an unexpected response is first observed without evidence of risks to personnel, facilities or the UUT, the test team must quickly determine how to proceed. The team must first verify that a test anomaly (or a failure) has in fact occurred.

Verify the Test Anomaly

The first step in investigating an anomaly (not a hardware failure) is to confirm that the anomaly is not due to a procedural or operator error. This may take place immediately (per procedure) or while the rest of the anomaly resolution team is assembling in the conference room, provided that confidence in the test set up and UUT can first be confirmed with the Test Director.

Using predefined procedures to back up to a safe and stable condition, the procedural steps that led to the anomaly are repeated. If a procedural or operator error is discovered, this information is reported to the anomaly resolution team. If the test team or anomaly resolution team can confirm that the anomaly was induced by this error and that the UUT and the test set up have not been exposed to harm, then, with the concurrence of the Test Director, the test can be resumed.

If instead this process results in the same test anomaly, then this test anomaly will have been verified. Finally, should this process result in proper execution without a repeat of the test anomaly and without detecting and correcting any procedural or operator error, then an unverified test anomaly has been observed.

This is typically the most difficult type of test anomaly to resolve. Among the possible causes are an operator error, an intermittent electrical connection, a marginal design condition, or a marginal or degraded electrical component, any of which may ultimately be found to be at fault. In this case the anomaly resolution team will have to consider all of these ambiguous possibilities, as the condition may recur at any time with potentially destructive consequences.
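The three possible outcomes of repeating the procedural steps can be summarized as a small decision function. The Python sketch below is only an illustration: the names `RerunOutcome` and `classify_rerun` are hypothetical, and treating a discovered error as taking precedence over everything else is a simplifying assumption rather than a rule from any test procedure.

```python
from enum import Enum


class RerunOutcome(Enum):
    """Hypothetical labels for the result of repeating the suspect steps."""
    ERROR_FOUND = "procedural or operator error found: report to the team"
    VERIFIED = "anomaly repeated: verified test anomaly"
    UNVERIFIED = "no repeat and no error found: unverified test anomaly"


def classify_rerun(error_found: bool, anomaly_repeated: bool) -> RerunOutcome:
    """Classify the result of repeating the steps that led to the anomaly."""
    # Assumption: a discovered error is always reported first.
    if error_found:
        return RerunOutcome.ERROR_FOUND
    # The same anomaly appeared again with no error found: it is verified.
    if anomaly_repeated:
        return RerunOutcome.VERIFIED
    # Clean execution with nothing found: the hardest case to resolve.
    return RerunOutcome.UNVERIFIED
```

For example, `classify_rerun(error_found=False, anomaly_repeated=False)` returns `RerunOutcome.UNVERIFIED`, the case that calls for the most caution before resuming the test.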

The Test Director will initiate the Anomaly Resolution Team while directing the Test Team to continue per its procedures to:

Confirm safety status of all personnel, the UUT, facilities and equipment,
Safe and secure the test configuration, and
Identify and capture primary data and other evidence of the anomaly or failure.

Confirm Safety Status of Personnel, UUT, Facilities and Equipment

All personnel must be accounted for with their current location and condition. In the event of fire, injuries or deaths, safing and securing shall be completed per the procedure and in coordination with emergency personnel. The status of the UUT, facilities and equipment will then be confirmed to the extent possible. Reentry of test facilities will be part of the safing and securing procedures in the test plan and in coordination with any emergency response personnel.

Safe and Secure the Test Configuration

In addition to facilitating the safety of personnel and equipment, safing and securing the test set up configuration also prevents inadvertent “tampering with the evidence” that may result in masking of the cause of the anomaly or failure. For example, simply disconnecting a connector that does not need to be disconnected may reseat a pushed pin and erase the very fault being investigated.

Identify and Capture Evidence of the Anomaly or Failure

The representative test leads and designated design engineering leads will be assembling in a predefined location or conference area to develop a troubleshooting plan. This leaves the majority of the test team, consisting of console operators, engineers and technicians, to identify and capture any and all relevant primary evidence of the anomaly. Screen captures that show anomalous readings are just one example, but a key one. This evidence will need to be gathered and provided to the test team at large and to the anomaly resolution team as quickly as possible. Without this primary evidence of the anomaly, the anomaly resolution team’s efforts can quickly get off track.

The Test Director will also direct initiation of the Anomaly Resolution Team[2]. The test team representatives and engineering design leads will assemble in a conference room with, as needed, audio or video teleconferencing capabilities. For a test incorporating multiple test assets at dispersed sites or laboratories, these conferencing capabilities are essential. The Test Director may or may not lead the Anomaly Resolution Team. Another lead may have been trained for this position while the Test Director remains with the Test Team.

Define Troubleshooting Ground Rules and the Test Sequence

The safety of personnel comes first, followed by the UUT, facilities and test equipment. Following assurance of the safety of all personnel including any emergency responders, the order of priority of the troubleshooting activities should be based upon the following objectives:

Control the test set up and the UUT configuration to prevent compromise of the cause of the anomaly or failure.
Reduce the risk of damage to the UUT by exonerating test cabling, test equipment and accessible UUT cabling before testing the UUT. This will include conducting troubleshooting in the order that reduces significant intrusion or modification of the test configuration for as long as possible.
In determining the sequence of troubleshooting, consider the respective values (in replacement costs and schedule impacts) of the test equipment and the UUT.

The sequence of the troubleshooting may prove to be critical. Alteration of the test configuration sooner than necessary may cover or confound the cause of the anomaly or failure. Also, damage could be induced in the UUT by investigating, for example, a potential UUT cause of the anomaly before exonerating all accessible test and UUT cabling and the full test set up.

Early in my career I learned that approximately 50% of test anomalies are due to the test set up, which includes all of the supporting facilities, test equipment and test cabling, with the other 50% being due to a problem internal to the UUT. Also, about 50% of these anomalies will ultimately be attributed to cabling and connector issues within either the test set up or the UUT as summarized in the table below.

Table 1: A Likely Failure Distribution

Probability of Finding the Cause of the Failure	Test Facilities & Equipment	UUT
Cabling	25%	25%
Non-Cabling Hardware	25%	25%
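The 50% figures quoted above are simply the row and column totals of Table 1. The short Python sketch below makes that arithmetic explicit. The dictionary layout and the `marginal` helper are illustrative assumptions, and the percentages are the rough rule-of-thumb values from the table, not measured data.

```python
# Table 1 as joint probabilities keyed by (cause type, location).
TABLE_1 = {
    ("cabling", "test set up"): 0.25,
    ("cabling", "UUT"): 0.25,
    ("non-cabling hardware", "test set up"): 0.25,
    ("non-cabling hardware", "UUT"): 0.25,
}


def marginal(table: dict, index: int, value: str) -> float:
    """Sum the cells whose key matches `value` at position `index`."""
    return sum(p for key, p in table.items() if key[index] == value)


p_test_set_up = marginal(TABLE_1, 1, "test set up")  # column total
p_uut = marginal(TABLE_1, 1, "UUT")                  # column total
p_cabling = marginal(TABLE_1, 0, "cabling")          # row total

print(f"test set up: {p_test_set_up:.0%}, UUT: {p_uut:.0%}, "
      f"cabling anywhere: {p_cabling:.0%}")
```

Each of the three totals comes to 50%, matching the split between test set up and UUT and the share attributed to cabling and connectors.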

Given this likely distribution, where would you start troubleshooting, and how would you proceed? To reduce the possibility of compromise of the cause of the anomaly and for the lowest risk of damage, the preferred sequence should be:

Test cabling and connectors,
Test equipment,
Accessible UUT cabling and connectors, and only then,
The UUT.

Generally, this sequence is reinforced by the respective values (cost and replacement schedule) of this equipment, which encourages performing the least intrusive testing first, starting with the test cabling.
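To see what this sequence buys, the rough percentages from Table 1 can be accumulated step by step. The Python sketch below is a simplified illustration. It assumes that all UUT cabling counts as "accessible" and that the UUT step covers the remaining non-cabling UUT hardware, and the name `cumulative_coverage` is hypothetical.

```python
from itertools import accumulate

# Preferred sequence, paired with the rule-of-thumb share from Table 1.
PREFERRED_SEQUENCE = [
    ("Test cabling and connectors", 0.25),
    ("Test equipment", 0.25),
    ("Accessible UUT cabling and connectors", 0.25),  # assumes all accessible
    ("UUT", 0.25),
]


def cumulative_coverage(steps):
    """Return (step, running total of likely causes examined) pairs."""
    labels = [label for label, _ in steps]
    totals = accumulate(share for _, share in steps)
    return list(zip(labels, totals))


coverage = cumulative_coverage(PREFERRED_SEQUENCE)
for label, total in coverage:
    print(f"{label}: {total:.0%}")
```

Under these assumptions the running totals are 25%, 50%, 75% and 100%: half of the likely causes are examined before the UUT is touched at all, and three quarters before any intrusive UUT work. Because the four shares are equal, the ordering is driven by intrusiveness and risk rather than by probability.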

Exonerate the Test Set Up Cabling and Connectors

Over thirty years ago we were routinely using Time Domain Reflectometry (TDR) to isolate possible breaks or loose connections in the cabling for the test set up and for the UUT in high value programs. (In most cases we had connector pinouts with rough TDR baselines for the cabling at least for the UUT prior to entering system level testing.)
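For readers unfamiliar with TDR, the instrument sends a fast pulse down the cable and times the reflection from any impedance discontinuity. The distance to the discontinuity is half the round-trip time multiplied by the propagation speed in the cable. The Python sketch below illustrates that arithmetic. The function names are hypothetical, and the velocity factor of 0.66 and the 50-ohm impedance used in the note afterward are illustrative assumptions, not values from any program described here.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def distance_to_fault_m(round_trip_s: float, velocity_factor: float) -> float:
    """Distance to a reflection, from round-trip time and velocity factor."""
    if round_trip_s < 0 or not 0 < velocity_factor <= 1:
        raise ValueError("need round_trip_s >= 0 and 0 < velocity_factor <= 1")
    # The pulse travels out and back, so halve the total path length.
    return velocity_factor * SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def reflection_coefficient(z_load_ohm: float, z0_ohm: float) -> float:
    """(Z_load - Z0) / (Z_load + Z0): +1 for an open, -1 for a short."""
    if math.isinf(z_load_ohm):
        return 1.0  # limiting value for an open circuit (broken conductor)
    return (z_load_ohm - z0_ohm) / (z_load_ohm + z0_ohm)
```

With these assumed values, a 100 ns round trip places the discontinuity roughly 9.9 m from the instrument. A reflection coefficient near +1 suggests an open such as a break, and one near -1 suggests a short. Comparing the location and sign of a reflection against a baseline trace is what points to a specific connector or cable run.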

The test execution team may be able to obtain permission from the Test Director to proceed with limited testing of relevant Test Facility and Equipment cabling and connectors while the troubleshooting plan is being developed. This can save time while satisfying some of the ground rules above.

Exonerate Selected Test Equipment and Other Facility Support

The primary objective at this point is not to fix the problem, but to identify the root cause. Typically, the test execution team will perform, with the approval of the Test Director, at most only limited test equipment checks by repeating existing procedural steps to confirm proper set up of the test equipment. Going beyond this at this stage can create unacceptable risks for high value, high visibility test programs.

Once a problem is found in the test cabling or test equipment set up, analysis must be done to show if and how this problem contributed to the anomaly. Only then can this isolated problem be corrected, and the previously verified anomaly reconfirmed or exonerated.

Focus on Technical Leadership

I’m going to digress for a moment before we initiate the anomaly resolution team’s efforts. I’d like to discuss the difference between “facilitation” and “technical leadership” as it applies to leading an anomaly resolution team of this type, or more broadly, to the leadership of any complex task team. A technical leader/advisor often uses facilitation skills, but the skills possessed by a technical leader go well beyond facilitation.

A facilitator focuses on process. His or her objective is to keep everyone on track with a singular focus. A technical advisor may sit in a meeting and say nothing for 20 minutes and then make an observation that changes the whole direction of the meeting. The focus of a technical leader or advisor is on the substance of the meeting from a very broad perspective. The situation of the design engineers, who typically make up most of the participants on an anomaly resolution team, is very different.

Design engineers invariably feel that they “know” what happened and immediately want to jump to a root cause that they “know” is the reason for the failure. These design engineers are very knowledgeable about their sub-disciplines. That’s their job, to focus on their design. But they are often out of their element in another’s area of expertise.

Engineers are also often forced to diversify, and some do as a matter of course. My primary area of expertise became trajectory analysis, applicable to missiles and spaceflight, but not much else. Still, as a systems engineer, I learned to lead troubleshooting efforts. As a project or program manager, I would lead more complex troubleshooting efforts encompassing multiple sites for support, or for launch conduct. This would also translate into problem resolution on a broader business scale. Often I would sit in a meeting just listening, putting what I was hearing into a broader context, and I learned to pay a lot of attention to the speakers: not just to what they said, but to how they said it.

Was this speaker showing a concern that he or she just couldn’t put into words that others would understand? This would often raise questions in my mind. But note that as a technical leader or advisor, my questions would not be general, but cuttingly precise to help this person characterize this concern. General questions might keep the speaker talking but will not help with pinning down the concern. My objective was to draw out the speaker to clarify his or her concern so that everyone (including myself) could understand its implications. As a technical leader you need both solid general technical knowledge and practiced skill in asking questions of this kind.

Problem resolution often entails asking the right questions and insisting on answers that everyone, especially the technical leader, who is not usually a design engineer in this area, can understand and restate. This forces analysis in plain language that often leads the discussion toward another related discipline. Another engineer, the one I’ve asked questions of, or his counterpart, gets involved and runs with it, until another engineer picks it up, or I have to redirect it again.

Another visible distinction between a technical leader and a facilitator may arise because of a technical disagreement. A facilitator will have conflict resolution skills, but these skills may be overwhelmed when the conflict is over a technical issue.

In this situation the loudest voice may win out, especially when the loudest voice is an “expert” in a related area. The risk here is the same cavalier dismissal described for design engineers before being confronted with firsthand data (which may not always be as compelling as we’d like). The technical leader/advisor cannot allow a team member to overwhelm the discussion in this manner, no matter how knowledgeable and respected he or she is. To prevent this, the advisor may have to drive the technical discussion.

I can’t give you a better explanation of what being a technical leader/advisor means, except that you will have a breadth of experience and the patience to explore the context, then find and ask the right questions to engage others who may lead the team in another direction. Sometimes the only arrow I have in my quiver is hard logic with a good dose of patience, and sensitivity…and tenacity. I can’t describe it any other way. But I’m often very successful. I assume that you have or will develop a similar skillset.

Dan Harrison is an SMA Principal Associate in our Technical Management & Engineering Services Practice, and has over 35 years of experience in aerospace engineering. You can read the second part of this article here.

[1] The INCOSE System(s) of Systems Primer is available for free download from the INCOSE Store at their website: https://www.incose.org/products-and-publications/sos-primer

[2] The meaning of the word “anomaly” as used by the Anomaly Resolution Team is broadened to include failures.
