When Failure is Not an Option
Gene Kranz’s memoir Failure Is Not an Option describes the responsibility of NASA engineers to ensure the reliability of the technical systems of the US Space Program. Many industries share this imperative: everything must be done to prevent asset failure. The turbines that supply the power grid must not fail. The cooling systems for nuclear reactors must not fail. The engines and turbines that power military vehicles and vessels deployed in the field must not fail. The critical manufacturing assets at the heart of production processes must not fail. When one of these critical assets does fail, the unplanned downtime can cost millions of dollars in lost revenue, and supply chain issues for replacement parts can extend those losses to many months or even a year.
Because machine failure is precisely what reliability teams must avoid, it’s easy to understand why these industries focus their predictive maintenance strategies on asset failure modes. Here are three major reasons this approach can fail:
Insufficient Data for Comparable Machine Failures
To generalize and model the conditions leading to a failure, you need many examples of each specific failure mode. The first difficulty is that data is sparse, because failures of these critical assets are rare. The second is that failure modes in a complex asset like a turbine are localized to a specific subsystem, and within each subsystem to a single component: a valve, a gasket, and so on. Understanding and modeling the valve’s failure modes, for example, does not transfer to the failure modes of the gasket or any other component in the subsystem.
Unpredictability of Machine Failures
A comprehensive study by NASA showed that 82% of assets display a random failure pattern, which means that only 18% of assets fail in a pattern amenable to failure-mode modeling. The same study showed that even a simple ball-bearing failure cannot be predicted using time-based maintenance strategies.
New Failure Modes Happen
In the domain of cyber intrusion detection, one of the most damaging types of attack is the zero-day attack: an attack that exploits a vulnerability not yet known to defenders and therefore not covered by existing detection rules. Critical assets face the same problem. New and unanticipated failure modes cannot be modeled in advance, and they can bring down aircraft, networks, and critical-to-operation assets.
Why is Amber Different?
The most significant paradigm shift in deploying Amber for failure-is-not-an-option assets is this: rather than modeling individual failure modes, Amber focuses on providing the most accurate measurement of asset health available.
Amber provides real-time, ML-based measurements of asset health. For each sensor fusion vector posted to it, Amber responds with more than ten machine-learning outputs describing the asset’s health. The simplest is the Amber Warning Level, a three-state measurement (0 = compliant, 1 = changing, 2 = critical) that can be displayed on a scorecard or dashboard and used to trigger maintenance alerts.
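As a sketch of how a dashboard or alerting layer might consume that value: the numeric codes (0 = compliant, 1 = changing, 2 = critical) come from the description above, but the function and field names here are illustrative, not part of any official Amber client API.

```python
# Hypothetical scorecard mapping for the Amber Warning Level.
# The 0/1/2 codes are from the product description; everything else
# (names, alert policy) is an illustrative assumption.

WARNING_LABELS = {0: "compliant", 1: "changing", 2: "critical"}

def dashboard_state(warning_level: int) -> dict:
    """Translate a warning level into a scorecard entry and an alert flag."""
    label = WARNING_LABELS.get(warning_level, "unknown")
    return {
        "label": label,
        # One possible policy: page the maintenance team only on "critical".
        "trigger_maintenance_alert": warning_level == 2,
    }

print(dashboard_state(0))  # compliant, no alert
print(dashboard_state(2))  # critical, alert triggered
```

A team might instead choose to alert on level 1 as well, trading earlier warning for more frequent notifications; that policy choice lives entirely on the consumer side.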
Amber trains its models on compliant asset behavior. For failure-is-not-an-option assets, reliability teams usually have a large database of historical sensor telemetry, most of which reflects compliant behavior. To be clear, compliant does not mean the asset is “perfect” or “brand new.” Compliant assets can have diverse maintenance histories, ages, and operating environments. Compliant means “nominal” in the NASA sense: the asset is functioning within its normal, operational range of variation. With Amber there is no need to create labeled data sets for each possible failure mode. Because Amber learns the compliant behavior of each asset, any developing failure mode will begin generating anomaly alerts, even one never previously identified as a possible failure mode of the asset.
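The idea of learning only compliant behavior, then flagging anything outside it, can be illustrated with a deliberately simple baseline. This is not Amber’s algorithm; it is a toy per-sensor envelope intended only to show why no failure-mode labels are needed.

```python
# Illustrative sketch (not Amber's actual ML): learn an envelope of
# "compliant" behavior from historical telemetry, then flag any new
# sensor vector that falls outside it. No failure-mode labels required.
from statistics import mean, stdev

def learn_envelope(compliant_vectors, k=3.0):
    """Per-sensor (mean - k*std, mean + k*std) bounds from compliant data."""
    bounds = []
    for readings in zip(*compliant_vectors):  # transpose: one series per sensor
        m, s = mean(readings), stdev(readings)
        bounds.append((m - k * s, m + k * s))
    return bounds

def is_anomalous(vector, bounds):
    """True if any sensor reading leaves its learned compliant range."""
    return any(not (lo <= v <= hi) for v, (lo, hi) in zip(vector, bounds))

# Compliant history: two hypothetical sensors hovering near 10.0 and 50.0.
history = [[10.0 + 0.1 * i, 50.0 - 0.1 * i] for i in range(-5, 6)]
bounds = learn_envelope(history)
print(is_anomalous([10.1, 49.9], bounds))  # False: within normal variation
print(is_anomalous([10.1, 72.0], bounds))  # True: behavior never seen before
```

The second vector is flagged even though no one ever defined “sensor 2 reads 72.0” as a failure mode; novelty relative to the compliant baseline is the only criterion.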
Amber trains very high-dimensional models of compliant asset behavior. Telemetry from a complex asset is, not surprisingly, complex: there are no simple statistical relations between telemetry values. (See our blog on statistical models.) In our experience with many critical assets, it is common for Amber, during its self-training, to create hundreds of clusters to capture the normal variation in asset behavior.
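The cluster-based idea can be sketched with a one-pass “leader” clustering, a standard textbook technique. Again, this stands in for the concept only; Amber’s internals, distance measure, and cluster counts are not shown here, and the telemetry values are invented for illustration.

```python
# Sketch of cluster-based modeling of normal behavior (illustrative only):
# a new cluster is created whenever a telemetry vector is far from every
# existing cluster center; at inference time, a vector far from all learned
# clusters is an anomaly.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leader_cluster(vectors, radius):
    """One-pass clustering: each center 'leads' the vectors within radius."""
    centers = []
    for v in vectors:
        if all(dist(v, c) > radius for c in centers):
            centers.append(v)  # novel region of behavior -> new cluster
    return centers

def is_anomalous(vector, centers, radius):
    return all(dist(vector, c) > radius for c in centers)

# Two operating regimes of a hypothetical two-sensor asset.
telemetry = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]]
centers = leader_cluster(telemetry, radius=1.0)
print(len(centers))                            # 2 clusters of normal behavior
print(is_anomalous([9.0, 9.0], centers, 1.0))  # True: outside both regimes
```

With real assets the same mechanism produces far more clusters, because a complex asset’s normal variation spans many operating regimes rather than two.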