A Comparison of Residual-based Methods on Fault Detection

An important initial step in fault detection for complex industrial systems is gaining an understanding of their health condition. Subsequently, continuous monitoring of this health condition becomes crucial to observe its evolution, track changes over time, and isolate faults. As faults are typically rare occurrences, it is essential to perform this monitoring in an unsupervised manner. Various approaches have been proposed not only to detect faults in an unsupervised manner but also to distinguish between different potential fault types. In this study, we perform a comprehensive comparison between two residual-based approaches: autoencoders, and input-output models that establish a mapping between operating conditions and sensor readings. For both methods, we explore sensor-wise residuals and residuals aggregated over the entire system. The performance evaluation focuses on three tasks: health indicator construction, fault detection, and health indicator interpretation. To perform the comparison, we utilize the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dynamical model, specifically a subset of the turbofan engine dataset containing three different fault types. All models are trained exclusively on healthy data. Fault detection is achieved by applying a threshold determined from the healthy condition. The detection results reveal that both models are capable of detecting faults with an average delay of around 20 cycles while maintaining a low false positive rate. While the fault detection performance is similar for both models, the input-output model provides better interpretability regarding the potential fault types and the possibly faulty components.


INTRODUCTION
Determining the health state of complex industrial systems, such as turbofan engines, under different operating conditions has become feasible due to the abundance of condition monitoring data collected by diverse sensors. A health state is usually described by a health indicator or a condition indicator, which is a value that reflects system health conditions and health status in a predictable way as a system degrades (Lei et al., 2018). In complex systems, inferring these indicators and monitoring their evolution over time provide a more comprehensive understanding of the system's health and enable effective condition monitoring.
Typically, a distinction is made between condition indicators and health indicators. A condition indicator refers to a specific feature within system data that exhibits predictable changes as the system undergoes degradation or operates in different operational modes (Fink et al., 2020). It encompasses any feature that proves valuable in differentiating normal operation from faulty operation, or any deviation from normal operation. Health indicators, in contrast, integrate multiple condition indicators into a single value, providing the end user with a comprehensive health status of the component.
Different approaches have been proposed to extract and learn the condition and health indicators of a system. These approaches can be categorised into three main categories: feature-based, one-class classification-based (OCC-based), and residual-based methods. Feature-based methods primarily focus on condition indicators. These methods identify relevant features that exhibit predictable changes as the system deteriorates, and they detect early-stage faults by directly applying the threshold method to the feature values. For instance, the relative root mean square (RMS) value of the acceleration signals from bearings serves as an indicator of wear evolution (Pan, Meng, Chen, Gao, & Shi, 2020). While this approach is straightforward, it requires expert knowledge and can be sensitive to varying operating conditions (Saufi, Ahmad, Leong, & Lim, 2019).
While the feature-based methods for extracting condition indicators focus on expert-based determination of one or several features that capture the condition evolution of different components, OCC-based methods focus on learning a global indicator that represents the health state of the system (Michau, Hu, Palmé, & Fink, 2020). OCC-based methods are particularly suitable for setups where faulty samples are missing during training, as they are trained on data from one class (usually healthy data). While OCC outputs can provide binary health information (healthy or unhealthy), measuring the distance to the healthy data can effectively infer the evolution of the degradation. This distance can be interpreted as a health indicator, which can also be derived for subsystems by considering a subset of condition monitoring signals related to the specific subsystem. It can then be utilized to monitor the evolution of health conditions, detect anomalies, or distinguish between different severity levels of faults (Michau, Palmé, & Fink, 2017).
The third direction encompasses residual-based methods, which extract health indicators based on residuals. The residuals are the differences between the measured values and the predicted outputs, serving as indicators of any deviation from the healthy training dataset (Arias Chao, Kulkarni, Goebel, & Fink, 2019). These methods can be categorised into two main types: autoencoders and input-output models. Autoencoders are trained to reconstruct their own inputs, whereas input-output models establish mappings between operating conditions and sensor readings. For example, in the case of a turbofan engine, operating conditions are used as inputs to the full authority digital electronic control (FADEC) to derive monitored signals as health indicators (Rausch, Goebel, Eklund, & Brunell, 2007). Both input-output models and autoencoders are typically trained solely on healthy data and, as a result, learn the healthy data distribution. Consequently, when presented with anomalous samples stemming from a different data distribution, they generate significant residuals.
In residual-based methods, there are various approaches to calculating residuals, particularly for aggregating the residuals of multivariate condition monitoring signals. One of the most commonly used methods for aggregating residuals is to compute their sum, offering a comprehensive representation of the overall global health condition (Guo, Yu, Duan, Gao, & Zhang, 2022). Another approach is to bypass the aggregation and instead use the residuals individually. By analyzing the residuals individually, it becomes possible to identify the specific signals most affected by each fault type (Reddy, Sarkar, Venugopalan, & Giering, 2016; Michau et al., 2020). This approach enables fault segmentation and fault diagnostics, as different faults tend to impact distinct sets of signals.
However, since residual-based models are trained solely on healthy data and residuals are calculated based on the distance to the training data distribution, they are unable to differentiate between a fault and a new operating condition. In other words, high residuals may not necessarily indicate deteriorating health conditions of a system but rather the presence of a novel operating condition. This presents a significant challenge in accurately inferring the health state or conducting further downstream tasks, such as fault detection and fault segmentation.
While several residual-based approaches have been applied to different case studies (Arias Chao et al., 2019; Lövberg, 2021; Darrah, Lövberg, Frank, Biswas, & Quinones-Gruiero, 2022), to the best of our knowledge, their performances have not been compared. In this study, we compare two residual-based methods: autoencoders and input-output models. We use a simulated turbofan dataset with faults in three different engine components, each exhibiting degradation behavior. We evaluate the performance of the two methods by first constructing the health condition using two types of residuals as health indicators. Subsequently, we perform fault detection and interpret the constructed health indicators.

METHOD
We present in this section the overall proposed testing framework, as summarised in Figure 1. First, in Section 2.1, we present two strategies for calculating residuals, enabling us to identify instances when the data distribution deviates from the healthy distribution. Second, we describe how the residuals can be used to construct health indicators in Section 2.2. We show in Section 2.3 how we infer the fault initiation from the constructed health indicators.

Residual Calculating Models

Autoencoder Model (AE Model)
One commonly used residual-based model is the autoencoder. It aims to encode inputs into a latent space with the encoder E_θe(·) while preserving important information, and then decode them back to their original form using the decoder D_θd(·). Here, θ_e and θ_d represent the model parameters of the encoder and decoder, respectively. In our case, we consider a multivariate dataset containing several sensors z_t ∈ R^(N_z), where t is the time index and N_z is the number of sensors. To learn the distribution of the healthy samples, the autoencoder is trained exclusively on samples captured during the early stages of the system's lifecycle, denoted as t ∈ {1, ..., T_H}. In this period, we assume that the system is in a healthy state. We denote by r_ae the residual of the AE model, which represents the difference between the output and input signal. Mathematically, it can be written as:

r_ae,t = D_θd(E_θe(z_t)) − z_t (1)

By training the autoencoder, we aim to find the parameters θ_e and θ_d that minimise the residuals in terms of mean square error. With ||·||_F the Frobenius norm, the optimisation problem of the autoencoder model can be written as:

θ_e*, θ_d* = argmin_{θ_e, θ_d} Σ_{t=1}^{T_H} ||D_θd(E_θe(z_t)) − z_t||_F^2 (2)

Operating-conditions-based Model (OC Model)

In addition to the autoencoders, we also evaluate an input-output method that maps the operating conditions to the sensor readings (Lövberg, 2021; Darrah et al., 2022). We refer to this model as the operating-conditions-based model (OC Model). The OC model is based on operating condition descriptors that characterize the state of a system. For instance, in an industrial bearing, these descriptors include rotating speed and static loading. For a turbofan engine, the considered state descriptors include altitude, flight Mach number, throttle-resolver angle, and the total temperature at the engine fan inlet. The multivariate time series z_t can be subdivided into operating condition descriptors w_t ∈ R^(N_w) and sensor readings x_t ∈ R^(N_x). Here, N_w and N_x represent the number of operating condition descriptors and sensor readings, respectively, and N_z = N_w + N_x. The OC model, denoted as M_θm(·), aims to establish a mapping between the operating conditions and sensor readings by learning the functional relationship between the two. The OC Model is expected to be more robust to variations in operating conditions than the autoencoder, where the operating condition descriptors are part of the reconstructed signals. We define the residual of the OC Model as the difference between the estimated and real sensor readings:

r_oc,t = x̂_t − x_t (3)

where the estimated sensor readings are obtained from the learned mapping:

x̂_t = M_θm(w_t) (4)

The OC model is trained by finding the parameters θ_m that minimise the residuals. The corresponding optimisation problem can be written as follows:

θ_m* = argmin_{θ_m} Σ_{t=1}^{T_H} ||M_θm(w_t) − x_t||_F^2 (5)
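To make the two residual definitions concrete, the sketch below computes both on synthetic healthy data. A rank-k linear autoencoder (truncated SVD) and a linear least-squares map stand in for the trained neural networks, and all dimensions and signals are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical healthy data: 4 operating-condition descriptors w and
# 14 sensor readings x (sizes are illustrative only).
T_H, N_w, N_x = 200, 4, 14
w = rng.normal(size=(T_H, N_w))
x = w @ rng.normal(size=(N_w, N_x)) + 0.01 * rng.normal(size=(T_H, N_x))
z = np.hstack([w, x])                 # full multivariate signal z_t

# AE residual: a rank-k linear autoencoder (truncated SVD) stands in for
# the neural encoder/decoder pair; it reconstructs z from k components.
k = 4
U, s, Vt = np.linalg.svd(z - z.mean(axis=0), full_matrices=False)
z_hat = (U[:, :k] * s[:k]) @ Vt[:k] + z.mean(axis=0)
r_ae = z_hat - z                      # difference between output and input

# OC residual: a linear least-squares map w -> x stands in for M.
M, *_ = np.linalg.lstsq(w, x, rcond=None)
r_oc = w @ M - x                      # estimated minus real sensor readings
```

On healthy data both residuals stay near the noise level; a fault that shifts the sensor distribution away from the training data would inflate them.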

Health Indicators
In this work, we consider the residuals defined in Equation 1 and Equation 3 as a basis for computing the health indicators. We assume that the training dataset is representative of all the operating conditions. Consequently, changes in operating conditions will not be detected as anomalies, and an increase in the magnitude of the residuals will be associated with faulty system conditions.
We first consider two aggregated health indicators, denoted as h_A-AE and h_A-OC, which represent the norm of the residuals for the AE and OC models, respectively. These indicators combine the residual information from each sensor and can be written at any time t as follows:

h_A-AE(t) = ||r_ae,t|| (6)

h_A-OC(t) = ||r_oc,t|| (7)

We also propose two sensor-wise multivariate health indicators, denoted as h_S-AE and h_S-OC. These indicators correspond to the absolute residuals of the AE and OC models, respectively. By considering sensor-wise information, we aim to have indicators that are easier to interpret and more precise for the fault detection task. Using the absolute value operator |·|, the health indicators h_S-AE and h_S-OC can be written at any time t and for sensor i as follows:

h_S-AE,i(t) = |r_ae,t,i| (8)

h_S-OC,i(t) = |r_oc,t,i| (9)
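A minimal sketch of the two indicator types, assuming a residual matrix r with one row per time step and one column per sensor (the values are invented for illustration):

```python
import numpy as np

# Residual matrix (time steps x sensors), e.g. from the AE or OC model.
r = np.array([[0.1, -0.2, 0.2],
              [0.5, -1.0, 0.0]])

h_A = np.linalg.norm(r, axis=1)   # aggregated indicator: norm over all sensors
h_S = np.abs(r)                   # sensor-wise indicators: per-sensor |residual|
```

h_A yields a single trajectory per unit, while each column of h_S can be monitored and thresholded independently.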

Fault Detection
We propose to use a fault detection algorithm based on a threshold determined by the reconstruction performance of the models on the healthy validation dataset. Considering any of the previously presented health indicators h, we define the mean μ_i and standard deviation σ_i characterising the healthy condition for sensor i over the T_V samples of the healthy validation set as follows:

μ_i = (1/T_V) Σ_t h_i(t),  σ_i = sqrt((1/T_V) Σ_t (h_i(t) − μ_i)^2) (10)

Note that for an aggregated health indicator, there is only one set of statistics μ and σ that needs to be computed. Thus, we can define the threshold τ_i for sensor i as:

τ_i = μ_i + k σ_i (11)

where k is a fixed sensitivity factor. We also divide the time index into C cycles. A cycle is denoted as n_c, and it corresponds to a segment of the time samples t ∈ {T_{c−1}, T_{c−1}+1, ..., T_c − 1}. A cycle can correspond to a full rotation of a bearing or the flight duration of a turbofan engine. The average health indicator during cycle n_c for sensor i is denoted as h̄_i(n_c) and is calculated as follows:

h̄_i(n_c) = (1/(T_c − T_{c−1})) Σ_{t=T_{c−1}}^{T_c − 1} h_i(t) (12)

To avoid false alarms, we introduce the waiting cycle number N_wait. The fault is detected and the alarm is raised only when, for at least one sensor i, the corresponding averaged health indicator h̄_i(n_c) is larger than the threshold τ_i for N_wait consecutive cycles. For convenience, we denote n_0 as the cycle where the fault is detected and the alarm is raised.
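The threshold-plus-waiting-cycle rule can be sketched as follows. The multiplier k on the standard deviation is an assumed sensitivity factor, and the function name is ours; it returns the first cycle of the qualifying run of exceedances.

```python
import numpy as np

def detect_fault(h_cycles, mu, sigma, k=3.0, n_wait=3):
    """Raise the alarm when at least one sensor's cycle-averaged health
    indicator exceeds tau_i = mu_i + k * sigma_i for n_wait consecutive
    cycles. h_cycles: array of shape (cycles, sensors). Returns the first
    cycle index of the run, or None if no fault is detected."""
    tau = mu + k * sigma
    above = (h_cycles > tau).any(axis=1)   # any sensor above its threshold
    run = 0
    for c, flag in enumerate(above):
        run = run + 1 if flag else 0
        if run == n_wait:
            return c - n_wait + 1
    return None
```

With mu and sigma estimated on the healthy validation data, a brief single-cycle spike does not trigger the alarm, which is the purpose of N_wait.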

CASE STUDY
We evaluate the proposed framework on the New Commercial Modular Aero-Propulsion System Simulation (N-CMAPSS) turbofan engine dataset (Arias Chao et al., 2021). The entire dataset is partitioned into multiple sub-datasets, each comprising run-to-failure trajectories of several units affected by distinct fault types. In this work, we focus on sub-datasets DS04, DS05, and DS07. These sub-datasets are chosen because their units are impacted by fault types that affect only a single component, rendering them well-suited for evaluating fault segmentation performance. Other subsets contain units affected by fault types that involve multiple components. Specifically, the faulty component is the fan for DS04, the high-pressure compressor (HPC) for DS05, and the low-pressure turbine (LPT) for DS07. Each sub-dataset contains 10 turbofan engines with the same fault type.
Table 1. Sensor readings x and operating condition descriptors w in the N-CMAPSS dataset, with their descriptions and corresponding units (e.g., x1, T24: total temperature at LPC outlet).

Pre-processing
All sensors undergo a downsampling process by a factor of 10 to reduce data size and computational costs. Each sensor reading and each descriptor is standardised to have zero mean and unit standard deviation. This standardisation is carried out on the training set, and the resulting parameters are then applied to the test and validation sets. The training, validation, and test setup is explained in Section 3.3. For this study, we solely consider the cruising phase of the flight, as it exhibits more stable behavior than the take-off or landing phases. The cruising phase is defined as the period when the normalised flight altitude exceeds 0.85, where the normalised flight altitude is obtained by dividing all altitude values by the highest altitude within the cycle. The fault detection waiting cycle N_wait is fixed at 3 cycles.
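The pre-processing steps can be sketched as below; the function names are our own, and the standardisation statistics are deliberately fitted on the training split only and re-applied elsewhere, as described above.

```python
import numpy as np

def downsample(sig, factor=10):
    """Keep every `factor`-th sample to reduce data size."""
    return sig[::factor]

def standardise(train, other):
    """Zero mean / unit std per channel; statistics are fitted on the
    training set only and re-applied to validation and test data."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sd, (other - mu) / sd

def cruise_mask(altitude):
    """Select the cruising phase: samples where the altitude, normalised by
    the highest altitude within the cycle, exceeds 0.85."""
    return altitude / altitude.max() > 0.85
```

Applying `cruise_mask` per cycle before training keeps only the stable flight segment the models are meant to learn.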

Applied Neural Network Architectures
For the OC model, we use two 128-neuron hidden layers. For the AE model, we consider three hidden layers with 128, 8, and 128 neurons, respectively. All activation functions used in the models are rectified linear units (ReLU), except for the final layers of both models, which employ a linear activation function.
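A compact stand-in for the two architectures using scikit-learn's MLPRegressor, whose hidden layers use ReLU and whose output layer is linear, matching the description above. The framework choice and the random training data are assumptions; the paper does not name an implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 4))     # operating condition descriptors (illustrative)
x = rng.normal(size=(64, 14))    # sensor readings (illustrative)
z = np.hstack([w, x])            # full signal for the AE model

# Two 128-neuron hidden layers for the OC model; 128-8-128 for the AE model.
oc_model = MLPRegressor(hidden_layer_sizes=(128, 128), activation="relu",
                        max_iter=10)
ae_model = MLPRegressor(hidden_layer_sizes=(128, 8, 128), activation="relu",
                        max_iter=10)

oc_model.fit(w, x)   # map operating conditions to sensor readings
ae_model.fit(z, z)   # reconstruct the full signal from itself

r_oc = oc_model.predict(w) - x   # OC residuals
r_ae = ae_model.predict(z) - z   # AE residuals
```

The 8-neuron bottleneck of the AE model forces a compressed latent representation, whereas the OC model never sees the sensor readings at its input.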

Training Setup
The training was performed for 70 epochs with a batch size of 64, an early-stopping patience of 10 epochs, and the Adam optimizer (Kingma & Ba, 2014) with β_1 = 0.9, β_2 = 0.999, and a learning rate of 0.001. We arbitrarily selected the first 16 cycles of each unit as the healthy data for training. The models are trained using healthy data from all 30 units from DS04, DS05, and DS07. The remaining cycles are then assigned to the test set for evaluation. Within the training set, we randomly select 15% as a validation set for deciding the early-stopping epoch. For each setting, we train the models 5 times with the validation set randomly split, and the results are presented as the average over the 5 realisations.

Evaluation Metrics
To assess the fault detection results of a single engine or unit u, we consider the detection delay d_u, computed as the difference between the cycle that raises the alarm, n_0, and the ground-truth fault initiation cycle, n_true:

d_u = n_0 − n_true (13)

In case the detection delay is negative (d_u < 0), the alarm is raised before the fault occurs, which corresponds to a false positive alarm. An effective detection algorithm should avoid generating false positive alarms, as they lead to the unnecessary consumption of resources. As a second metric to evaluate the fault detection algorithm, we propose the false positive rate (FPR). The FPR corresponds to the number of units with a negative detection delay relative to the total number of units.
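A minimal sketch of both metrics, assuming the per-unit alarm cycles n_0 and ground-truth fault cycles n_true are given (the helper name is ours):

```python
import numpy as np

def detection_metrics(n_true, n_detect):
    """Per-unit detection delay d_u = n_0 - n_true, and the false positive
    rate: the fraction of units whose alarm fires before the true fault."""
    d = np.asarray(n_detect) - np.asarray(n_true)
    return d, float((d < 0).mean())

d, fpr = detection_metrics([24, 24], [40, 20])   # delays [16, -4], FPR 0.5
```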
Additionally, the silhouette score (Rousseeuw, 1987) is applied to evaluate the clustering results for fault segmentation. The score measures the similarity of a sample to its own cluster compared to the other clusters. It is calculated for a sample k using the mean intra-cluster distance d_intra,k and the mean nearest-cluster distance d_nearest,k and is defined as follows:

s_k = (d_nearest,k − d_intra,k) / max(d_intra,k, d_nearest,k) (14)

The score is close to 1 when the clusters are well separated, around 0 when they overlap, and negative when samples lie closer to another cluster than to their own. We take the mean of the scores over all samples.

RESULTS

Health Indicator Construction

The aggregated health indicators h_A obtained from both models, h_A-OC and h_A-AE, for DS07 unit 7 are shown at the top of Figure 4. This unit is randomly selected for visualization purposes. The fault occurs at cycle 24 and is detected at cycle 40 when the fault detection algorithm is applied to h_A-OC, and at cycle 52 for h_A-AE. Both health indicators remain constant for approximately 10 to 15 cycles even after the fault occurs. This could be because the fault initially starts with mild severity; as time progresses, it deteriorates and the health indicator increases at a faster rate. The bottom of Figure 4 displays both OC model health indicators, h_A-OC and h_S-OC. The sensor-wise residuals h_S-OC exhibit different degradation rates for different sensors. Furthermore, some trajectories show exponential behavior, increasing faster than the aggregated h_A-OC health indicator. This indicates that specific sensors exhibit the fault behavior before others.

Fault Detection Performance
We evaluate the fault detection performance by applying the proposed fault detection algorithm to the proposed health indicators. The detection delay d_u of each unit, the average detection delay of each model, and the FPR are provided in Table 2. On average, the aggregated health indicators h_A raise an alarm 24.2 cycles and 33.4 cycles after fault initiation for the OC and AE models, respectively. In this case, the FPR is zero, indicating that these indicators are robust against false alarms. However, no fault is detected for unit 3 of the DS04 dataset. The sensor-wise health indicators h_S raise alarms at earlier cycles, with an average detection delay of 15.5 cycles for the OC model and 17.3 cycles for the AE model. The detection occurs earlier than with the aggregated health indicators, primarily because specific sensors exhibit faulty behavior first. However, the sensor-wise health indicators are more sensitive to false alarms, as the FPR is not zero.

Sensor-wise Health Indicator Visualization
We visualize the normalised sensor-wise health indicator of each unit in a low-dimensional space using the first two principal components (PC1 and PC2) from Principal Component Analysis (PCA) in Figure 6. The visualization is performed at 10 cycles after the fault is detected (n_0 + 10). The value of 10 cycles is chosen to strike a balance: avoiding reaching the end-of-life while ensuring that the fault behavior is exhibited by multiple sensors rather than just one. The colors in the visualization are assigned based on the ground-truth fault type, which is not available in reality. In the left figure, representing the OC model, units with different fault types form distinct clusters. In the case of the AE model, however, the faults are mixed, and the clusters do not align with specific fault types.
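The projection used for this visualization can be sketched with scikit-learn; the health-indicator matrix here is random placeholder data with one row per unit, and the dimensions are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Normalised sensor-wise health indicators: one row per unit, one column
# per sensor (placeholder values; 30 units and 14 sensors are assumptions).
h_S = np.random.default_rng(1).normal(size=(30, 14))

pcs = PCA(n_components=2).fit_transform(h_S)   # (PC1, PC2) per unit
```

Each unit then appears as a single 2-D point, which can be color-coded by its ground-truth fault type for inspection.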
As an alternative to the sensor-wise health indicator, we also consider the visualization of the output of the latent-space embedding layer of the AE model, referred to as AE-Embedding. However, in this case as well, units with similar fault types do not form coherent clusters. The silhouette scores at different numbers of cycles after the fault is detected are plotted in Figure 5, using both sensor-wise health indicators h_S-OC and h_S-AE as inputs. The silhouette score of the OC model is consistently higher than that of the AE model for all cycles, indicating better clustering results. This superiority is also evident in Figure 6, where different fault types form distinct clusters. The score exhibits a decreasing trend as the number of cycles increases, suggesting that over time the fault evolves into a higher severity state, leading to degradation impacting all measurements. Consequently, the clusters begin to overlap, making differentiation more challenging.
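The silhouette evaluation can be sketched as follows, with two well-separated toy clusters standing in for the fault-type groups (all values are invented for illustration):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy sensor-wise health indicators for four units, labelled by fault type.
X = np.array([[0.0, 0.1], [0.1, 0.0],    # fault type 0
              [5.0, 5.1], [5.1, 5.0]])   # fault type 1
labels = [0, 0, 1, 1]

score = silhouette_score(X, labels)   # near 1: clusters are well separated
```

As the clusters drift into each other the score drops towards 0, mirroring the decreasing trend observed over cycles.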

Sensor-wise Health Indicator Interpretation
The sensor-wise health indicators h_S are displayed in Figure 7 at 10 cycles after fault detection. The residuals are normalised for each unit. In the case of the OC model, a larger number of sensors have high residuals when a fan fault occurs, compared to HPC or LPT faults. Since the fan is the first component in the engine system, most downstream components are affected when a fault arises in this particular component. For the AE model, interpretation becomes more challenging, as multiple sensors have high residuals simultaneously and there is a higher degree of variation between engines. This highlights an advantage of the OC model: it provides more refined residuals that are easier to relate to the physical system, offering a clearer and more straightforward understanding of deviations from normal behavior.
In Figure 8, we depict the sensors that have detected faults at different cycle numbers using the sensor-wise health indicator of the OC model, h_S-OC. We do not provide the same figure for the AE model due to excessive variation in sensor activation, which complicates interpretation. This figure is relevant to understanding the evolution of the triggered sensors.
A triggered sensor is a sensor whose sensor-wise residual is higher than the pre-defined threshold, as discussed in Section 2.3. Darker colors indicate earlier sensor-wise fault detection. At 10 cycles after the first fault detection, the triggering pattern resembles that shown in Figure 7, wherein health indicators with high values are triggered.
With this figure, it becomes possible to track the evolution of faulty components over time. For instance, in the case of an HPC fault, sensors S2-T30, S3-T48, and S4-T50 are predominantly triggered initially. As the fault progresses, deviations are observed in S13-Nc (physical core speed) and S14-Wf (fuel flow), indicating that the fault begins to impact the burner and the shaft. Towards the end-of-life, sensors S8-P24 (pressure at LPC outlet), S9-Ps30 (pressure at HPC outlet), and S10-P40 (pressure at burner outlet), situated around the burner and HPC, demonstrate further deterioration.
This figure can also be used to differentiate between the HPC fault and the LPT fault. While the sensor S2-T30 is triggered in both faults, it is triggered earlier in the HPC fault than in the LPT fault because S2-T30 measures the total temperature at the HPC outlet. In the case of the HPC fault, the impact on S2-T30 is direct and immediate. In the LPT fault, however, the impact on the S2-T30 reading becomes noticeable only after a longer duration, typically 20 or 30 cycles after fault detection.

CONCLUSION
In this study, we performed a comparative analysis of two residual-based models, specifically the AE and OC models, for the purpose of fault detection. We constructed two types of health indicators from the residuals: a univariate aggregated health indicator and multivariate sensor-wise health indicators. Our framework was applied to three sub-datasets from N-CMAPSS, each presenting a different fault type in a different engine component.
The results demonstrated that the sensor-wise health indicators outperformed the aggregated health indicator in terms of fault detection performance. Furthermore, the health indicators obtained using the OC model exhibited superior fault separation capabilities: they effectively highlighted the triggered sensors and could be directly linked to specific faulty components. This research highlights an alternative residual model that surpasses the commonly used AE models in both fault detection and segmentation. It not only demonstrates superior performance but also provides more meaningful health indicators.
As a future direction, it is essential to evaluate the proposed approaches on other systems that exhibit different fault evolution behavior. Additionally, it is important to consider systems with higher variability in terms of operating conditions and their impact on fault evolution. By testing the approaches across a diverse range of systems, we can ensure their effectiveness and applicability in various real-world scenarios.
Furthermore, the prediction of remaining useful life could be achieved using a two-stage approach, where the prediction begins only after the fault is detected. Another potential avenue involves evaluating the evolution of fault patterns by analyzing factors such as the sequence of sensor triggers. Lastly, it would be beneficial to explore different fault detection architectures, such as recurrent neural networks and variational autoencoders.

Figure 1 .
Figure 1. Overall architecture of the testing framework. The framework includes the residual calculating models, health indicator construction, and the fault detection algorithm. We assess the performance of each health indicator based on detection performance, data visualization, and the interpretive capability associated with the machine's condition.

Figure 2 .
Figure 2. C-MAPSS model schematic representation with the sensor positions within the engine, adapted from (Arias Chao et al., 2021).

Figure 3 .
Figure 3. Example of operating condition descriptors w for the first cycle of unit 1 in DS04, including altitude, flight Mach number, throttle-resolver angle, and total temperature at the fan inlet, downsampled by a factor of 10.

Figure 4 .
Figure 4. Health indicators calculated from the aggregated residuals h_A obtained from the OC and AE models, h_A-OC and h_A-AE, for DS07 unit 7 (top). Aggregated and sensor-wise health indicators, h_A-OC and h_S-OC, obtained from the OC model for the same unit (bottom). The vertical black line indicates the fault initiation at cycle 24.

Figure 5 .
Figure 5. Silhouette scores of h_S-OC and h_S-AE calculated from 0 to 34 cycles after fault detection. Higher silhouette scores suggest more distinct clusters.

Figure 6 .
Figure 6. Visualization of the clustering results using the normalised sensor-wise health indicators with n_c set to 10 cycles after fault detection, color-coded by faulty component.

Figure 7 .
Figure 7. Normalised sensor-wise health indicator h_S values calculated 10 cycles after fault detection. The upper figure depicts results using the OC model (h_S-OC), while the lower figure represents the AE model (h_S-AE). In the AE model, sensors 1 to 14 represent the sensor readings x, while sensors 15 to 18 correspond to the operating condition descriptors w.

Figure 8 .
Figure 8. Sensors that are triggered using h_S-OC with n_c = 10, 20, 30, 40 cycles after the first fault detection. Darker colors indicate earlier sensor triggering. The darkest color signifies triggering within 10 cycles after fault detection. The lightest color, labeled "No", indicates no triggering up to n_0 + 40 cycles.

Table 2 .
Overview of the fault detection delays d_u using the OC and AE models, averaged over five realisations. "-" means no fault is detected. In total, there are 30 different units, 10 from each sub-dataset. *: FPR is 0% but the fault is not detected in one unit.