Fault Prognosis of Turbofan Engines: Eventual Failure Prediction and Remaining Useful Life Estimation

In the era of industrial big data, prognostics and health management is essential for predicting future failures and minimizing inventory, maintenance, and human costs. Used for the 2021 PHM Data Challenge, the new Commercial Modular Aero-Propulsion System Simulation dataset from NASA is an open-source benchmark containing simulated turbofan engine units flown under realistic flight conditions. Deep learning approaches previously implemented for this application attempt to predict the remaining useful life of the engine units but have not utilized the labeled failure mode information, impeding practical usage and explainability. To address these limitations, a new prognostics approach is formulated with a customized loss function to simultaneously predict the current health state, the eventual failing component(s), and the remaining useful life. The proposed method incorporates principal component analysis to orthogonalize statistical time-domain features, which serve as inputs to supervised regressors such as random forests, extreme random forests, XGBoost, and artificial neural networks. The highest-performing algorithm, ANN-Flux, achieves AUROC and AUPR scores exceeding 0.95 for each classification output. In addition, ANN-Flux reduces the remaining useful life RMSE by 38% for the same test split of the dataset compared to past work, with significantly less computational cost.


INTRODUCTION
The field of prognostics and health management (PHM) has attracted recent research attention for large-scale, high-dimensional, and dynamic engineering systems. Typically performed at the component level, the goal of intelligent prognostic approaches is to predict in advance the progression of degradation to facilitate swift and responsible decision-making before catastrophic failure (Lee et al., 2014; Tsui et al., 2015). Typical PHM applications include data-driven fault diagnosis and prognosis of bearing failures (Shao et al., 2018) and gearbox failures (C. Li et al., 2016) utilizing vibration, current, and/or acoustic emission signals. As described by Liao and Köttig (2014), PHM approaches can be separated into different categories: physics-based, expert knowledge-based, or data-driven, with significant potential for hybridization. PHM is essential for the reliable operation of safety-critical systems such as nuclear power plants, where catastrophic failures have devastating consequences and are often difficult to predict due to the lack of historical, labeled failure data (Coble et al., 2015).
Recently, deep learning approaches have seen growing popularity in prognostics research, particularly for estimating the remaining useful life (RUL) of physical assets. Graph neural networks (GNNs) and graph convolutional networks (GCNs) have become attractive for fault diagnosis and prognosis tasks with their ability to handle highly correlated and non-Euclidean applications in PHM (T. Li et al., 2022; Lai et al., 2023; Kong et al., 2022). Advanced techniques leveraging long short-term memory (LSTM) networks and attention mechanisms improved RUL prediction for time series data (Song et al., 2022). Berghout et al. (2022) further illustrated how these architectures can benefit from transfer learning to improve prognostic performance. This paper will focus on the new Commercial Modular Aero-Propulsion System Simulation (N-CMAPSS) dataset, which was featured in the 2021 PHM Data Challenge and centered on accurately estimating the RUL for a small fleet of turbofan engines (Chao et al., 2021b). Openly available from the NASA Prognostics Center of Excellence (PCoE) Data Set Repository (Chao et al., 2021a), N-CMAPSS consists of synthetic run-to-failure trajectories operating under more realistic flight conditions compared to the legacy C-MAPSS dataset used as a popular PHM benchmark (Chao et al., 2021a). The turbofan engines experience multiple failure modes that involve the simultaneous efficiency and/or flow failures of up to 5 rotating subcomponents: fan, low-pressure compressor (LPC), high-pressure compressor (HPC), low-pressure turbine (LPT), and high-pressure turbine (HPT). A schematic representation of a turbofan engine unit is shown in Figure 1.

Figure 1. Turbofan engine schematic, courtesy of NASA Prognostics Center of Excellence (Chao et al., 2021b)
For predicting the RUL of the turbofan engine units, Lövberg (2021) proposed a neural network-based normalization procedure to effectively denoise the sensor measurements with respect to the dynamic flight conditions. After normalization, the input trajectories were passed into a deep convolutional neural network (CNN) with dilated convolutions in an approach that allows for variable input sequence lengths. Lövberg relied on the provided health state label to sample degraded sequences in their RUL prediction model. DeVol et al. (2021) proposed the integration of inception modules within a CNN architecture to handle the variable trajectory lengths. DeVol et al. reported RUL prediction results using NASA's training-testing split in the N-CMAPSS dataset, streamlining reproducibility and benchmarking for this challenge problem. Solís-Martín et al. (2021) approached this problem by stacking two CNNs in sequence: an encoder model first used for dimensionality reduction and feature extraction, and a secondary model used for RUL prediction. Solís-Martín et al. used Bayesian hyperparameter optimization to tune their models and noted that their prediction results could be improved by reducing overfitting. Biggio et al. (2021) utilized deep Gaussian processes (DGPs) to obtain accurate RUL predictions paired with uncertainty estimates on a subset of N-CMAPSS. Additional studies by Chao et al. (2022) and Berghout et al. (2022) highlighted the potential for physics-based hybridization and transfer learning to improve RUL prediction accuracy.
These approaches benefit from the strengths of deep learning; namely, they allow effective feature representations to be learned automatically by CNNs rather than manually crafted. Clever variations of CNNs, such as Lövberg's dilated convolutions and DeVol et al.'s inception modules, have allowed for accurate RUL estimation given varying flight trajectories and input lengths (Lövberg, 2021; DeVol et al., 2021). However, there are key limitations to these approaches that inhibit their potential for practical usage. By focusing solely on RUL prediction, prior methods do not provide a holistic prognosis that predicts the eventual failing component(s). DeVol et al. (2021) mentioned that the resulting RUL predictions lack explainability, and that future work should utilize the labeled failure modes and components provided in the N-CMAPSS dataset to provide a more complete prognosis for turbofan engines.
Our work significantly expands upon past efforts by broadening the research scope to simultaneously predict the RUL as well as the eventual failing component(s). Being able to accurately predict and isolate the reason for failure has important implications for maintenance decision-making, equipping operators with the capability to dispatch the appropriate experts and resources in a timely manner. Such predictive maintenance strategies can enable intelligent inventory optimization (Bousdekis et al., 2017) and reduce reactive maintenance costs, which may account for up to 40% of the overall budget in large industries (Bagavathiappan et al., 2013). Previous work attempting to perform both fault classification and RUL prediction typically carries out these objectives in separate stages: Gupta et al. (2023) devised a method to perform fault classification after RUL prediction, whereas J. Y. Wu et al. (2021) first classified the health state and then performed RUL regression. Additionally, X. Wu and Ye (2016) investigated RUL estimation for solid oxide fuel cells following a separate fault detection stage. To the best of our knowledge, no previous studies have attempted to explicitly optimize RUL regression and component-level classification simultaneously on a dataset as large and high-dimensional as N-CMAPSS. This is the first attempt at a unified model to effectively accomplish fault detection, isolation, and RUL estimation for the N-CMAPSS benchmark dataset.
To maximize applicability to a real-world scenario, we aim to simultaneously predict three meaningful indicators: 1) the current health state; 2) the eventual failing component(s); and 3) the RUL until catastrophic failure. We accomplish these goals by first simplifying the feature extraction process to enable comparisons amongst state-of-the-art machine learning regressors. Then, we derive and optimize a specialized loss function that balances classification and regression objectives. We also examine the impact of important pre-processing steps such as orthogonalization via principal component analysis (PCA). Our main contributions for this research effort are summarized as follows: 1. Expanding the scope of the 2021 PHM Data Challenge to include health state detection and forecasting eventual failures; 2. Deriving a customized loss function to simultaneously optimize classification and regression PHM objectives; and 3. Accurately predicting health state, eventual failures, and RUL with state-of-the-art regression approaches benchmarked against prior work.
In the following sections of the paper, we will detail our proposed methodology, our reproducible results, and provide comparisons to previous work.

METHODS
First, we describe the N-CMAPSS dataset and introduce the input variables and dataset composition in detail. With the expanded goal of predicting the current health state and eventual failing components in addition to RUL, our proposed methodology encompasses both classification and regression objectives. Our method is summarized in three steps: 1) feature extraction; 2) feature normalization and orthogonalization via PCA; and 3) training a supervised machine learning model to obtain the final predictions. Figure 2 provides an overview of the flow of the proposed methodology.

Dataset Description
The N-CMAPSS dataset consists of 8 provided subsets and contains 90 engine units in total. In our research, we combine flow and efficiency failures into one general failure category for each mechanical component. Table 1 provides a summary of the failure modes present in each subset. In the dataset, engine units typically have rated lifetimes between 60 and 100 cycles, with the overall objective being to estimate the RUL until catastrophic failure. Each flight cycle is of variable length and is characterized by 18 time series signals: 4 flight data descriptors W = {W_1, W_2, W_3, W_4} summarizing the dynamic operating conditions, and 14 real-time sensor measurements X_s = {X_s1, X_s2, ..., X_s14}. In addition to the time series signals, each cycle also includes auxiliary variables A = {A_1, A_2, A_3, A_4} useful for understanding the context of a flight cycle: the unit number A_1, the cycle number A_2, a categorical flight class variable F_c representing the length of the flight (set to 1 for short flights, 2 for medium flights, and 3 for long flights), and a binary health state variable h_s (set to 1 for healthy status and 0 for unhealthy status). We note that the simulated engines are flown past unhealthy operation until end of life (i.e., catastrophic failure). Table 2 provides a summary of the variables provided in the dataset. In all, there are a total of 6825 flight cycles in the dataset, with engines averaging approximately 75 cycles per unit.

Feature Extraction
Feature extraction is necessary to reduce the input dimensionality of the dataset. Although there are only 90 turbofan engine units in the N-CMAPSS dataset as per Table 1, the dataset contains over 63 million timestamps and requires reduction for subsequent data processing. As in previous work, we aim to make predictions on a per-cycle basis (DeVol et al., 2021).
In this study, we extract cycle-wide statistical time-domain features to summarize the distribution of each time series. Importantly, we do not use the unit number A_1 and health state A_4 as input features; we opt to learn the health state as an output instead. Altogether, this results in 129 total features extracted from the N-CMAPSS dataset. In more general terms, this feature extraction method is applied for n training cycles, with x_j ∈ R^n representing the vector of samples for the j-th feature. Finally, the feature vectors are concatenated into a single data matrix containing all p features, X ∈ R^{n×p}.
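The paper's exact list of 129 features is not reproduced above, so the sketch below uses common cycle-wide statistics (mean, standard deviation, minimum, maximum, skewness, kurtosis) as stand-ins; the helper names and the six chosen statistics are illustrative assumptions, not the authors' exact feature set.

```python
import numpy as np

def cycle_features(signal: np.ndarray) -> dict:
    """Summarize one time-series signal over a single flight cycle
    with illustrative statistical time-domain features."""
    mu = signal.mean()
    sigma = signal.std()
    centered = signal - mu
    # Guard against constant signals (zero variance) for the higher moments.
    skew = (centered**3).mean() / sigma**3 if sigma > 0 else 0.0
    kurt = (centered**4).mean() / sigma**4 if sigma > 0 else 0.0
    return {"mean": mu, "std": sigma, "min": signal.min(),
            "max": signal.max(), "skew": skew, "kurt": kurt}

def extract_cycle_matrix(cycles: list) -> np.ndarray:
    """Stack per-cycle features for every signal into an (n, p) matrix X.

    Each element of `cycles` maps a signal name (e.g. one of the 18
    W/X_s channels) to its 1-D sample array for that cycle."""
    rows = []
    for cycle in cycles:
        feats = []
        for name in sorted(cycle):  # fixed column order across cycles
            feats.extend(cycle_features(cycle[name]).values())
        rows.append(feats)
    return np.asarray(rows)
```

With 18 signals and 6 statistics each, this toy version would yield 108 columns per cycle; the paper's 129 features presumably include additional descriptors such as the flight class.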

Feature Normalization and PCA Orthogonalization
Features extracted from the time series signals may be of different scales and units. As a result, normalization helps ensure that predictions are not influenced by these differences. First, we apply a min-max normalization scheme to map all features into the bounded range [0, 1], as shown in Eq. (1):

x̃_ij = (x_ij − min_i(x_ij)) / (max_i(x_ij) − min_i(x_ij))    (1)

Once again, we concatenate the feature vectors into a normalized data matrix, X̃ ∈ R^{n×p}. After obtaining the normalized data matrix, PCA orthogonalization is recommended as a multivariate preprocessing step to obtain a set of uncorrelated variables. PCA is typically used to achieve dimension reduction by retaining the most important principal components (PCs) such that the explained variance is maximized (Jolliffe & Cadima, 2016). However, we have found that in practice, there is utility in keeping all PCs to improve training results. This is potentially because the extracted features are significantly correlated, and therefore using PCA simply for its orthogonalization benefits may improve the performance of gradient descent-based optimization methods employed in training. In Section 3, we will compare results with and without PCA orthogonalization for all models. PCA can be formulated as a linear transformation using the eigendecomposition of the sample correlation matrix Q ∈ R^{p×p} of the features from X̃, as shown in Eqs. (2)-(3):

Q = V Λ V^T    (2)
Z = X̃ V    (3)

where the columns of V are the eigenvectors (PC loadings), Λ is the diagonal matrix of eigenvalues, and Z contains the orthogonalized PC scores.
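As a sketch of this preprocessing step, the scikit-learn pipeline below chains min-max scaling with a PCA that retains all components (`n_components=None`), fitted on training data only. Note one simplification: sklearn's `PCA` diagonalizes the covariance of the scaled features rather than the correlation matrix described in the text, and the synthetic data and dimensions are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Placeholder training/testing features (the real pipeline would use the
# 129 extracted cycle features); one correlated column pair is injected.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 10))
X_train[:, 1] = 0.9 * X_train[:, 0] + 0.1 * X_train[:, 1]
X_test = rng.normal(size=(50, 10))

# Fit min-max scaling and full-rank PCA on the training set only, then
# apply the learned transformations to the held-out test set.
pipe = make_pipeline(MinMaxScaler(), PCA(n_components=None))  # keep all PCs
Z_train = pipe.fit_transform(X_train)
Z_test = pipe.transform(X_test)

# After PCA, the training-set scores are mutually uncorrelated.
corr = np.corrcoef(Z_train, rowvar=False)
max_off_diag = np.abs(corr - np.diag(np.diag(corr))).max()
```

Keeping all PCs means no information is discarded; the transformation only rotates the feature space so that the inputs are decorrelated.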

Output Labeling Scheme
While N-CMAPSS provides h_s and failure mode information as possible inputs for a RUL prediction model, we aim to instead predict them as outputs encoded as binary variables. As mentioned previously, these additional outputs will provide a more comprehensive prognosis of the degraded turbofan engine unit. With these new outputs, we require a labeling scheme for training a model. For learning the current cycle health state h_s, we borrow the labels provided in the N-CMAPSS dataset, i.e., a label of "1" for healthy operation and "0" for unhealthy operation. We introduce a vector of possible eventual failures y_EF = [y_Fan, y_LPC, y_HPC, y_HPT, y_LPT]^T, in which each variable y_comp ∈ y_EF is binary, with the positive label indicating eventual failure as specified in Table 1. For example, for the DS06 subset in which the LPC and HPC components eventually fail (even if the engine is presently healthy), y_EF = [0, 1, 1, 0, 0]^T. Lastly, for the RUL training label, we follow the N-CMAPSS convention, which provides RUL ∈ Z≥0 as calculated by subtracting the current cycle number from the total lifetime of the engine unit, i.e., RUL = t_EOL − A_2. With these definitions, we can prepare the ground truth vector of labels y = [h_s, y_EF^T, RUL]^T paired with the features of each cycle.
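A minimal sketch of this labeling scheme follows; the helper name `make_label` and its signature are our own illustration, not part of the dataset tooling.

```python
import numpy as np

# Component order matches the vector y_EF = [Fan, LPC, HPC, HPT, LPT]^T.
COMPONENTS = ("Fan", "LPC", "HPC", "HPT", "LPT")

def make_label(h_s: int, failing: set, cycle: int, t_eol: int) -> np.ndarray:
    """Build the ground-truth vector y = [h_s, y_EF (5 binary flags), RUL]."""
    y_ef = [1 if c in failing else 0 for c in COMPONENTS]
    rul = t_eol - cycle  # N-CMAPSS convention: total lifetime minus cycle number
    return np.array([h_s, *y_ef, rul], dtype=float)

# Example: a DS06-style unit where LPC and HPC eventually fail,
# currently healthy at cycle 20 of an 80-cycle lifetime.
y = make_label(h_s=1, failing={"LPC", "HPC"}, cycle=20, t_eol=80)
```

Here `y` encodes a presently healthy engine (leading 1) whose LPC and HPC are nonetheless labeled as eventual failures, with 60 cycles of remaining life.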

Training Loss Function and Model Evaluation
Handling classification and regression objectives simultaneously adds complexity to training a predictive machine learning model. We propose optimizing a customized loss function that explicitly combines both objectives. First, we base the RUL loss contribution on NASA's scoring criteria (Chao et al., 2021b), which penalizes overestimation of RUL to favor conservative predictions and is defined in Eqs. (4)-(6):

s_i = exp(α |ΔRUL_i|) − 1,  ΔRUL_i = \widehat{RUL}_i − RUL_i    (4)
α = 1/13 if ΔRUL_i ≤ 0; α = 1/10 if ΔRUL_i > 0    (5)
L_NASA = (1/n) Σ_{i=1}^{n} s_i    (6)

in which α is the overestimation penalty, equal to 1/13 if the RUL is underestimated (i.e., \widehat{RUL}_i < RUL_i) and equal to 1/10 for overestimations. We note that this makes the loss function non-differentiable. However, modern automatic differentiation packages are able to estimate gradients via subgradients and are more flexible for handling unconventional loss functions (Innes et al., 2019). In our work, we substitute the values of α directly and rewrite Eq. (4) as follows using an indicator function, which allows for easier implementation within automatic differentiation coding environments:

s_i = exp{ [1/13 + (1/10 − 1/13) · 1(ΔRUL_i > 0)] · |ΔRUL_i| } − 1    (7)

This alternative formulation essentially "upgrades" the α penalty from a base value of 1/13 to 1/10 when RUL is overestimated.
In addition to the loss in Eq. (6) for capturing RUL errors, we further incorporate the binary cross-entropy loss L_BCE to reflect the errors on the q classification outputs. Together, we obtain an overall loss function through a weighted sum of the terms introduced above:

L = L_NASA + λ L_BCE    (9)

where λ is a tunable scalar weight coefficient. Since the NASA score is based on RUL regression, this term tends to carry a larger magnitude compared to the BCE. In order to achieve a more balanced contribution of the regression and classification parts, we set λ = 10 for our test cases, although other values can be selected to reflect the relative importance of the regression and classification tasks under the particular PHM situation being considered.
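The combined loss can be sketched as follows. The exponential form of the asymmetric RUL penalty mirrors the classic C-MAPSS scoring function together with the α values quoted above, but the exact expressions may differ in detail from the paper's equations, so treat this as an illustrative assumption rather than the authors' Julia/Flux implementation.

```python
import numpy as np

def nasa_rul_loss(rul_pred, rul_true):
    """Asymmetric RUL penalty: the exponent coefficient is "upgraded"
    from 1/13 to 1/10 when the RUL is overestimated (pred > true)."""
    d = np.asarray(rul_pred, float) - np.asarray(rul_true, float)
    alpha = 1.0 / 13.0 + (1.0 / 10.0 - 1.0 / 13.0) * (d > 0)  # indicator form
    return np.mean(np.exp(alpha * np.abs(d)) - 1.0)

def bce(p, y, eps=1e-12):
    """Binary cross-entropy over the classification outputs."""
    p = np.clip(np.asarray(p, float), eps, 1.0 - eps)
    y = np.asarray(y, float)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def combined_loss(rul_pred, rul_true, cls_pred, cls_true, lam=10.0):
    """Weighted sum of the RUL penalty and BCE terms (lambda = 10)."""
    return nasa_rul_loss(rul_pred, rul_true) + lam * bce(cls_pred, cls_true)
```

Because the indicator only switches a constant coefficient, subgradient-based automatic differentiation handles the kink at zero error without special treatment.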
Additionally, because benchmarking comparisons between multiple machine learning regressors have not yet been provided in previous work on the N-CMAPSS dataset (DeVol et al., 2021; Lövberg, 2021; Solís-Martín et al., 2021), we compare the performance of four state-of-the-art machine learning methods: random forests (RFs) (Genuer et al., 2017), extreme random forests (ERFs, also known as extra trees) (Maier et al., 2015), XGBoost (XGB) (Chen & Guestrin, 2016), and artificial neural networks (ANNs) (Mahamad et al., 2010). Specifically, we compare the performance of the tree-based ensemble regressors trained to minimize mean squared error (MSE) against an ANN that minimizes our proposed loss function in Eq. (9). For completeness, we also compare these results to an ANN minimizing MSE. To evaluate the quality of the classification predictions, we use the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR), which summarize performance across all possible thresholds. Meanwhile, root-mean-square error (RMSE), the NASA scoring function detailed in Eq. (6), mean absolute error (MAE), and MAE normalized as a percentage of the unit's lifetime are reported for judging the quality of the RUL predictions.
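These evaluation metrics map directly onto standard scikit-learn calls; the toy arrays below are placeholders standing in for a trained model's test-set outputs.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, mean_absolute_error,
                             roc_auc_score)

# Toy predicted scores for one binary output plus RUL predictions.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.9, 0.8, 0.7, 0.2])
rul_true = np.array([50.0, 40.0, 10.0, 5.0, 2.0, 60.0])
rul_pred = np.array([48.0, 43.0, 12.0, 4.0, 3.0, 55.0])

# Threshold-free classification quality: ranking (AUROC) and
# precision-recall (AUPR) summaries.
auroc = roc_auc_score(y_true, y_score)
aupr = average_precision_score(y_true, y_score)

# RUL regression quality: RMSE and MAE over the test cycles.
rmse = float(np.sqrt(np.mean((rul_pred - rul_true) ** 2)))
mae = mean_absolute_error(rul_true, rul_pred)
```

In the paper, each of the six classification outputs (health state plus five components) is scored this way, alongside the RUL regression metrics.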

EXPERIMENTAL SETUP
For reproducibility, results are reported using the built-in N-CMAPSS dataset split, as in past work by DeVol et al. (2021). This dataset split is notable for having a testing set with entire engine units that are unseen in the training set, making the benchmark problem more realistic and challenging. The split follows an approximately 60%-40% training-testing ratio, with 4089 cycles in the training set and 2736 cycles in the test set. The training set is further divided with an 80%-20% training-validation split, with a randomly selected holdout validation set used for early stopping and hyperparameter tuning. Following the feature extraction method detailed in Section 2.2, we employ 129 features. Using the min-max normalization and PCA orthogonalization methods from the popular scikit-learn package (Pedregosa et al., 2012), we normalize the training set and apply these learned transformations to the testing set.
To prepare the regressors minimizing the MSE loss function, it is necessary to scale the labels such that the RUL regression error does not dominate the MSE calculation. This is done by simply multiplying the binary encoded labels in y by 100, thereby putting the binary labels on the same magnitude as the RUL labels. After training the models, the AUROC and AUPR metrics are then computed using the resulting classification predictions on the test set in ŷ to serve as robust indicators of performance averaged across all possible thresholds. Table 3 shows the classification and regression results for RF, ERF, XGB, an ANN also trained on MSE (ANN-MSE), and an ANN trained on the custom loss function derived in Section 2.5 (ANN-Flux). The average and standard deviation of the performance on the test set are reported after repeating with 4 randomized validation sets. Results with and without the PCA orthogonalization step are also included, demonstrating the impact of the preprocessing procedure on minimally tuned models. The RF and ERF regressors, implemented using scikit-learn, each contain 100 base estimators. The XGBoost method also uses 100 estimators, with the learning rate η set to 0.3 and the maximum tree depth set to 6.
Both ANNs share the same shallow architecture of two hidden layers with 64 and 32 neurons, respectively, each with ReLU activations, employing the Adam optimizer and trained for 5000 epochs with a batch size of 256. All methods are implemented using the Julia 1.8.5 programming language (Bezanson et al., 2014), and both ANNs are designed using the Flux deep learning backend, which allows for automatic differentiation of custom loss functions (Innes, 2018; Innes et al., 2019). The results in Table 3 may be further improved with a more thorough hyperparameter optimization and merely illustrate the potential for the simultaneous prediction of eventual failures alongside RUL.
Benchmarked on a local MacBook Pro running macOS Ventura 13.2.1 with an Apple M2 Max CPU and 32 GB of RAM, it takes approximately 60 seconds in total to train and evaluate the RF, ERF, and XGB methods. On the same processor, the Flux models take approximately 250 seconds to train. Feature extraction is the longest step in terms of runtime, taking around 300 seconds to load the dataset and extract all 129 features for the training and testing sets.

RESULTS
The ANN-Flux method with the PCA preprocessing step accurately predicts the current health state and the eventual failing component(s), with AUROC and AUPR scores exceeding 0.95 for each output. This is especially notable considering the significant overlap of the failing components across the failure modes (see Table 1). In addition, the RUL prediction also outperforms the other techniques tested. The parity plot in Figure 3a) visualizes the ANN-Flux RUL predictions on the testing set versus the ground truth labels. Producing a figure similar to DeVol et al. (2021), Figure 3b) also illustrates the ANN-Flux RUL predictions with the ground truth RUL labels sorted from least to greatest.
We note that these RUL predictions are directly output from the ANN-Flux model, and further considerations may improve their quality and usefulness in practice. For example, despite the asymmetric NASA scoring function favoring conservative underestimates, the average prediction error still slightly overestimates the ground truth by 0.65 cycles. This contrasts with the training prediction error, which on average underestimates the ground truth by 0.50 cycles. Further adjustments to the overestimation penalty α and the classification loss weight λ may skew the prediction error towards underestimation. In addition, we have not instituted hard constraints to guarantee nonnegative RUL values.
Post-processing transformations such as the ReLU function can be implemented in the future to rectify the outputs such that all resulting RUL predictions are nonnegative.
Notably, the PCA orthogonalization pre-processing step has a profound impact on classification performance for the eventual failure of the mechanical components. These findings are consistent among all attempted machine learning methods. However, PCA orthogonalization did not appear to improve the regression performance in the same way; 3 out of the 5 attempted methods had increased RMSE when inputs were orthogonalized. This suggests that using PCA to orthogonalize these extracted features is especially useful for binary classification predictions, but may not always lead to better results for minimizing RUL error.
Having additional classification outputs enables explainable analysis of RUL predictions along various slices of the dataset. For example, Figure 4 illustrates the RUL prediction errors for unhealthy versus healthy cycles. Intuitively, the interquartile range for unhealthy operating cycles is substantially narrower, indicating that RUL predictions on average improve throughout the life of the engine unit.
It is also useful to determine whether certain components exhibit higher variance in RUL prediction errors; by observing the RUL prediction errors on a per-component basis, operators can glean more information and make targeted decisions based on their confidence in the prognosis. Similar to Figure 4, Figure 5 plots the RUL prediction error spread of the test set for each of the labeled eventual mechanical component failures. Figure 5 demonstrates that the RUL prediction errors have a median centered near 0 for each mechanical component, and no significant component-based bias is identified. In relative terms, the compressor failures have a tighter concentration around 0 while the turbine failures are more negatively skewed, indicating more underestimates, but we note that it is difficult to draw definitive conclusions due to overlapping failures.

DISCUSSION
Our findings have broad economic implications beyond engine prognostics, as a similar approach could potentially be applied to other PHM applications. Our approach is enabled by expanding the formulation of the 2021 PHM Data Challenge to simultaneously include classification and regression objectives, taking full advantage of the provided labels in the N-CMAPSS dataset. However, we note that our work targets predicting eventual failures broken down to the detailed component level rather than the 7 pre-defined higher-level failure modes. Predicting component-level failure is a more challenging generalization of predicting failure modes, since the solution space is expanded to allow all 2^5 = 32 possible combinations of failing components, even though the dataset is only sparsely populated by the 7 pre-defined failure modes. The benefit of this generalization is that a successful model would learn how the same components can fail with dramatically different failure data, which has practical implications for maintenance decision-making and inventory costs. Previous research on this dataset also utilized the labeled health state as an input to improve RUL predictions (Lövberg, 2021); we have relaxed this assumption by instead learning the health state as an additional output.

Table 3. Classification and regression results for the N-CMAPSS dataset for ensemble methods, with the PCA-orthogonalized XGBoost method generally outperforming the other benchmark methods
The computational effort of our approach compared to past work is also noteworthy.

Method                     Model parameters    RUL RMSE (cycles)
(Biggio et al., 2021)      169,401             7.31
(Song et al., 2022)        N/A                 6.867
(Berghout et al., 2022)    N/A                 5.64
ANN-Flux + PCA             10,631              8.20 ± 0.17

To the authors' knowledge, our work is also the first to compare multiple state-of-the-art regression approaches for predicting component failures and estimating RUL on the N-CMAPSS benchmark. We also provide comparisons with and without PCA orthogonalization and across multiple loss functions, with tangible improvements for both RUL prediction and the binary classifications. We hope that our contributions encourage future benchmarking efforts.

Despite these advancements, important limitations remain to be addressed in future work. Firstly, while RUL prediction is comparable with past work, the prediction errors still have a large variance. Integration with physical modeling is suggested in the future to improve the confidence of RUL predictions. Moreover, failure data are difficult to obtain in practice, and as a result, industrial datasets are often imbalanced (Santos et al., 2018), threatening the utility of fully supervised learning techniques. Consequently, more research is required on semi-supervised and unsupervised methods to lighten the supervision requirements for AI algorithms to provide accurate prognoses. In addition, while PCA orthogonalization vastly improved the component failure predictions, the derived PC variables lack physical meaning, hindering the explainability of the input features. This step makes the current formulation incompatible with explainable AI (XAI) methods such as SHAP, which attempt to explain black-box model predictions in terms of additive marginal contributions of features (Senoner et al., 2022). While the ability to accurately isolate eventual failures at the component level provides inherent explainability compared to previous efforts, we leave XAI integration for future work.

CONCLUSION
Our work, as benchmarked on the N-CMAPSS dataset, uniquely demonstrates the potential for an approach that simultaneously detects the current health state, predicts which component(s) will fail, and then estimates the number of cycles until failure. In essence, this integrates the important disciplines of anomaly detection and fault diagnosis (conventionally requiring multiple models) into one prognostic model that makes accurate predictions, even for presently healthy units. Our main contributions and findings for this research effort are restated as follows: 1. Reformulated and expanded the scope of the 2021 PHM Data Challenge to include health state detection and eventual failure prognosis; 2. Derived a customized loss function to simultaneously combine classification and regression objectives; 3. Accurately predicted the health state and eventual failures, with AUROC and AUPR exceeding 0.94 on average for each classification prediction accomplished with the ANN-Flux methodology; and 4. Achieved comparable RUL RMSE for the same dataset split with less computational effort required for training when benchmarked against prior work.