A stacked deep convolutional neural network to predict the remaining useful life of a turbofan engine

This paper presents the data-driven techniques and methodologies used to predict the remaining useful life (RUL) of a fleet of aircraft engines that can suffer failures of diverse nature. The solution presented is based on two Deep Convolutional Neural Networks (DCNN) stacked in two levels. The first DCNN extracts a low-dimensional feature vector from the normalized raw data. The second DCNN ingests a list of vectors produced by the first DCNN and estimates the RUL. Model selection was carried out by means of Bayesian optimization using a repeated random subsampling validation approach. The proposed methodology ranked third in the 2021 PHM Conference Data Challenge.


INTRODUCTION
Remaining useful life (RUL) estimation is a cross-disciplinary field that belongs to the wider field of Prognostics and Health Management (PHM). This discipline studies the behavior of a system during its lifetime, that is, from the last check or maintenance performed on it until the system fails or the degradation of its performance exceeds a certain threshold. RUL estimation therefore makes it possible to assess future reliability and to track the degradation of the system over time (Peng, Wang, & Zi, 2018). The final goal is to estimate the remaining time until such a failure occurs or a degradation level is reached, given the system's working condition at any point of its lifetime.
Two main methodologies for RUL estimation can be found in the literature: model-based methods and data-driven methods. Model-based methods establish a degradation model based on physics and statistics to predict the degradation trend of the system. Model-based methods require strong knowledge of the physical behavior of each specific component and failure type, so it is difficult to develop models for today's increasingly complex systems (Z. Zhao, Liang, Wang, & Lu, 2017). Moreover, these models rely on assumptions that may bias them. For all these reasons, model-based methods have a limited prediction performance and have to be designed ad hoc for each different type of machinery.

David Solís-Martín et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
On the other hand, data-driven methods use machine learning algorithms to learn the degradation trend of the system from historical condition monitoring data. In contrast to model-based methods, no prior expertise is required, and their application can be straightforward when enough high-quality data are available.
Among the existing data-driven methods, Deep Learning (DL) is one of the most popular and promising fields of study. During the last decade, the use of DL techniques has surged significantly, especially in complex tasks with high-dimensional nonlinear data. DL has had enormous success in image processing, natural language processing, and signal processing. For this reason, it is not unexpected that DL-based approaches are also widespread within the context of Prognostics and Health Management (PHM) research. One type of network that has been used to deal with sequence data is the deep convolutional neural network (DCNN) (LeCun, Haffner, Bottou, & Bengio, 1999). For example, in (B. Zhao, Lu, Chen, Liu, & Wu, 2017) the authors applied DCNNs to time series classification over a variety of datasets. Within the context of PHM, the authors of (Babu, Zhao, & Li, 2016) and (X. Li, Ding, & Sun, 2018) applied DCNNs to RUL prediction of aircraft turbofan engines.

DATA AND PROBLEM DESCRIPTION
The Commercial Modular Aero-Propulsion System Simulation (CMAPSS) is a modeling software developed at NASA. It was used to create the previous CMAPSS dataset (Saxena, Goebel, Simon, & Eklund, 2008). That first dataset was built by only taking samples from a system that is already degraded. "Therefore, the onset of the fault cannot be predicted; only the evolution of the fault can" (Arias Chao, Kulkarni, Goebel, & Fink, 2021).
A new dataset, named N-CMAPSS, has been created that provides the full history of the trajectories, starting from a healthy condition until the failure occurs. A schematic of the turbofan model used in the simulations is shown in Figure 1. All rotating components of the engine (fan, LPC, HPC, LPT, and HPT) can be affected by the degradation process. The model is defined as a nonlinear system denoted by:

[x_s(t), x_v(t)] = F(w(t), θ(t))

where x_s are the physical properties, x_v are the unobserved properties, w is the scenario-descriptor operating conditions, and θ are the health model parameters.

Data
The dataset provided for the challenge consists of data from 90 simulated flights (extracted from N-CMAPSS (Arias Chao et al., 2021)). Seven different failure modes, related to flow degradation or subcomponent efficiency, have been defined; each can be present in any flight. The flights are divided into three classes depending on their length: class 1 contains flights with a duration from 1 to 3 hours, class 2 is composed of flights between 3 and 5 hours, and class 3 contains flights that take more than 5 hours. Each flight is divided into cycles, covering climb, cruise, and descent operations.

Problem definition
The problem revolves around the development of a model G capable of predicting the remaining useful life Y of the system, where y and ŷ = G(x_s, w, a) are, respectively, the expected and estimated RUL. The quality of the estimation is measured by a scoring function S, defined as the average of the root-mean-square error (RMSE) and NASA's scoring function N_s (Saxena et al., 2008):

S = (RMSE + N_s) / 2

RMSE = sqrt( (1/M) Σ_m (Y_m − Ŷ_m)² )

N_s = (1/M) Σ_m ( exp(α |Y_m − Ŷ_m|) − 1 )

where M is the number of samples and α is equal to 1/13 in case Ŷ < Y and 1/10 otherwise. (The N-CMAPSS dataset can be downloaded from the NASA data repository.)
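As a concrete reference, the scoring described above can be sketched in NumPy as follows (a minimal sketch; the function names are ours, and the exponential form of N_s follows Saxena et al., 2008):

```python
import numpy as np

def nasa_score(y_true, y_pred):
    """NASA scoring function N_s: an asymmetric exponential penalty that
    punishes late (over-) predictions more harshly than early ones."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    # alpha = 1/13 for early predictions (y_pred < y_true), 1/10 for late ones
    alpha = np.where(d < 0, 1.0 / 13.0, 1.0 / 10.0)
    return float(np.mean(np.exp(alpha * np.abs(d)) - 1.0))

def challenge_score(y_true, y_pred):
    """Challenge metric S: average of RMSE and N_s."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return 0.5 * (rmse + nasa_score(y_true, y_pred))
```

Note the asymmetry: overestimating the RUL (a late maintenance warning) is penalized more than underestimating it by the same margin.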

PROPOSED METHODOLOGY
This section introduces the different phases or processes in which the proposed methodology can be decomposed.

Data normalization
The 20 variables used to develop the different models (see Table 1) have different scales. For that reason, directly feeding the proposed networks with such data would slow down the learning and convergence of the models. Hence, a data normalization step is required before the training stage to homogenize the variables into a common scale. More precisely, standard normalization is used in this paper:

x'_f = (x_f − µ_f) / σ_f

where x_f is the data of a feature f, and µ_f and σ_f are its mean and standard deviation, respectively. Note that the mean and standard deviation are computed on each training cross-validation set and are also used to normalize the corresponding validation sets.
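The normalization step can be sketched as follows (function names are ours; as described above, the statistics are fit on the training split only and reused for the validation split):

```python
import numpy as np

def fit_stats(X_train):
    """Per-feature mean and standard deviation, computed on the training
    cross-validation set only, to avoid leaking validation statistics."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def standardize(X, mu, sigma):
    """Apply standard normalization x' = (x - mu) / sigma feature-wise."""
    return (X - mu) / sigma
```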

Time window processing
After data normalization, the inputs of the network are generated using a sliding time window over the normalized data. The size of the window, denoted L_w, is a parameter to be selected during the model selection stage. Each input can be expressed as a window of L_w consecutive samples ending at time t (see Figure 2), and its paired ground-RUL label is Y_t. With this approach, T_k − L_w samples are generated for each unit k, where T_k is the total run time in seconds of the unit.
It is common to set a constant RUL at the beginning of the experiment (H. Li, Zhao, Zhang, & Zio, 2020), assuming a healthy condition of the system during that early stage. This approach has not been followed in this work, since such assumptions can introduce a bias in the model. Instead, the ground-RUL label has been set as a linear function of the cycles elapsed:

Y_t = TUL_k − C_t^k

where TUL_k is the total useful life of unit k in cycles and C_t^k is the number of cycles elapsed from the beginning of the experiment at time t.
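The windowing and labeling described above can be sketched as follows (the function names and exact window indexing are our assumptions):

```python
import numpy as np

def make_windows(X, Lw):
    """Slide a window of length Lw over a unit's (T, F) signal matrix,
    producing T - Lw inputs of shape (Lw, F), matching the paper's count."""
    return np.stack([X[t:t + Lw] for t in range(len(X) - Lw)])

def linear_rul(total_useful_life_cycles, cycles_elapsed):
    """Ground-truth label Y_t = TUL_k - C_t^k: a linear ramp with no
    constant early-life plateau."""
    return total_useful_life_cycles - cycles_elapsed
```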

Level 1 and level 2 model: Convolutional Neural Networks
The modeling phase consists of two levels (see Figure 3). In the first level, the goal is to find a good model that produces a time-windowed encoding of the raw input signals. Besides reducing the dimension of the input, this encoding step aims to remove as much noise as possible. The encoding is then used as the input of the level 2 model, whose goal is to provide an estimation of the RUL.
For both levels, Deep Convolutional Neural Networks (DCNN) have been selected. A DCNN is a specialization of fully connected networks (FCN) that has been widely and successfully applied in image processing, natural language processing, and speech recognition. A CNN uses parameter sharing and subsampling to extract feature maps containing the most significant local features. The main operations in a DCNN are convolution and pooling. The convolution operation implements parameter sharing and local receptive fields:

S(i, j) = Σ_n Σ_m I(i + d·n, j + d·m) · K(n, m)

where I is an input matrix, K is the kernel matrix (the convolution's parameters) of size n × m, d is the dilation rate, and S is the result of the convolution, called a feature map. The kernel matrix is slid across the input matrix, looking for a pattern at any location. Thereby, compared with an FCN, the number of weights is reduced and, with it, the chance of overfitting. The pooling operation performs a downscaling by applying a statistic to each region of the input, which is previously divided into rectangular pooling regions. Pooling serves several purposes: it reduces the computational requirements of the upper layers, since the feature maps are downscaled; it reduces the number of connections (parameters) to the upper fully connected layers; it provides a form of spatial transformation invariance; and it helps mitigate the risk of overfitting.
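A minimal NumPy sketch of the two operations, assuming "valid" padding and the cross-correlation form commonly implemented in DL frameworks (function names are ours):

```python
import numpy as np

def dilated_conv2d(I, K, d=1):
    """'Valid' 2-D convolution with dilation rate d:
    S(i, j) = sum_n sum_m I(i + d*n, j + d*m) * K(n, m)."""
    n, m = K.shape
    H = I.shape[0] - d * (n - 1)
    W = I.shape[1] - d * (m - 1)
    S = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # Strided slice picks the dilated receptive field of K at (i, j)
            S[i, j] = np.sum(I[i:i + d * n:d, j:j + d * m:d] * K)
    return S

def max_pool2d(S, p=2):
    """Non-overlapping p x p max pooling, cropping any remainder."""
    H, W = (S.shape[0] // p) * p, (S.shape[1] // p) * p
    return S[:H, :W].reshape(H // p, p, W // p, p).max(axis=(1, 3))
```

With d = 1 this reduces to an ordinary convolution; larger d widens the receptive field without adding parameters.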

Cross validation
Neural networks have a large number of hyperparameters to be adjusted and, additionally, the size of the sliding window has to be optimized. To deal with overfitting during the model selection phase, several validation techniques can be found in the literature. The classical single hold-out strategy is discarded, since it can easily overfit the hold-out set and lead to poor generalization. k-fold cross-validation has the disadvantage that the size of each validation fold decreases as k (the number of folds) increases. Alternatives such as k-fold repeated random subsampling validation overcome this issue (see Figure 4). The latter strategy has been selected in this work, with k = 5 and a validation fold size of 30% of the training set, that is, 27 random units from a total of 90.

Model Selection
The goal of the model selection phase is to obtain an optimal parameter configuration for each model. To this aim, Bayesian optimization has been selected as the optimization strategy. The models were trained using the RMSE as the loss function, since the NASA score is not differentiable. However, the loss function used by the Bayesian optimization to decide the next set of model parameters to be tested is the score S, defined in equation (3).
The well-known early stopping method, with a patience of 8 epochs, is used to decide when to finish the model training process. Hence, training is stopped if the loss on the validation fold has not improved during the last 8 training epochs. Additionally, the learning rate is decreased by a factor of 0.1 when no improvement in the validation loss is seen over the last 3 training epochs.
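The interaction of the two schedules can be illustrated by replaying a sequence of per-epoch validation losses through them (a pure-Python sketch; in practice these rules map onto standard framework callbacks such as Keras's EarlyStopping and ReduceLROnPlateau):

```python
def train_with_callbacks(val_losses, lr=1e-3, es_patience=8,
                         lr_patience=3, lr_factor=0.1):
    """Replay per-epoch validation losses through the two stopping rules:
    reduce-on-plateau (factor 0.1, patience 3) and early stopping
    (patience 8). Returns the best epoch and the final learning rate."""
    best, best_epoch = float("inf"), -1
    since_best, since_lr_drop = 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
            since_best, since_lr_drop = 0, 0
        else:
            since_best += 1
            since_lr_drop += 1
        if since_lr_drop >= lr_patience:
            lr *= lr_factor      # decay LR after 3 stagnant epochs
            since_lr_drop = 0
        if since_best >= es_patience:
            break                # stop after 8 stagnant epochs
    return best_epoch, lr
```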

Figure 5. Level 1 network architectures tested (blocks of 2D convolution, batch normalization, activation function (ReLU), 2×2 pooling, and dropout, followed by a fully connected output layer).

EXPERIMENTS AND RESULTS
This section describes the settings and parameters used to find the best models. Then the results of each level are compared and the final solution is described. Finally, this approach is applied to the hidden test data.

Level 1: encoding
The encoding model at level 1 of the stacking aims to carry out a dimension and noise reduction on the raw signals. To achieve this task, a DCNN architecture was used. The classical DCNN architecture, shown in Figure 5, can be divided into two parts. The first is a stack of N_b blocks composed of convolution and pooling layers, whose goal is to extract features potentially useful for the task. The second part consists of fully connected layers, which in this case perform the regression of the RUL.
To obtain an optimal hyperparameter configuration, Bayesian optimization was run for 100 iterations. The first 10 iterations consist of models with randomly selected hyperparameters. These first 10 random points of the hyperparameter search space are used by the Bayesian process to build the initial estimate of the error surface. After these 10 iterations, the optimization process applies Bayesian rules to select the most promising point (model hyperparameters) to be tested. Table 2 summarizes the input parameter ranges and the best parameter configuration found for the level 1 model. Figure 7 (top) shows the predictions for 4 units made by the DCNN model. The confidence intervals have been computed as [µ_p − 3σ_p, µ_p + 3σ_p], where µ_p and σ_p are the mean and standard deviation, respectively, of the cross-validation predictions. As can be observed in the charts, the models are more reliable when the failure occurs earlier. In the first cycles, the prediction is almost constant, until the model is able to catch the descending trend.
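The interval construction can be sketched as follows (the function name is ours; the input is the stack of per-fold predictions for one unit):

```python
import numpy as np

def cv_confidence_interval(fold_predictions):
    """3-sigma band [mu - 3*sigma, mu + 3*sigma] from the k per-fold
    predictions at each time step. fold_predictions: shape (k, T)."""
    P = np.asarray(fold_predictions, dtype=float)
    mu = P.mean(axis=0)
    sigma = P.std(axis=0)
    return mu - 3 * sigma, mu, mu + 3 * sigma
```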

Level 2: RUL estimation
The aim of the level 2 model of the stacking is to perform the final RUL predictions. The input of this model is the encoding of the raw signal generated by one of the level 1 models. Therefore, a first step is needed to generate the encodings for the train and test sets of each fold. The encodings are extracted from the second fully connected layer of 100 neurons.
Each sample input for each candidate level 2 model is generated using a sliding window over the encodings. Since the encodings are a dimension-reduced version of the raw signal, the level 2 model can expand its receptive field and learn from the trend over time. The parameter step defines the gap in seconds between encodings. Figure 6 shows how the inputs for the models are obtained. The concept of an image channel is used to compose the input: each channel has 100 encoding vectors, so the total number of encodings in the input is 100 · channels.
The level 2 model is also a DCNN. The cross-validation schema and the parameter ranges to be optimized are the same considered in the level 1 model. Additionally, three more parameters need to be considered in this L2 model optimization, namely fc2, step, and channels (see Table 2).

Figure 6. Sliding window for L2 models. Left: window over the raw signals to generate the encodings e_i. Right: encoding input shape for the DCNN.

Table 2 summarizes the best parameter configuration found for the level 2 model, which is quite similar to that of level 1. This could be because the level 1 hyperparameters are provided to the optimization process as a seed. Differences exist only in the training and regularization parameters; the rest of the architecture is identical. Regarding the additional parameters, it is interesting to note that the value selected for the parameter step is 989. Since the level 1 model already provides a fairly good estimation of the RUL, the level 2 model only needs a set of sparse encodings to capture the trend and improve the RUL. The cross-validation score of this model is 3.44, while the ensemble score (that is, taking the average of the predictions before computing the score) is 2.95. This gap between the cross-validation and ensemble scores indicates that the predictions of the individual models are largely uncorrelated.
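One plausible reading of the input composition in Figure 6 can be sketched as follows (the exact indexing and channel ordering are our assumptions, not confirmed by the source):

```python
import numpy as np

def level2_input(encodings, t, step, channels, enc_per_channel=100):
    """Compose one level-2 input ending at time index t: take
    enc_per_channel * channels encodings spaced `step` positions apart,
    then arrange them as `channels` blocks of enc_per_channel vectors.
    encodings: (T, D) array of level-1 encoding vectors."""
    total = enc_per_channel * channels
    idx = [t - i * step for i in range(total)][::-1]  # oldest to newest
    window = encodings[idx]                           # (total, D)
    return window.reshape(channels, enc_per_channel, -1)
```

With the selected step of 989, such a window spans a long stretch of the unit's history from only a sparse set of encodings.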
Figure 7 (bottom) shows the predictions and confidence intervals of the level 2 models. Comparing these with the level 1 models (Figure 7, top), one can see that the predictions of the stacked models have been smoothed and improved. Figure 8 presents the relation between the ground-truth RUL and the score obtained by each stacked model. Finally, Figure 9 shows the performance of each model in each class. Note that the shorter the flights, the better the model performs. Likewise, the improvement in the level 2 model performance is higher for shorter flights. Figure 8. The chart shows the relation between RUL and the score (top), the prediction and the score (middle), and the prediction and RUL (bottom).

RUL prediction on test data
A common approach to obtain the final model consists of training it using all available data. In this approach, a number of training epochs must be selected, since the validation data used during model selection would also be part of the final training set. To select the number of training epochs, the mean of the best epochs obtained across the folds is considered. Usually, it would also be necessary to hold out part of the training set as a final test set to validate the performance of the model, which reduces the number of samples in the training/validation set. A different approach has been used in the proposed methodology, which consists of using all the models trained per fold as an ensemble. This approach has two main advantages. The first is that it is possible to obtain a confidence metric of the final ensemble model's performance by means of cross validation. The second, related to the first, is that the per-sample predictions can be used to create a confidence interval around the final mean prediction, as shown in Figures 7 and 8.
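The fold-ensemble prediction can be sketched as follows (the function name is ours; fold_models stands for the k models trained during cross-validation, each mapping inputs to RUL estimates):

```python
import numpy as np

def ensemble_predict(fold_models, X):
    """Average the predictions of the k per-fold models; the spread
    across folds doubles as an uncertainty estimate."""
    preds = np.stack([m(X) for m in fold_models])  # (k, n_samples)
    return preds.mean(axis=0), preds.std(axis=0)
```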

CONCLUSIONS
This work has presented a robust methodology to estimate RUL without the need for expert knowledge of the system under study. The success of this methodology has been demonstrated by the results obtained in the 2021 PHM Conference Data Challenge (ranked in third place). The goal of the challenge was to predict the RUL of turbofan aircraft engines, using data generated with a CMAPSS simulator.

Figure 9. Performance of the models per class.
In this work, the process of obtaining the final solution is divided into two learning stages. In the first, an encoding of the raw data is learned; it is then used as the input of the second learning stage to obtain the final model capable of estimating RUL. The final results show that using one of the most basic architectures found in the literature is enough to achieve an excellent outcome.
A number of possible improvements to the current solution will be the target of future research: 1) Smoothing the RUL among cycles could smooth the error space, thus helping the learning process. 2) In the proposed methodology, inputs having fewer cycles than required by the network were excluded from the training process; in the validation set, these inputs were filled with the earliest encoding. Training the networks with the same filling is expected to help reduce overfitting. 3) Balancing the training folds so that each failure type is properly represented in each validation set. 4) Training the level 2 model with a random gap between encodings, which would expand the training set, provide an additional mechanism to compute confidence intervals, and increase the number of predictions used to calculate the final averaged RUL. 5) Finally, it would be interesting to study how predictive the class and health-state features are.
This work has been developed with a focus on reproducible research. To this aim, the parameters obtained and the parameter ranges used for model selection have been well described. In addition, the source code to train the models and validate the results can be found in the source code repository https://github.com/DatrikIntelligence/Stacked-DCNN-RUL-PHM21.

ACKNOWLEDGMENT
This work has been supported by Grant PID2019-109152GB-I00/AEI/10.13039/501100011033 (Agencia Estatal de Investigación), Spain and by the Ministry of Science and Education of Spain through the national program "Ayudas para contratos para la formación de investigadores en empresas