Evaluating Image Classiﬁcation Deep Convolutional Neural Network Architectures for Remaining Useful Life Estimation of Turbofan Engines

Accurate estimation of the remaining useful life (RUL) is a key component of condition-based maintenance (CBM) and prognosis and health management (PHM). Data-based models for the estimation of RUL are of particular interest because expert knowledge of systems is not always available, and physical modeling is often not feasible. Additionally, using data-based models, which make decisions based on raw sensor data, allow features to be learned instead of manually determined. In this work, deep convolutional neural network (CNN) architectures are investigated for their ability to estimate the RUL of turbofan engines. To improve the accuracy of the models, CNN architectures, which have proven successful in image classification, are implemented and tested. Specifically, the blocks used in the Visual Geometry Group (VGG) architecture, inception modules used in the GoogLeNet architecture, and residual blocks used in the ResNet architecture are incorporated. To account for varying flight lengths, the input to the models is a window of time series data collected from the engine under test. Window locations at the climb, cruise, and descent stages are considered. To further improve the RUL estimations, multiple overlapping windows at each location are used. This increases the amount of training data available and is found to increase the accuracy of the resulting RUL estimations by averaging the estimates from all overlapping segments. The model is trained and tested using the new Commercial Modular Aero-Propulsion System Simulation (N-CMAPSS) data set, and high prognosis accuracy was achieved. This work expands on the model developed and used in the 2021 PHM Society Data Challenge, which received second place.


INTRODUCTION
Remaining useful life (RUL) is the remaining time for which an asset will be productive (Si, Wang, Hu, & Zhou, 2011), and the RUL is a key component of condition-based maintenance (CBM) and prognosis and health management (PHM) (Jardine, Lin, & Banjevic, 2006). Knowing the RUL of an asset allows for the optimal scheduling of maintenance and ordering of spare parts, which can reduce asset downtime and increase profitability. RUL estimation also has safety and environmental implications by preventing failures that could put users in danger and extending the life of an asset, thereby reducing the need for new equipment (Si et al., 2011).
While the RUL of an asset is affected by many variables and cannot be exactly known, there are several modeling techniques for estimating the RUL. The three primary approaches for estimating the RUL of an asset are: physicsbased, data-based, and hybrid (Sikorska, Hodkiewicz, & Ma, 2011). Physics-based methods create a model of the system using an in-depth understanding of the underlying processes. While physics-based models have been shown to be successful (Bolander, Qiu, Eklund, Hindle, & Rosenfeld, 2009), they are often prohibitive due to the time and system understanding required to create them. Additionally, physics-based models tend to be specific to a failure mode, and failure modes must be well understood a priori (Sikorska et al., 2011). Data-based methods, including statistical and artificial intelligence (AI) methods, do not require knowledge of the underlying systems, but instead rely on the availability of a data set that captures the performance of the system. Finally, hybrid models merge these two approaches to reduce the underlying knowledge and data requirements of the two individual methods (Sikorska et al., 2011). While hybrid approaches work in some applications (Kong et al., 2020), they increase the modeling complexity by requiring the two models to be developed and merged.
In this work, modeling methodologies for estimating the RUL of turbofan engines are investigated. This was originally completed as part of the 2021 PHM Society Data Challenge, which used the new Commercial Modular Aero-Propulsion System Simulation (N-CMAPSS) data set (Chao, Kulkarni, Goebel, & Fink, 2021). This data set contains a large amount of run-to-failure data, so this work takes a data-based approach. Past research on the original CMAPSS data set demonstrated the applicability of convolutional neural networks (CNNs) (Li, Ding, & Sun, 2018;H. Yang, Zhao, Jiang, Sun, & Mei, 2019), long short-term memory (LSTM) (da Costa, Akeay, Zhang, & Kaymak, 2019;Zheng, Ristovski, Farahat, & Gupta, 2017;Wu et al., 2020), and hybrid methods, which merge CNN and LSTM (Zhao, Huang, Li, & Iqbal, 2020) for RUL estimation. The use of convolutions allows for a reduction of the number of model parameters relative to a neural network without convolutions and aids in the extraction of features (Albawi, Mohammed, & Al-Zawi, 2017). Additionally, convolution layers have the ability to realize complex relationships between different sensor readings. Specifically, this work focuses on applying CNN architectures developed for image classification to RUL estimation. Image classification architectures are of interest because if they can be leveraged for RUL estimation, then they have the potential to quickly advance the field of prognostics. Additionally, the theoretical justifications for using CNN on images, including the ability for filters to be learned and applied across an entire image, also apply to sensor data.
The rest of the paper is organized as follows. In section 2, the problem is introduced and the data set is described. In section 3, the methodology is detailed, including the data preprocessing and model architecture. Section 4 details the results of each trained model. Finally, sections 5 and 6, contain discussion and conclusions.

PROBLEM DESCRIPTION
The work was initiated as part of the 2021 PHM Society Data Challenge. The challenge description is thus the same as the problem description presented in this section. This work builds on the result of the Data Challenge (DeVol, Saldana, Fu, & Woodruff, 2021) by including results from multiple model architectures and various model inputs including the flight stage and incorporating multiple windowed inputs.

Data Set Description
In this work, the new Commercial Modular Aero-Propulsion System Simulation (N-CMAPSS) data set is used to test a data-driven method for RUL estimation. This data set contains realistic run-to-failure data of turbofan engines (Chao et al., 2021). The N-CMAPSS data set offers higher fidelity data than the original CMAPSS data set by incorporating real recorded flight conditions and relating the degradation process to its operation history to extend the degradation model (Chao et al., 2021).
The data set models the failure trajectories of eight different failure modes that affect either the efficiency or flow of one or more of the sub-components of the engine. One of the failure modes included in the N-CMAPSS data set was not included in the 2021 PHM Society Data Challenge, so it was not used in the development of the tested models. The seven failure modes used in the data challenge are shown in Table 1. The failure modes span across all the rotating sub-components: fan, low-pressure compressor (LPC), high-pressure compressor (HPC), high-pressure turbine (HPT), and low-pressure turbine (LPT) and can affect either efficiency (E) or flow (F).
The flight durations are divided into three classes based on their length, but this work considers them together by using a window function, which is discussed in section 3. Each flight has an unknown initial condition and is run to failure, so the number of flights varies for each unit. Across all failure modes, the data set contained 90 units with data from a combined 6,825 flights. The data file for each flight contains scenario descriptors (w), measurements (x s ), an RUL label (y), and auxiliary data. The variables contained within the scenario descriptors, measurements, and auxiliary data are detailed in Tables 2, 3, and 4, respectively.

Problem Definition
The goal of this work is to develop a model that estimates the RUL of a turbofan engine given a data set D = {w i , x si , y i } N i=1 , which contains run-to-failure data for N total flights of turbofan engines subject to different failure modes. The length of the sensory signals w and x s is not constant between flights, so the model should be able to incorporate variable lengths of input data. The performance of the model is evaluated using a combination of the rootmean-square error (RMSE) and NASA's scoring function (Saxena, Goebel, Simon, & Eklund, 2008), calculated from the actual (y) and predicted (ŷ) RUL values. RMSE is calculated following Eq. (1), where m v⇤ is the number of test samples. NASA's scoring function (s c ) is calculated using Eq. (2), where ↵ is defined in Eq. (3). With ↵ defined this way, under-estimations are preferable to over-estimations. The values calculated in Eqs. (1) and (2) are combined using Eq. (4) to get the final score. The values of each of these equations as a function of the difference between the actual and predicted RUL values is shown in Figure 1.
10 X X X X DS07 10 X X DS08 25 X X X X X X X X X X Total pressure in bypass-duct psia P2 Total pressure at fan inlet psia P21 Total pressure at fan outlet psia P24 Total pressure at LPC outlet psia Ps30 Total pressure at HPC outlet psia P40 Total pressure at burner outlet psia P50 Total pressure at LPT outlet psia Health state - Figure 1. Evaluation scores as a function of the difference between the actual and predicted RUL values.

Data Pre-Processing
In the data pre-processing phase, the scenario descriptors are combined with the measurements. This is done because the scenario descriptors can provide context to the measurement readings. For the time series data of a flight, which contains m values, the scenario descriptors w 2 R m⇥4 and measurements x s 2 R m⇥14 are combined to form x 2 R m⇥18 .
The flight duration is not consistent across each flight, so the length of each x s and w varies. To account for this, a windowing function is used, which crops the measurement data and scenario descriptors to a specified length. Multiple loca-tions for this window were considered, including the climb (beginning), cruise (middle), and descent (end) portions of the flight. After applying the window function, the length of the input data is consistent, and the number of model parameters is reduced compared to a model that considers the whole flight. Three window sizes of 30, 60, and 120 samples are considered. In addition to considering a single window of data from each flight, three consecutive windows with 50% overlap were also considered. This triples the amount of training data by creating three training points from each flight, which are each associated with the same RUL. Three windows with 50% overlap were considered because the data within each windowed segment is expected to be similar since the outside segments are adjacent and share half of their data with the middle section. This allows each of the three windowed segments to be treated as separate training samples with the same labels which would not be possible if more dispersed windows were used.
Finally, the input data is standardized so that it has zero mean and unit standard deviation. This is done independently for each of the 18 features. The mean and standard deviation for each feature are calculated from the training data, then applied to both the train, validation, and test data to calculate the standardized input by where µ j and j are the mean and standard deviation of the j th feature, x i,j is the i th value of the j th feature, and X i,j is the final standardized value input to the model.

Model Architectures
In this work, multiple deep convolutional neural network (CNN) architectures are investigated for their ability to estimate the RUL of turbofan engines. Three primary architectures that have been shown to be successful in classifying images were evaluated. The first uses blocks of CNN layers similar to the Visual Geometry Group (VGG) architecture (Simonyan & Zisserman, 2014). The second makes use of inception modules introduced in the GoogLeNet architecture (Szegeandy et al., 2015). The final architecture uses residual blocks based on the ResNet architecture (He, Zhang, Ren, & Sun, 2016). All of these models were originally developed for image classification, so the convolutional layers are two dimensional. To adapt these architectures for use on the time series data from the N-CMAPSS data set, one-dimensional convolutional layers are used. The convolutions take place along the time axis.

VGG Architecture
The first architecture tested makes use of blocks of CNN layers similar to the VGG architecture. In the VGG architecture, only small (3x3) convolution filters are used, and each block of convolution layers is followed by a max pooling layer, where the input is down-sampled by taking the maximum over a 2x2 window with a stride of 2 (Simonyan & Zisserman, 2014). An example of an architecture making use of this configuration and adapted for the time series data in the N-CMAPSS data set is shown in Table 5. To adapt for the N-CMAPSS data set, the architecture uses one-dimensional convolutions, as opposed to the original two-dimensional convolutions, so that the input can be a two-dimensional matrix of columns of sensor data instead of an image.

Inception Architecture
The next model architecture makes use of inception modules. Inception modules contain parallel convolutional layers of different sizes and a single max pool layer. The outputs of each of these parallel layers are concatenated and sent to the next layer. The original inception network used two dimensional convolutions of sizes 1x1, 3x3, and 5x5 and a 3x3 max pool layer (Szegeandy et al., 2015). The use of variablesized convolutions allows for variable-sized features to be detected within a shallower neural network. This version of the inception module is shown in Figure 2a. To reduce the dimensionality of the model, convolutions of size 1x1 can be incorporated before each convolution layer and after the max pool layer. These convolutions are performed with a stride of one so that the first and second dimension remain unchanged, but the third dimension is reduced. An inception module that incorporates dimensionality reduction is shown in Figure 2b.
An example of a model architecture making use of these inception modules and adapted for the time series data in the N-CMAPSS data set is shown in Figure 3.

Residual Network Architecture
The final network architecture explored in this work makes use of residual blocks. The residual blocks use a skip connection, which adds the input of a stack of two convolution layers to the output of those two layers. These shortcut connections, which do not add model parameters, are able to increase the accuracy of deep neural networks (He et al., 2016). An illustration of a residual block is shown in Figure 4.
When using the simple implementation shown in Figure 4, errors can occur when the output of the two convolution layers is not the same as the input. To correct for this, the shortcut can be padded with zeros, or a 1x1 convolution layer can be used to match the shortcut to the output of the convolution layers (He et al., 2016). An example of a model architecture making use of residual blocks and adapted for the time series data in the N-CMAPSS data set is shown in Figure 5. This model uses single dimensional convolutions on the shortcut connection to correct for mismatches in the number of filters.

Model Training
Across all the failure types and tested units, flight data from 6,825 flights were used for training and testing. Within the N-CMAPSS data set, this data is split with into development and test sets with data from 4,089 flights (60%) in the development set and data from 2,736 flights (40%) in the test set. In this work, the development data was further split using 5fold cross validation. The average validation accuracy across the five folds was used to tune the hyperparameters for each  model architecture, and all validation scores reported in this work are the average across the five folds. After hyperparameter tuning, the test data was used on the final trained model from each network architecture. In the instances where three overlapping windows are used, the amount of training data is increased threefold, but the amount of test data remains the same. This is because in training, the three windows are treated independently as three separate samples each having the same RUL label. In validation and testing, the average RUL estimate over the three windows is used as the final prediction for the unit under test.
The models were implemented in Python (version 3.8.12) using the Keras (Chollet et al., 2015) library (version 2.4.3) and Tensorflow (Abadi et al., 2015) backend (version 2.3.0). In training, the accuracy of the model was evaluated by taking the mean squared error (MSE) between the predicted and ac- Figure 5. Example residual network architecture adapted to time series data. The convolutional layer parameters are denoted as "Convhreceptive field sizei-hnumber of channelsi".
tual remaining useful life of all the samples. MSE was used in training for ease of implementation, but score values as described in Eq. (4) were computed on the validation and test data sets. During hyperparameter tuning, training was conducted for 40 epochs, and once the final architecture was set, the final models were trained for 60 epochs. Hyperparameters included the number of layers, number of nodes in each layer, batch size, and drop-out rate were set using a gridsearch (L. Yang & Shami, 2020). For the model architectures, the search space started with minimal models with few layers and nodes and the complexity was increased until overfitting was observed. The search space included batch sizes of 32, 64, and 128 and dropout rates of 40%, 50%, and 60%. Following past research, a dropout of 50% was used as the midpoint for the dropout rate search space (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). Ultimately, a batch size of 64 and dropout rate of 50% were used for all models. For both training and testing, the Adam optimizer and a learning rate of 0.001 were used. The Adam optimizer was used because of its recent broad adoption in deep learning (Kingma & Ba, 2014;Ruder, 2016). Because the Adam optimizer uses an adaptive learning-rate, the default initial learning rate of 0.001 can be used without tuning (Ruder, 2016).

VGG Architecture
In the development of the VGG-based model, network configurations with different numbers of convolution layers and filters were tested. The details of the best performing architecture, based on the validation score, is given in Table 6.
After the last convolution layer, the filters are flattened and fully connected to a hidden layer with 128 cells. Dropout at a rate of 50% is included in this layer to reduce overfitting. In the last layer, a single node with a linear activation outputs the RUL estimation of the flight data under investigation. The model performed best when trained with 3 overlapping segments of 60 sensor readings from the climb stage of the flight. The training and validation scores of this architecture, trained with different input configurations, are compared in Table 7. The flight stages were explored equally, but more results for the climb stage are shown in Table 7 because models trained with data from the climb stage produced lower scores. This trend continued for all model architectures investigated. In total, this model has 286,593 trainable parameters, and the breakdown of these parameters by layer is given in Table 6. A parity plot of the final VGG model retrained with all of the training data is shown in Figure 6.

Inception Architecture
In testing the inception architecture, the number of inception modules and fully connected layers, as well as the numbers of filters and nodes in each were varied. The details of the architecture that yielded the best score on the validation data is shown in Table 8. Three inception modules are used in the network. In the first module, the dimensionality reductions before the convolution layers are not included so that the first convolutions see the raw signal. This is done because the input data is relatively small with 18 sensor readings, so the benefits of dimensionality reduction are minimal.
In the second and third modules, the dimensionality reduction is included because the stacked filters from the first inception module expand the dimensionality. After the last inception module, the convolution filters are flattened and fully connected to a hidden layer with 256 cells. Dropout at a rate of 50% is included in this layer to reduce overfitting. Finally, in the last layer, a single node with a linear activation outputs the RUL estimation of the flight data under investigation. The training and validation scores of this architecture, trained with different input configurations, are compared in Table 10. In total, the model has 2.05 million trainable parameters. The breakdown of these parameters by layer is shown in Table 9. A parity plot of the final inception model retrained with all of the training data is shown in Figure 7.

Residual Network Architecture
To test the performance of residual architecture, models with varying numbers of residual blocks and fully connected layers, and varied numbers of filters and nodes in each were trained. The details of the architecture that yielded the best score on the validation data are shown in Table 11. This model was comprised of two residual blocks after which the convolution filters are flattened and fully connected to a hidden layer with 128 cells. Dropout at a rate of 50% is included in this layer to reduce overfitting. Finally, in the last layer, a single node with a linear activation outputs the RUL estimation of the flight data under investigation. The training and validation scores of this architecture, trained with differ-  Table 12. In total, this model contained 1.03 M trainable parameters when setup for a 60x18 input. Unlike the other architectures tested, for the residual-based architectures the best performance was for the shorter, 30 sample input as opposed to the 60. The improvement, however, was minimal so the 60 sample input was used for easy comparison back to the other models. A parity plot of the final residual-based model retrained with all of the training data is shown in Figure 8.

Comparison
The train and test scores for the final model from each category are shown in Table 13. The means from 10-fold cross validation are reported with their standard deviation. The train score is from the final model trained on all training data, and the test score is calculated on the previously unused test set. A one-way ANOVA was performed on the crossvalidation results and found that the differences between the models were not significant (F(2,27)=0.957 p=0.40). The ratio of the standard deviations was 1.4, and the Q-Q plot used to verify normal distributions is shown in Figure 9. It is worth noting that the overlapping training sets of 10-fold cross validation may exhibit increased type I error when used for statistical tests (Dietterich, 1998). In addition to each

DISCUSSION
The results of this work presented in Table 13 demonstrate that all of the CNN architectures were able to predict the RUL with similar accuracy. The inception architecture offered the greatest accuracy, but the increase in its performance compared to the simpler VGG was minimal, considering the increase in trainable parameters and training time. This indicates that a very deep model is not required for RUL estimation, so the benefits that the residual-and inceptionbased architectures offer are not realized. The applicability of CNNs on the N-CMAPSS data set, demonstrated in this work, and their past success on the original CMAPSS data set (Sateesh Babu, Zhao, & Li, 2016;Li et al., 2018;H. Yang et al., 2019), show the broad applicability that they have in RUL estimation.
To determine if the differences in the performance of the models when using the climb portion was significant, a oneway ANOVA was performed on the model scored obtained   Table 15. The result was that the flight stage was found to be a significant factor for all model architectures (VGG F(2,27)=15.7 p=3.0e-5, Inception F(2,27)=16.9 p=1.7e-5, and Residual F(2,27)=23.9 p=1.0e-6). The ratio of the standard deviations for the VGG, Inception, and Residual architectures were 1.5, 1.3, and 1.3 respectively. Q-Q plots used to verify normal distributions are shown in Figure 10-12.
Tukey's HSD Test for multiple comparisons found that for the VGG architecture, the model difference was significant when comparing all flight stages. For the Inception and Residual architectures, the climb stage was significantly different from the cruise and descent, but the cruise and descent difference was not statistically significant. The p-values for the Tukey test are shown in Table 16.
From the parity plots in Figures 6 -8 Figure 10. Q-Q plot for comparing the performance of various flight regions when using the VGG model architecture.
of 40 units and 1,260 data points above an RUL of 40 units. Note that the models were not retrained with this split data, only reevaluated. Tables 17 and 18 highlight the increase in model accuracy as the true RUL decreases. This is not surprising, as one may expect it to become easier to estimate the RUL as the signs of wear become more obvious. With this trend in mind, future modeling efforts may find it worthwhile to focus efforts on making accurate estimation only after an engine has been used for a certain number of cycles. This has the potential to increase the model accuracy as the engine nears the end of its life, where an accurate RUL estimate becomes more valuable.
It is worth noting that the parity plots for each model show negative estimated RUL, which does not make practical sense. This would not occur if the model output used a rectified linear unit (ReLU). A ReLU is linear for inputs greater than zero and zero otherwise (Agarap, 2018). Training the models with a ReLU activation on the output node is not practical, so instead the model is trained with a linear activation function. The parity plots show the negative values to illustrate the raw model outputs.   al., 2019), and its success here on the N-CMAPSS further validates its applicability. While accurate RUL estimations can be made using data from a single flight, the use of historical data may improve the accuracy of the RUL estimations. Future work should investigate incorporating historical data using recurrent architectures, such as LSTM.
One of the limitations of this work is that RUL estimations are not explainable, or the model is not able to explain why an RUL estimation was made. In future work, this could be improved upon by incorporating the ability of the model to predict what component is causing the RUL to be reduced. Since the N-CMAPSS data set contains multiple failure modes across all components, and the failure mode of each unit is labeled, future work on the N-CMAPSS data set should consider labeling the failing component, in addition to estimating the RUL. Another limitation with this approach is that run-to-failure data from a variety of failure modes is needed. In cases where this is not applicable, meth-  (Coble & Hines, 2011), should be considered instead. Finally, the techniques used in this work to improve model accuracy, such as averaging the result of overlapping windows, cannot be guaranteed to extend to other systems. Future work should consider the conditions for applying these techniques to other systems.

CONCLUSION
This work demonstrated the applicability of adapting popular image-based, two-dimensional CNN for the use of estimating RUL from one-dimensional time series data. The residualand inception-based architectures were able to increase the number of trainable parameters without overfitting, but only the inception architecture showed an improvement, although small, over the VGG-based model. Adapting these architectures for time series data allows for RUL estimation to take advantage of the recent advances in CNN architectures that have been driven by image classification. The performance of the residual-and inception-based architectures relative to the VGG-based model indicate the complexity of image classification architectures has surpassed what is necessary for RUL estimation of turbofan engines and thus further advances to image classification architectures may not benefit RUL estimation.
This work also investigated which portion of the flight (climb, cruise, or descent) contained the most information for RUL estimation. For all model architectures investigated, the climb portion was found to be better than the cruise or descent portions. Additionally, the use of overlapping segments was in- vestigated. Using overlapping segments increased the amount of training data available, and the final estimate of an engine's RUL was determined by averaging the model output for each of its inputs. This approach was found to increase the average accuracy for all tested models and presents a potential solution that other researchers can use when working with windowed segments of time series data.