Remaining Useful Life Estimation Using Neural Ordinary Differential Equations

Data-driven machinery prognostics has seen increasing popularity recently, especially as the effectiveness of deep learning methods grows. However, deep learning methods lack useful properties such as uncertainty quantification of their outputs, and they have a black-box nature. Neural ordinary differential equations (NODEs) use neural networks to define differential equations that propagate data from the inputs to the outputs. They can be seen as a continuous generalization of a popular network architecture used for image recognition known as the Residual Network (ResNet). This paper compares the performance of each network on machinery prognostics tasks to show the validity of Neural ODEs in machinery prognostics. The comparison is done using NASA's Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset, which simulates the sensor information of degrading turbofan engines. To compare both architectures, they are set up as convolutional neural networks and the sensor signals are transformed to the time-frequency domain through the short-time Fourier transform (STFT). The spectrograms from the STFT are the input images to the networks and the output is the estimated RUL; hence, the task is turned into an image recognition task. The results show that NODEs can compete with state-of-the-art machinery prognostics methods. While the NODE does not beat the state-of-the-art method, it is close enough to warrant further research into using NODEs. The potential benefits of using NODEs instead of other network architectures are also discussed in this work.


INTRODUCTION
Machinery prognostics is defined as the process used to estimate the remaining useful life (RUL) of machinery or its components. To estimate the RUL, data is needed that indicates the machinery's health, and a prognostics method is needed which will extrapolate and find the RUL based on the current health. There are three main categories of machinery prognostics methods: physics-based, data-driven and hybrid. For this work, data-driven methods are employed as they have become increasingly popular with the growing amount of data gathered on machinery using sensors. Data-driven models optimize their parameters based on the historical data inputs to estimate the RUL, which is referred to as training the model.

Marco Star et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Machine learning techniques have become increasingly used in data-driven machinery prognostics. Implementation of these techniques generally involves pre-processing the historical data, extracting features from the data that correlate well with the RUL and training the machine learning model on the data and features. However, deep learning methods do not require feature extraction as the deep neural networks can automatically extract features from raw data (W. Zhang, Yang, & Wang, 2019).
The lack of a feature extraction step is not the only benefit of using deep learning methods in machinery prognostics; they perform well compared to other data-driven methods and require no expert knowledge of the system. For example, recurrent neural networks (RNNs) are well suited for extracting features from time-series signals, such as sensor signals, and have been shown to achieve state-of-the-art performance in this task. In machinery prognostics, RNNs have been used to directly model and forecast the degradation of equipment (Y. Zhang, Xiong, He, & Pecht, 2018; Zhou, Huang, Pang, & Wang, 2019), or to extract features directly from time-series inputs and output a RUL estimate (Wu et al., 2019; Huang, Huang, & Li, 2019; Listou Ellefsen, Bjørlykhaug, Æsøy, Ushakov, & Zhang, 2019). While RNNs are well suited for time-series problems, other deep neural networks have also been utilized for estimating the RUL.
The convolutional neural network (CNN) is a network which is normally used in image recognition and does not need to perform computations sequentially like the RNN. Due to the CNN's high performance in image analysis and recognition tasks, it is common to frame the machinery prognostics task as an image recognition task. This is usually done by applying a time-frequency transform on the sensor signals, creating a spectrogram from the output coefficients, and using that as the image with a corresponding RUL as a target value. For example, (Zhu, Chen, & Peng, 2019) used the wavelet transform on the input sensor data to transform it into this time-frequency format and then used a CNN to estimate the RUL. Other transforms have also been used, such as short-time Fourier transforms (STFT) (Li, Zhang, & Ding, 2019), Empirical Mode Decomposition (EMD) (Yang, Yao, Ye, & Xu, 2020), the regular Fourier transform (Wang, Zhao, Ma, Chang, & Mao, 2019) and simply time-windowed raw data (Li, Ding, & Sun, 2018). All these methods applied the CNN to the transformed data to output a RUL.
Deep learning methods can estimate the RUL to a high accuracy, however, the methods mentioned so far did not focus on interpretability of the results. These methods can be a black box where the inputs are fed into the network and an output is given that can be accurate assuming the correct data was used as an input. Most deep learning methods in machinery prognostics only provide point estimates without incorporating uncertainty into the RUL estimate (Peng, Ye, & Chen, 2019). The uncertainty quantification is important in machinery prognostics for decision making due to the influence different sources of uncertainty have on the RUL prediction (Sankararaman, 2015). Some techniques have however been used to incorporate uncertainty into RUL estimates using machine learning or deep learning methods.
When using a measurable variable which somehow indicates the component's health, a Bayesian filter, such as the Kalman or particle filter, can be directly applied to update the variable based on newly observed measurements. Hence, a machine learning model may be used to model the health indicator and extrapolate to some threshold value indicating the end of its useful life, while the Bayesian filter updates the probability distribution given new measurements. For example, machine learning models such as the Relevance Vector Machine (RVM) can be used to mathematically model battery capacity degradation while using the particle filter to update the RUL estimates based on new measurements (Saha, Goebel, Poll, & Christophersen, 2009; Hu & Luo, 2013). (Guo, Li, Jia, Lei, & Lin, 2017) did not have a measurable health indicator but constructed one with a known threshold using bearing vibration data as an input to a RNN whose output was a one-dimensional health indicator variable. This health indicator had a known threshold, and a simple exponential model was fit to the health indicator and used to extrapolate to the threshold value. The exponential model parameters could be described using normal distributions with more ease than a deep neural network, which has considerably more parameters. RNNs have also been altered to model physics-based differential equations which describe health indicators such as crack length and grease degradation (Yucesan & Viana, 2019; Dourado & Viana, 2019; Nascimento & Viana, 2019). These can similarly include uncertainty in their parameters due to the relative simplicity of the final degradation model. Deep neural networks can also incorporate uncertainty into their structure; these are known as Bayesian neural networks, and previous work has shown how to alter various state-of-the-art deep neural networks into Bayesian neural networks.
However, most popular Bayesian networks rely on either increasing the amount of parameters in the network (Blundell, Cornebise, Kavukcuoglu, & Wierstra, 2015) or utilizing Monte Carlo simulations when evaluating the network (Gal & Ghahramani, 2016), making them slower than a standard deep neural network. Most of these methods that account for uncertainty either:

• Use Monte Carlo simulation, thereby increasing the computational cost
• Require a physics-based mathematical description of the degradation mechanics
• Require the output to be a measurable or known physics-based health indicator

Hence, it would be useful to investigate avenues that allow for deep learning methods to account for uncertainty without these drawbacks.
Recently, a type of neural network architecture known as Neural Ordinary Differential Equations (NODEs) has been introduced (Chen, Rubanova, Bettencourt, & Duvenaud, 2018). NODEs essentially describe the input-to-output variable transformation by a trajectory through a vector field, i.e. the network's hidden variable dynamics. The vector field is defined by a neural network and the trajectory is solved using numerical ODE solver schemes such as Euler's method or Runge-Kutta methods. The investigation of these vector fields, which define the hidden variable dynamics, could be a research avenue with the potential to improve interpretability of the network or quantify uncertainty (e.g. if stochastic dynamics are used instead). Existing knowledge on differential equations could open interesting research avenues to alter the NODE architecture to give meaningful RUL estimates and reduce the black-box nature of the network. The aim of this paper is to show the applicability of NODEs in machinery prognostics by applying them to the popular C-MAPSS turbofan engine dataset and comparing their performance to other methods. This is largely to show how NODEs hold up to state-of-the-art machinery prognostics methods. Hence, in this study NODEs will function like most black-box neural networks without a focus on their dynamics to keep the experiment simple. The idea is to first investigate whether NODEs can perform well on a RUL estimation problem given degradation data. If they perform well, this may increase incentive and discussion of how to leverage their properties to benefit machinery prognostics tasks. Section 5 provides a discussion on the useful properties NODEs have and how they might be leveraged for machinery prognostics given further research. Another network known as the Residual Network (ResNet) (He, Zhang, Ren, & Sun, 2016), a type of CNN popular in image recognition tasks, will also be included in the experiments.
This is mainly due to the fact that NODEs were introduced as a continuous version of the ResNet, so they are expected to perform similarly. The NODE-based network will be a CNN so it is easily comparable with the ResNet, and the input sensor data will be transformed to spectrograms using a short-time Fourier transform (STFT). A discussion of the results, including possible research directions for utilizing or extending NODEs for machinery prognostics applications, is given after the results of the experiments.

Short-Time Fourier Transform
The short-time Fourier transform (STFT) is a way of analyzing a non-stationary signal by approximating discrete parts of the signal as stationary. This is done through a sliding window that acts over the signal and applies the Fourier transform on the windowed signal to extract frequency information at that time. Mathematically this can be stated as Eq. (1),

X(\omega, t) = \int_{-\infty}^{\infty} f(\tau)\, w(t - \tau)\, e^{-j\omega\tau}\, d\tau \qquad (1)

where f(t) is the signal, w(t - \tau) is the window function and X(\omega, t) is the resulting time-frequency signal.
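The discrete version of Eq. (1) can be sketched as a loop of windowed FFTs. This is a minimal illustrative implementation, not the pipeline code used in the paper; the window size, hop length and Hann window are arbitrary choices for the example.

```python
import numpy as np

def stft(signal, window_size=64, hop=32):
    """Naive discrete STFT: slide a Hann window over the signal and
    take the FFT of each windowed segment (illustrative sketch)."""
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        segment = signal[start:start + window_size] * window
        # keep only non-negative frequencies (the input signal is real)
        frames.append(np.fft.rfft(segment))
    # rows: frequency bins, columns: time frames
    return np.abs(np.array(frames)).T

# toy signal: a 50 Hz sine sampled at 1 kHz for 1 second
t = np.arange(0, 1, 1e-3)
spec = stft(np.sin(2 * np.pi * 50 * t))
```

Each column of `spec` is the magnitude spectrum of one time window, so plotting `spec` as an image yields the spectrogram used as the CNN input.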
The STFT was used for simplicity as it does not require the selection of a basis function, as is the case with the wavelet transform. Different basis function selections can influence the results significantly; this influence was avoided by using the STFT instead of the wavelet transform.

Convolutional Neural Networks
Convolutional Neural Networks (CNNs) (LeCun, 1989) are feedforward neural networks which use filters to extract features and relate data points which are spatially close on the grid. The filter is made up of learnable parameters called weights which are optimized during the training of the network. The main component of the CNN is the convolutional layer; this layer performs the convolution operation between the filter weights and the layer input data.
In this case, the inputs are a grid of intensity values found from the time-frequency transform. Hence, the example of the convolution layer will involve a 2D grid of input values. The filter in the 2D case is a matrix of weight parameters that act on the input values through the convolution operator. In this 2D case the convolution operation will generate output values that are elements of an output matrix whose size is dependent on hyperparameters such as filter size and stride (the amount of spaces the filter "slides" over). The convolution operator in this case involves the filter "sliding" over the input values in the grid, multiplying the filter values with the inputs element-wise and summing these to a single value (Dumoulin & Visin, 2016). The convolution layer operations are showcased in Figure 1. The parameters in the filter are the trainable parameters which are optimized to produce the desired output given the inputs. The CNN uses multiple filters and therefore gives multiple outputs, which are often flattened to a vector that becomes the input to a standard feedforward neural network which produces an output of the desired size. Notice that the CNN essentially trains different filters which all give weighted averages of the local area of the input data. In this work, the local area represents the time and frequency coordinates while the values themselves represent the intensity of the frequencies at that time, which indicates the energy of the signal.
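The sliding multiply-and-sum operation described above can be sketched directly. This is an illustrative "valid" cross-correlation (the operation CNN libraries call convolution), not the paper's implementation; the 4×4 input and 2×2 averaging filter are toy values.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2D cross-correlation as used in CNN layers: slide the
    kernel over the image, multiply element-wise and sum each patch."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "spectrogram"
kernel = np.ones((2, 2)) / 4.0                    # 2x2 averaging filter
result = conv2d(image, kernel, stride=2)          # 2x2 output
```

With stride 2 the 2×2 filter visits four non-overlapping patches, so the output is the 2×2 grid of their averages, matching the output-size formula (4 − 2)/2 + 1 = 2.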

ResNet
The Residual Network (ResNet) (He et al., 2016) is a convolutional neural network which learns the residual or difference in the hidden variables. The equation for the ResNet is given by Eq. (2),

x_{i+1} = x_i + f_i(x_i, \theta_i) \qquad (2)

where x_i is the variable at the ResNet network layer indexed by i = 1, ..., n, f_i is a neural network used by the ResNet at layer i and \theta_i is the network parameters at layer i.
Equation 2 describes a ResNet block which can be illustrated using the diagram shown in Figure 2.
The addition of the inputs x_i to the neural network output f_i(x_i, \theta_i) results in the ResNet block having a "skip connection". Therefore, the inputs (x_i) as well as the network outputs (f_i(x_i, \theta_i)) influence the final output of the ResNet block (x_{i+1}).
Notice that the network f_i describes the difference or residual between the two hidden variables, i.e. x_{i+1} - x_i, hence the name ResNet. This formulation was first applied to CNNs (f_i was a CNN) and resulted in improved performance on image recognition tasks (He et al., 2016).

Figure 2. A ResNet block, where x_i is the input data and \theta_i are the network parameters. The rectangular block represents the neural network, \oplus represents element-wise addition and x_{i+1} is the output.
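Eq. (2) can be expressed in a few lines. This is a minimal sketch of the residual update, not the convolutional blocks used in the experiments; the single tanh layer standing in for f_i is an arbitrary illustrative choice.

```python
import numpy as np

def resnet_block(x, f, theta):
    """One ResNet update: the output is the input plus the learned
    residual, x_{i+1} = x_i + f_i(x_i, theta_i)."""
    return x + f(x, theta)

def f(x, theta):
    """Toy residual function: one linear layer with a tanh nonlinearity."""
    W, b = theta
    return np.tanh(W @ x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
theta = (rng.standard_normal((4, 4)) * 0.1, np.zeros(4))
x_next = resnet_block(x, f, theta)  # skip connection preserves x's scale
```

Because the block only adds a correction to x, stacking many such blocks behaves like taking many small steps along a trajectory, which is exactly the view the NODE formulation makes explicit.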

NODE
Neural Ordinary Differential Equations (NODEs) (Chen et al., 2018) were first introduced as a continuous version of the ResNet. The equation describing NODEs is given by Eq. (3),

\frac{dx_t}{dt} = f(x_t, t, \theta) \qquad (3)

where x_t is the variable at the layer indexed by t = 1, ..., n, f is the neural network describing the ODE, i.e. dx_t/dt, and \theta represents the network parameters. Figure 3 shows a simple illustrated example of how NODEs take input variables and transform them to an output variable.

Figure 3. A visual example of how a NODE propagates the input variable x_0 through the network, which is described by an ODE. Each solid arrow is an evaluation of the network dx_t/dt = f(x_t, t), which is used to find the next step or next network hidden layer variable, where t is the depth of the network. Since there are 5 arrows shown, this is equivalent to a network with 5 hidden layers. The dotted path labelled x(t) is the trajectory the variables take through the NODE and x_N is the final output of the NODE.
Here, the network f is constant throughout each layer or evaluation, while the ResNet had a different network block f_i for each layer. This is the major difference between NODEs and other neural networks: a NODE uses a single neural network to describe a larger "network". The inputs are transformed to the network output variables using a numerical differential equation solver, and the network f acts as the function describing the first-order ODE dx/dt. In this case the input variables are propagated through the vector field described by the neural network f. By choosing the number of steps (or step size) in the ODE solver, the amount of "layers" or function evaluations is controlled. Since the parameters \theta of the neural network do not change as the variables are propagated through the vector field, the network will be denoted as f(x_t, t) instead of f(x_t, t, \theta) for simplicity.
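The forward pass just described — repeatedly evaluating f with a fixed-step solver — can be sketched with the classical Runge-Kutta (RK4) scheme, the fixed-step solver family used later in the experiments. This is an illustrative sketch, not the paper's network: the vector field dx/dt = -x stands in for the trained CNN, and its exact solution x0·exp(-t) lets us check the solver.

```python
import numpy as np

def rk4_step(f, x, t, h):
    """One classical Runge-Kutta (RK4) step of size h for dx/dt = f(x, t)."""
    k1 = f(x, t)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def node_forward(f, x0, t0=0.0, t1=1.0, steps=5):
    """Propagate the input x0 through the NODE's vector field f from
    'depth' t0 to t1; each solver step plays the role of a hidden layer."""
    x, t = x0, t0
    h = (t1 - t0) / steps
    for _ in range(steps):
        x = rk4_step(f, x, t, h)
        t += h
    return x

# toy vector field dx/dt = -x, whose exact solution is x0 * exp(-t)
out = node_forward(lambda x, t: -x, np.array([1.0]))
```

With 5 steps the output is already within about 1e-5 of the exact value exp(-1); choosing `steps` controls the effective network depth exactly as described above.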

DATA
To compare the ResNet and NODEs, both will be trained on a prognostics dataset to estimate the RUL of machinery. This section explains the contents of the dataset; additionally, it covers how the data was prepared to become the inputs of the neural networks, as well as how the RUL targets were prepared for the supervised learning task.

C-MAPSS dataset
The data used here is NASA's Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset (Saxena & Goebel, n.d.). This dataset was produced through simulations of sensor readings for aircraft gas turbine engines. The RUL values, and therefore the time of failure, are known for each simulated engine and were found when the simulated variables reached a certain threshold determined by a failure criterion. There are four separate datasets which vary in difficulty with respect to RUL estimation; a summary of each dataset is given in Table 1. The possible failure modes that could occur in this dataset are either the high-pressure compressor (HPC) or the fan degrading below a certain threshold. Each dataset includes training and testing trajectories (time-series data) which contain the sensor values and operating conditions of a number of different units/engines. The training set is simulated until failure; hence, the final time point is the time of failure (i.e. RUL = 0 at the final time). The testing dataset contains trajectories up to a random time point, and the RUL at that time point is supplied for each trajectory. The trained network can be evaluated on this testing set and the estimated RUL can be compared with the true RUL value given in the testing dataset.
Each dataset consists of multiple time series for both the training and testing datasets. There are 26 columns of data for each unit. The contents of each column are specified in Table 2.
Using the C-MAPSS dataset, the two CNNs, both the ResNet and the NODE (now referred to as NODE-CNN) architectures, will be trained using the training dataset and have their performance compared on the testing dataset. Note that the NODE-CNN will have a CNN architecture simply for ease of comparison to the ResNet. Specifics on evaluating the performance of the networks on the test data are given in section 5 (Results and Discussion).

Preparing Data
For the prognostics algorithm used in this study, the data must be prepared to be suitable as an input of the CNNs used (both the ResNet and NODE-CNN). Here, suitability means that the CNN can easily extract the relevant features from the data and correlate the features to the RUL. For example, the time-series signals can be converted to a 2D time-frequency spectral plot using time-frequency methods such as wavelet transforms or short-time Fourier transforms (Li et al., 2019). This would convert the time-series data into an image format from which the CNN can extract features. For this experiment a similar process was used, which involved the following general steps.

Figure 4. A single time-series signal which is transformed to the time-frequency domain using the STFT. The frequency intensities from the Fourier coefficients are shown in the image that results from the time-frequency transform.

First, uninformative sensor signals were dropped, as they do not capture useful degradation information of the machine for estimating the RUL. To drop these sensor signals from the dataset, a small threshold standard deviation was chosen (e.g. 1 × 10⁻⁸). Any sensor signals whose standard deviation was below this threshold were not used. Some sensor signals were constant throughout but had large outliers; these standard deviations passed the threshold test. It was found these signals did not help performance and therefore another criterion was added to ensure these sensors were also dropped. This criterion simply checked how many unique values were in the sensor signal; if there were fewer unique values than a threshold value (less than 3 unique values in this case) it was dropped. This procedure identified a group of sensor signals which were better suited for the machinery prognostics task. The sensors which were dropped are listed in Table 3. It is common in prognostics to normalize the time series data so all sensor signals are similar in magnitude.
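The two dropping criteria above can be sketched as a small filter function. This is an illustrative sketch with the thresholds quoted in the text (1 × 10⁻⁸ standard deviation, fewer than 3 unique values); the sensor names and toy signals are hypothetical.

```python
import numpy as np

def select_sensors(signals, std_threshold=1e-8, min_unique=3):
    """Return the names of sensors worth keeping. A sensor is dropped if
    its standard deviation is below std_threshold (essentially constant)
    or if it has fewer than min_unique distinct values (constant apart
    from outliers, which would still pass the std test)."""
    keep = []
    for name, values in signals.items():
        values = np.asarray(values, dtype=float)
        if values.std() < std_threshold:
            continue  # flat signal: no degradation information
        if np.unique(values).size < min_unique:
            continue  # constant apart from outliers
        keep.append(name)
    return keep

signals = {
    "s1": [5.0] * 100,                 # constant -> dropped by std test
    "s2": [5.0] * 99 + [900.0],        # outlier passes std test -> dropped by unique test
    "s3": np.linspace(0.0, 1.0, 100),  # genuine trend -> kept
}
kept = select_sensors(signals)  # ['s3']
```

The second criterion is exactly why the standard-deviation test alone is insufficient: the outlier in `s2` inflates its standard deviation well past the threshold, yet the signal carries no trend.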
This is often done by taking the means and standard deviations of the sensor signals for each unit and applying Eq. (4),

\hat{x}_d = \frac{x_d - \mu_d}{\sigma_d} \qquad (4)

where \hat{x}_d is the normalized signal and \mu_d and \sigma_d are the mean and standard deviation of sensor signal d. Normalization is often carried out this way, by calculating the mean and standard deviation based on the sensor data for a particular unit. However, with multiple operating conditions, such as in the datasets FD002 and FD004 (see Table 1), (Pasa, Medeiros, & Yoneyama, 2019) showed it is often more useful to normalize the sensor data based on a particular operating condition. Hence, for these datasets the normalization of the data can be done by applying Eq. (5) (Pasa et al., 2019),

\hat{x}_d = \sum_{c=1}^{N_c} \mathbb{1}_c \odot \frac{x_d - \mu_d^{(c)}}{\sigma_d^{(c)}} \qquad (5)

In this case, \mu_d^{(c)} and \sigma_d^{(c)} are the mean and standard deviation of sensor signal d given it is in operating condition c, N_c denotes the total number of operating conditions (e.g. 6 for FD002 and FD004), \odot is the element-wise multiplication operator and \mathbb{1}_c = 1 when the data at that time point is in operating condition c, otherwise \mathbb{1}_c = 0. The different operating conditions can be found through different combinations of the three operational setting variables. For FD001 and FD003 the settings have little change during operation and hence there is only one operating condition. FD002 and FD004 have 6 distinct clusters/combinations the operational settings could fall under; hence, they have 6 operating conditions. A clustering algorithm can be used to classify the sensor values at any time into one of these operating conditions. In this case the K-means clustering algorithm was used for simplicity. The comparison between using the operating-condition normalization (Eq. (5)) and the standard signal-based normalization (Eq. (4)), applied on the same unit, is shown in Figures 5 and 6. Secondly, a short-time Fourier transform is performed on these signals to achieve a time-frequency representation, i.e. a plot of intensities (based on Fourier coefficients) along the time-frequency axes. The transform is applied to the signal starting from the initial time to the current time.
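Per-operating-condition normalization reduces to standard normalization applied within each cluster. The sketch below assumes the condition labels have already been assigned (e.g. by K-means on the operational settings); the toy two-regime signal is hypothetical.

```python
import numpy as np

def normalize_by_condition(x, conditions):
    """Normalize a 1-D sensor signal separately within each operating
    condition: subtract that condition's mean and divide by its standard
    deviation (the per-condition form of z-score normalization)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for c in np.unique(conditions):
        mask = conditions == c
        out[mask] = (x[mask] - x[mask].mean()) / x[mask].std()
    return out

# toy signal that jumps between two operating regimes
x = np.array([10.0, 11.0, 9.0, 100.0, 110.0, 90.0])
conditions = np.array([0, 0, 0, 1, 1, 1])
z = normalize_by_condition(x, conditions)
```

Normalizing the whole signal at once would leave the regime jump dominating the normalized values; normalizing per condition removes it, so the remaining variation reflects degradation rather than the operating point.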
Figure 6. Normalization based on operating-condition mean and standard deviation applied on unit 1's sensor values in the training dataset.

These time-frequency output intensities are stored in a matrix whose size is based on the time window size chosen. Each of the signals also has varying length; hence, to store batches of input data and have consistent output sizes for the CNN, the inputs should have the same size. To achieve this, bilinear interpolation is used to convert the time-frequency signals to a 30 × 30 "image-like" format.
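The resize step can be sketched with a vectorized bilinear interpolation. This is an illustrative implementation, not the paper's code; the 45 × 17 input spectrogram is an arbitrary example of a variable-size input.

```python
import numpy as np

def bilinear_resize(img, out_h=30, out_w=30):
    """Resize a 2-D spectrogram to a fixed (out_h, out_w) grid with
    bilinear interpolation, so all inputs share the same image size."""
    in_h, in_w = img.shape
    # sample positions in the source image for each output pixel
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # fractional offsets, as column vector
    wx = (xs - x0)[None, :]  # fractional offsets, as row vector
    # interpolate along x on the top and bottom rows, then along y
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

spec = np.random.default_rng(0).random((45, 17))  # variable-size spectrogram
img = bilinear_resize(spec)                       # always 30 x 30
```

Because the sample grid spans the full source image, the corner pixels of the output coincide exactly with the corner pixels of the input.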
The testing and training data preparation differed in one major aspect: the testing signals were converted to spectrograms using the entire signal, while the training signals were broken up into smaller signals and then converted. The reason for this is that the testing signals have a corresponding RUL value and are not at their end of life, while all the training signals would have a target of RUL = 0 if the entire signal was used. Hence, the training signals were split into multiple signals based on a hyperparameter called split, which determined the proportion of the signal to be converted into a spectrogram and used as an input. This split hyperparameter calculated the proportion of the signal to be used through Eq. (6),

p_i = \frac{i}{split + 1}, \quad i = 1, ..., split \qquad (6)

Here, p_i is a multiplier that determines the proportion of the signal used. For example, if L is the total length of the signal then L_i = L × p_i would be the length of the cut signal used to create a spectrogram. Note that split cuts the training signal into a number of smaller signals through the different values of i = 1, ..., split. The denominator of the equation is split + 1 so that the entire signal is never used, as the entire signal has a corresponding RUL value of zero, which is not useful.
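Eq. (6) is a one-liner; the sketch below shows the proportions it generates and the resulting truncated signal lengths for a hypothetical signal of 200 cycles with split = 4.

```python
def split_proportions(split):
    """Proportions p_i = i / (split + 1) for i = 1..split; the
    denominator split + 1 ensures the full signal (RUL = 0) is never used."""
    return [i / (split + 1) for i in range(1, split + 1)]

# a 200-cycle training signal with split = 4 yields four truncated signals
L = 200
lengths = [int(round(L * p)) for p in split_proportions(4)]  # [40, 80, 120, 160]
```

Each truncated signal gets its own spectrogram and a nonzero RUL target, multiplying the number of training examples per engine by split.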
Finally, the target RULs were altered to incorporate a maximum RUL value. The maximum RUL signifies that the network cannot estimate the RUL beyond this point, as the input data is not yet in the degradation stage. This was noticed when training Multilayer Perceptron (MLP) networks on the C-MAPSS dataset, as the network's early RUL estimates would stay relatively constant up to a point and then start to decline (Heimes, 2008). Hence, here the target RULs all had a maximum value of r_max = 130 cycles, which also allows for more direct comparison to other prognostics methods which apply the same r_max. By setting a maximum RUL value, not only is the predictive performance increased, it also signifies the RUL value above which the network cannot make a reasonable RUL prediction. The point at which the RUL starts to decline from r_max also signifies the start of degradation, as the network now has data which it can use to predict the RUL with more confidence.
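The capped target is a simple piecewise-linear function of time; the sketch below uses the r_max = 130 from the text and a hypothetical engine that fails at cycle 300.

```python
def rul_target(total_life, t, r_max=130):
    """Piecewise-linear RUL target: capped at r_max while the engine is
    healthy, then declining linearly to zero at end of life."""
    return min(r_max, total_life - t)

# an engine that fails at cycle 300: constant at 130 until cycle 170,
# then decreasing linearly to 0 at failure
targets = [rul_target(300, t) for t in (0, 100, 170, 250, 300)]  # [130, 130, 130, 50, 0]
```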

METHODOLOGY
This section states the details of the experiments performed on the data using the different network architectures. The general process used to estimate the RUL will be stated as well as the specifics of the hyperparameters used to construct the network. Both the ResNet and NODE-CNNs have similar processes for estimating the RUL, therefore, a general overview is given below which is relevant to both architectures. The general steps involved with transforming the input data into an output RUL estimate are as follows, 1. A separate CNN is used to downsample (reduce the dimensions of) the prepared input data described in the previous section.
2. This step applies either the ResNet or NODE to extract features from the downsampled input data.
3. The final output from the ResNet/NODE is flattened to a vector and the input to a standard feedforward MLP network. The output of this MLP is the RUL estimate of the machinery.
Bayesian optimization tools were used to optimize the hyperparameters. The software package Ax which uses the package BoTorch as a backend (Balandat et al., 2019) was used to tune the hyperparameters using Bayesian optimization. Bayesian optimization uses a curve-fitting method known as Gaussian process regression to optimize the hyperparameters based on some criterion (Frazier, 2018). It takes a vector of hyperparameters as inputs and outputs some criterion or loss value. In the deep learning case the hyperparameters are used to train the neural network; after training the criterion can be calculated and this represents the point on a curve. The goal of Bayesian optimization is to find the vector of input hyperparameters that minimizes the criterion which is the equivalent of finding the minimum of the curve. Each new criterion calculated for a set of input hyperparameters is treated as a data point describing the curve and Bayesian methods are used to update the curve based on this new data.
The criterion used for hyperparameter optimization in this work is the normalized score function (Eq. (7)),

s = \frac{1}{N} \sum_{i=1}^{N} s_i, \quad s_i = \begin{cases} e^{-d_i/13} - 1, & d_i < 0 \\ e^{d_i/10} - 1, & d_i \ge 0 \end{cases}, \quad d_i = \hat{y}_i - y_i \qquad (7)

The score function was designed to result in higher loss values for RUL estimates that are larger than the actual RUL (Saxena, Goebel, Simon, & Eklund, 2008). This is done to encourage more conservative estimates so that failures do not occur before the prediction from the estimated RUL.

Here \hat{y}_i is the estimated RUL of unit i from the network, y_i is the actual RUL of unit i and N is the total number of units in the dataset. Notice an average score is used (referred to as the normalized score); hence, it is the score found in (Saxena et al., 2008) divided by the total number of units N.
The criterion is evaluated on a validation set, separate from the dataset the network is trained on when tuning the hyperparameters. Here the validation set was acquired from the original training data which is further split into a training dataset and a validation dataset. An 80%/20% (training/validation) split was used on the original training data in this particular case. The new training set is used to optimize the network during the hyperparameter optimization while the validation set acts as a testing dataset that the network did not see during training. The validation loss therefore represents a testing loss that can be used to optimize the hyperparameters. Note that the testing dataset is left alone and untouched by any optimization process to avoid overfitting.
This hyperparameter optimization method was chosen as it is more computationally efficient than other optimization methods such as random search, which randomly tests different hyperparameter combinations. Bayesian optimization reduces the amount of time required to find well-performing hyperparameters compared to other methods, which may find more optimal hyperparameters but require a longer time to search for them. The hyperparameters which are being tuned are:

• Learning rate
• Batch size
• Time window for STFT
• split (see Eq. (6))
• Weight decay (L2 regularization on the network parameters)
• Training epochs

Once the hyperparameters are chosen, the networks can be trained on the full training dataset. The network is trained using the Mean Squared Error loss function (Eq. (8)),

MSE = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \qquad (8)

ResNet
For the ResNet experiment, the time-frequency data was the input to an initial set of downsampling convolutional layers which extracted features and reduced the dimensionality of the data. The output of the downsampling layers became the inputs to the first ResNet block, while the outputs of one ResNet block were the inputs of the next. The output of the final ResNet block was flattened and passed through a feedforward network which would output the RUL estimate. Table 4 shows the basic layout of the ResNet architecture used as well as the size of the output tensors of each layer. Note that the network sizes were simply chosen based on other examples of CNNs used in machinery prognostics tasks, and further optimization could be done if necessary.

Figure 7. An illustration of the ResNet architecture layout used in this experiment. Note each rectangular block is a CNN and \oplus is an addition operator.

The network architecture affects the overall performance of the trained network. Other hyperparameters, such as the kernel/filter size of the convolutional layers or the number of filters the convolution layers use (convolution channel size), directly affect the network architecture and therefore also affect performance. The ResNet hyperparameters used in the experiments are stated in section A of the appendix.

NODE-CNN
Much like the ResNet described previously, the NODE-CNN layout (Figure 8) involves a downsampling layer that reduces the dimensions of the input data and extracts features. The main difference between the ResNet and the NODE-CNN network layout is that the ResNet blocks described in the previous section are replaced with a NODE-CNN block. This block is effectively a CNN whose output (the rate of change of the hidden network variables) is used in a numerical ODE solver to propagate the hidden state forward one layer. Finally, like the ResNet, a feedforward network is used to transform the output of the previous CNN network to a single value, which is the RUL estimate. The NODE-CNN layout and data output sizes of each layer are shown in Table 5. Similarly to the ResNet sizes, the hidden layer sizes here are not optimized but simply chosen based on similar CNNs used in prognostics tasks.
The NODE-CNN is similar to the ResNet; however, the ResNet blocks can be seen as using Euler's method to solve for each block's output. For the NODE-CNN, the Runge-Kutta method is used instead, which improves how the trajectory of the differential equation is propagated through the network layer space. The reason for using Runge-Kutta instead of an adaptive solver is the more reliable performance of using a fixed number of steps (the solver will not stop halfway through due to stiffness of the ODE). It was also chosen because of the size of the problem and the networks involved, which caused large computation times when using adaptive solvers. Hence, the Runge-Kutta (RK) solver was used to train the network reliably and in a reasonable amount of time.

Figure 8. An illustration of the NODE-CNN architecture used for this experiment. Note that f(x, t) is the CNN that describes the gradient at any point (x, t) and is used in the numerical ODE solver to propagate the trajectory forward. In this case t is not time but the depth of the NODE network.

As with the ResNet, hyperparameters were chosen through the Bayesian optimization procedure. The NODE-CNN hyperparameters used in the experiments are stated in section B of the appendix.
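To make the fixed-step solver concrete, the sketch below propagates a hidden state from t = 0 to t = 1 with classical 4th-order Runge-Kutta, the scheme used in the NODE-CNN block. It is illustrative only: the toy dynamics dx/dt = −x stand in for the actual CNN f(x, t).

```python
import numpy as np

def rk4_solve(f, x0, t0=0.0, t1=1.0, n_steps=4):
    """Propagate state x0 from t0 to t1 with fixed-step 4th-order Runge-Kutta.

    In a NODE block, f would be the CNN giving dx/dt; here any callable
    f(x, t) -> dx/dt works.
    """
    x, t = x0, t0
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        k1 = f(x, t)
        k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x

# Toy dynamics standing in for the CNN: dx/dt = -x, whose exact
# solution at t = 1 is x0 * exp(-1).
x0 = np.array([1.0, 2.0])
x1 = rk4_solve(lambda x, t: -x, x0, n_steps=10)
```

The number of steps plays the role of network depth: the same f is evaluated along the trajectory, and the state at t = 1 is what the feedforward head maps to an RUL estimate.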

RESULTS AND DISCUSSION
To compare the performance of the models, the RMSE between the RUL estimates for the testing dataset and the actual RUL values is used as a performance indicator. The RMSE can be calculated as shown in Eq. (9). The RMSE is commonly used in many similar works dealing with the C-MAPSS dataset, allowing for direct comparison between different methods. RMSE values for each of the C-MAPSS testing datasets were calculated using Eq. (9) (Table 6). The other testing performance criterion used was the score originally introduced in (Saxena et al., 2008). A normalized version of this score, shown in Eq. (7), was used for hyperparameter tuning; however, other prognostics methods use the non-normalized version (Eq. (10)), which will therefore be used for comparison here. Table 7 shows the (non-normalized) scores compared with the other prognostics methods. Table 8 shows the times taken to train the networks as well as the times taken to evaluate each testing dataset. Data preparation occurred before training; hence, these times are not fully indicative of real-time performance. However, the times taken for the STFT to generate the images are given in the table as well. Different hyperparameter choices (e.g. the time window) affected the time values; hence, the times for each network and dataset are given. Also note that, due to the different number of spectrograms/images being generated depending on the split hyperparameter, the total times can differ significantly; for ease of comparison, the time taken to generate one image is stated in the table under the column "STFT times".
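As a reference for the two criteria, here is a minimal sketch of the RMSE and the non-normalized asymmetric score from Saxena et al. (2008), in which late predictions (estimate above the true RUL) are penalized more heavily than early ones; the constants 13 and 10 follow the commonly used form of that score.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root-mean-square error between estimated and true RUL values.
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def cmapss_score(y_true, y_pred):
    """Asymmetric score from Saxena et al. (2008).

    d > 0 (late prediction) is penalized with exp(d/10) - 1,
    d < 0 (early prediction) with exp(-d/13) - 1.
    """
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sum(np.where(d < 0, np.exp(-d / 13.0), np.exp(d / 10.0)) - 1.0))
```

For a 10-cycle error, predicting late costs exp(1) − 1 ≈ 1.72 while predicting early costs exp(10/13) − 1 ≈ 1.16, reflecting that overestimating RUL is the more dangerous failure mode.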
Each unit in the testing dataset went through the same data preparation process as the training set, which involves normalizing based on the operating condition and applying the STFT to the signals. Figure 9 shows box and whisker plots generated from the difference between estimated and target RUL values for each unit in the testing datasets. The figure gives an idea of the spread of errors between the estimated and true RUL values over all the units in each testing dataset. Note that negative values correspond to an estimate that was less than the true RUL, while positive values indicate a larger RUL estimate. Finally, Figure 10 shows how the RMSE on each testing dataset changes with the number of steps in the numerical ODE solver. In this case the variable associated with "time" describes the depth of the NODE network. The network depth is controlled by choosing an interval (t ∈ [0, 1]) and splitting it into an even number of steps. Hence, increasing the number of steps increases the resolution, which in standard ODEs would increase the accuracy at the expense of computational time. For this test, the NODE-CNN was retrained with 10 steps using 50 epochs, so there were more step options to test before reaching zero steps when reducing the number of steps. Note that the newly trained networks do not have optimal RMSE values, but this experiment is only concerned with the relative differences in RMSE as the number of steps changes. In regards to hardware, the training and testing of the neural networks were performed on an NVIDIA GeForce RTX 2080 GPU.
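A minimal, NumPy-only sketch of the spectrogram step is given below; the window length and hop size are illustrative, not the hyperparameters used in the experiments.

```python
import numpy as np

def stft_spectrogram(signal, win_len=64, hop=16):
    """Magnitude spectrogram via a windowed sliding FFT (a minimal STFT).

    Returns an array of shape (n_freq_bins, n_frames) that can serve as a
    single-channel input image for the CNNs.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)], axis=1)
    # Real FFT of each windowed frame; magnitudes form the image columns.
    return np.abs(np.fft.rfft(frames, axis=0))

# Example: a 50 Hz tone sampled at 1 kHz, 256 samples long.
x = np.sin(2 * np.pi * 50 * np.arange(256) / 1000.0)
spec = stft_spectrogram(x)  # shape (win_len // 2 + 1, n_frames) = (33, 13)
```

In practice a library routine such as `scipy.signal.stft` would be used, but the hand-rolled version makes the frame/window/FFT structure of the preprocessing explicit.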

Discussion
The ResNet and NODE-CNN showed good performance compared to the other methods in Tables 6 and 7. In particular, they performed especially well on the FD002 and FD004 datasets. This is most likely because these datasets have multiple operating conditions, and it was shown in (Pasa et al., 2019) that performing operating-condition-based normalization (Eq. (5)) improves performance in this case. Hence, ResNet and NODE-CNN outperformed the other methods on the FD002 and FD004 datasets, which did not use operating-condition-based normalization (only the methods from (Pasa et al., 2019) used this normalization technique). Comparing the performance on datasets FD002 and FD004 to (Pasa et al., 2019), it can be seen that NODE-CNN outperformed their regular (non-ResNet) CNN. The improved performance could be because ResNet and NODE outperform regular CNNs in image recognition tasks (He et al., 2016; Chen et al., 2018); hence, the NODE-CNN architecture may simply be an improvement over the simpler CNN architecture used in (Pasa et al., 2019). The datasets with only one operating condition (FD001 and FD003) also showed relatively good performance compared to state-of-the-art techniques. However, the best results for these datasets were achieved by (Listou Ellefsen et al., 2019), who used a Restricted Boltzmann Machine to extract degradation features and input them into an LSTM, which estimated the RUL. It should also be noted that (Listou Ellefsen et al., 2019) did not employ the same operating-condition-based normalization as was done in this study. Hence, it may still retain state-of-the-art performance for the FD002 and FD004 datasets if the sensor signals were normalized this way. The multi-scale CNN also outperformed the NODE-CNN and ResNet in terms of the RMSE measure for FD001 and FD003, but NODE-CNN and ResNet had better scores.
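A sketch of operating-condition-based normalization of the kind referred to by Eq. (5): each sensor column is z-scored within its operating condition. The function name and the use of per-condition mean/standard deviation are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def normalize_by_condition(X, conditions):
    """Z-score each sensor column separately within each operating condition.

    X is (n_samples, n_sensors); conditions labels each row's regime, so
    sensors measured under different regimes become directly comparable.
    """
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for c in np.unique(conditions):
        mask = conditions == c
        mu = X[mask].mean(axis=0)
        sigma = X[mask].std(axis=0)
        # Guard against constant sensors (zero standard deviation).
        out[mask] = (X[mask] - mu) / np.where(sigma > 0, sigma, 1.0)
    return out

# Two regimes with very different sensor scales:
X = np.array([[1.0], [3.0], [10.0], [30.0]])
conditions = np.array([0, 0, 1, 1])
Z = normalize_by_condition(X, conditions)  # each regime now has mean 0, std 1
```

Without this step, regime shifts dominate the signal and mask the degradation trend, which is the likely reason the multi-condition datasets (FD002/FD004) benefit most.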
This could be because both ResNet and NODE-CNN had their hyperparameters tuned to minimize the normalized score criterion and hence would produce more conservative RUL estimates, but not necessarily better RMSEs. However, a more detailed analysis would be required to state this as fact. While not outperforming all these techniques completely, the aim of showing that NODEs and ResNets can compete with state-of-the-art techniques has been achieved.
Using the intuition of an ODE modeling a dynamic system, it may be expected that changing the time step-size used in the numerical ODE solver would affect the accuracy of the NODE-CNN. Altering this step-size would normally provide a trade-off between accuracy and computation time. (Queiruga, Erichson, Taylor, & Mahoney, 2020) mentioned that the solution accuracy of NODEs does not change with step-size in the same manner as a standard ODE would. For their tests they used NODEs to model the dynamics of a mechanical system, while here the NODE-CNN was used to model the unknown and more abstract dynamics of the hidden variables in a neural network. Hence, their accuracy refers to the error between the "true" trajectory and the NODE's trajectory at each time-point, while the error discussed here is the error between the final estimates the network produced and some known value. Their results showed that NODEs using 4th-order Runge-Kutta did achieve some increased accuracy with increased resolution (decreasing time step), but not as much as would be expected from a numerical ODE solver. Figure 10 shows that the number of steps did not affect the testing loss on the final outputs except in extreme cases (the number of steps altered by a factor of 10 from the original training value), at which point the error increased. This could be because the number of steps corresponds to the depth of the NODE-CNN and not necessarily to an increase in resolution (and therefore accuracy). This suggests one must be careful when designing the network and specifying the problem if a relationship between step-size and solution accuracy is to be achieved. Here the network was used as a standard black-box feedforward network and so did not benefit from the accuracy-time trade-off. (Queiruga et al., 2020) also mention how NODEs can be altered to achieve this property; section 5.2 discusses this further.

Figure 9. Box and whisker plots for the ResNet (9a) and NODE-CNN (9b) RUL differences (ŷ = estimated RUL and y = true RUL) over all the units for each testing dataset.
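The Figure 10 experiment (changing the solver step count only at evaluation time) can be mimicked on toy dynamics. Here dx/dt = −x stands in for the trained CNN; for such smooth dynamics RK4 converges quickly, so the final state barely changes with the step count, which is the behavior one would hope for when trading compute for resolution.

```python
import numpy as np

def rk4(f, x0, n_steps, t0=0.0, t1=1.0):
    # Fixed-step 4th-order Runge-Kutta over [t0, t1].
    x, t, h = x0, t0, (t1 - t0) / n_steps
    for _ in range(n_steps):
        k1 = f(x, t)
        k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(x + h * k3, t + h)
        x, t = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4), t + h
    return x

# Solve the same ODE with different step counts, as in the Figure 10 test
# where only the evaluation-time step count was varied.
x0 = np.array([1.0])
outputs = {n: rk4(lambda x, t: -x, x0, n)[0] for n in (2, 5, 10, 50)}
```

For the actual NODE-CNN the learned dynamics are not guaranteed to be this well-behaved, which is consistent with the observation that the testing RMSE only degraded at extreme step-count changes.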

Future Work
The continuous nature of NODEs allows for adaptive step-sizes or increasing/decreasing the resolution of the step-size. This flexibility could provide certain benefits when it comes to network performance, although as mentioned in the discussion above, one must be careful with how the problem is framed if this property is to be achieved. (Queiruga et al., 2020) essentially achieve this by making the network parameters a function of time and refining step-sizes as training time increases. A network could be trained with higher resolution (a low step-size over a specific interval) and then the step-size could be reduced or customized depending on the computing power available or the required accuracy of the task. Similar topics on how to train NODEs with higher resolution and then lower it while retaining an acceptable accuracy can be found in (Queiruga et al., 2020). Another benefit could come from the differential equation formulation of the network. This allows for interesting avenues to explore by exploiting the properties of differential equations. For example, NODEs can be extended to neural stochastic differential equations to incorporate uncertainty into the network outputs (Li, Wong, Chen, & Duvenaud, 2020).

Figure 10. Graphs showing testing performance as measured by the RMSE metric versus the number of steps used in the numerical integration method (4th-order Runge-Kutta). Here the network was retrained using 10 steps in the ODE solver; otherwise the same hyperparameters mentioned in Appendix B were used, but 50 epochs were used for faster training as the relative performance is of interest here. During testing the number of steps was altered and the RMSE was recalculated for the testing dataset as shown on the plot. The vertical dotted line indicates the number of steps the network was trained on.
Uncertainty quantification of RUL estimates is considered a challenge in machinery prognostics (Lei et al., 2018); hence, this differential equation approach to neural networks could provide interesting solutions to this problem.

CONCLUSIONS
It has been shown that the ResNet and NODE-CNN can perform competitively with other state-of-the-art machinery prognostics methods. This could lead to NODEs being utilized and adapted for machinery prognostics tasks due to the potential benefits of treating neural networks as differential equations.
The main benefits discussed were adaptive network sizes and quantifying uncertainty in estimates. NODEs describe a vector field which, in this case, propagates the hidden network variables, starting from the inputs and letting the variables at the end of the trajectory become the outputs. The number of steps to get from the initial layer to the final layer, however, can be chosen by the user. Hence, there is a possibility of training a network and then, depending on the computing power available, picking the appropriate step-size for the ODE solver at the cost of solution accuracy. Ideas like this have been explored in (Queiruga et al., 2020) but have yet to be applied in a machinery prognostics setting. The other benefit of using NODEs, or extending them to other differential equation types, is that differential equations are well understood compared to deep neural networks. With this knowledge, NODEs could be extended to solve stochastic differential equations (Li et al., 2020) and include confidence intervals for the RUL estimates. Including confidence intervals for RUL estimates would capture uncertainty and allow for improved decision making in machinery prognostics tasks.

A. ResNet Hyperparameters
This appendix contains the hyperparameters used for training the ResNet on different datasets. Table 9 states the values of these hyperparameters.

B. NODE-CNN Hyperparameters
This appendix contains the hyperparameters used for training the NODE-CNN on different datasets. Table 10 states the values of these hyperparameters.