Data-driven Prognostics with Predictive Uncertainty Estimation using Ensemble of Deep Ordinal Regression Models

Prognostics, or Remaining Useful Life (RUL) estimation, from multi-sensor time series data is useful for enabling condition-based maintenance and ensuring high operational availability of equipment. We propose a novel deep learning based approach for prognostics with uncertainty quantification that is useful in scenarios where: (i) access to labeled failure data is scarce due to the rarity of failures, (ii) future operational conditions are unobserved, and (iii) inherent noise is present in the sensor readings. All three are unavoidable sources of uncertainty in the RUL estimation process, often resulting in unreliable RUL estimates. To address (i), we formulate RUL estimation as an Ordinal Regression (OR) problem and propose LSTM-OR: a deep Long Short Term Memory (LSTM) network based approach to learn the OR function. We show that LSTM-OR naturally allows for the incorporation of censored operational instances in training along with the failed instances, leading to more robust learning. To address (ii), we propose a simple yet effective approach to quantify predictive uncertainty in RUL estimation models by training an ensemble of LSTM-OR models. Through empirical evaluation on the C-MAPSS turbofan engine benchmark datasets, we demonstrate that LSTM-OR is significantly better than the commonly used deep metric regression based approaches for RUL estimation, especially when failed training instances are scarce. Further, our uncertainty quantification approach yields high-quality predictive uncertainty estimates while also leading to improved RUL estimates compared to the single best LSTM-OR models.


INTRODUCTION
In the current digital era, streaming data is ubiquitous. In the context of the Industrial Internet of Things, remote health monitoring services driven by sensor-data analytics are becoming increasingly popular. Data-driven approaches for anomaly detection, diagnostics, prognostics and optimization have been proposed to provide operational support to engineers, ensure high reliability and availability of equipment, and to optimize the operational cost (Da Xu et al. (2014)).
Typically, a large number of sensors (order of hundreds or sometimes thousands) are installed to capture the operational behavior of complex equipment with various sub-systems interacting with each other.
In this work, we address two important practical challenges in deep learning based RUL estimation approaches. The challenges addressed and the corresponding key contributions of this work are as follows: Challenge-I: Deep neural networks are prone to overfitting and typically require a large number of labeled training instances to avoid it. If the failure time of an instance is known, a target RUL can be obtained at any time before the failure time. However, labeled training instances for RUL estimation are few, as failures are rare. Also, any operational instance (i.e., any instance whose failure time is not known or which has not yet failed) is considered censored, as a target RUL cannot be determined for it.
We note that deep RNN (Heimes (2008); Malhotra, TV, et al. (2016); Gugulothu et al. (2017); Zheng et al. (2017); Zhang et al. (2018)) and Convolutional Neural Network (CNN) (Babu et al. (2016)) based approaches formulate RUL estimation as a metric regression (MR) problem, where a normalized estimate of RUL is obtained from a time series of sensor data via a non-linear regression function learned from the data. This MR formulation of RUL estimation cannot directly leverage the censored data typically encountered in RUL estimation scenarios.
Key Contribution-I: In addition to using failed instances for training, we propose a novel approach to leverage the censored instances in a supervised learning setting, in turn increasing the training data and leading to more robust RUL estimation models. We cast RUL estimation as an ordinal regression (Harrell (2001)) problem (instead of the typically used metric regression formulation) and propose an LSTM-OR (Long Short Term Memory network based Ordinal Regression) approach for RUL estimation. We show that partially labeled training instances can be generated from the readily available operational (non-failed) instances to augment the labeled training data in the ordinal regression setting and build more robust RUL estimation models. We empirically show that LSTM-OR outperforms LSTM-MR by effectively leveraging censored data when the number of failed instances available for training is small.

Challenge-II:
The black-box nature of deep neural networks makes it difficult to interpret the predictions/estimates and, in turn, gauge their reliability. It is, therefore, desirable to quantify the predictive uncertainty in deep neural network based predictions of RUL: it can aid engineers and operators in risk assessment and decision making while accounting for the reliability of the predictions.

Key Contribution-II:
We propose a simple yet effective approach to quantify uncertainty based on an ensemble of LSTM-OR models (using a similar idea as in Lakshminarayanan et al. (2017), as detailed in Section 5). An ensemble of deep LSTM-OR models leads to improved RUL estimation performance, and the empirical standard deviation (ESD) of the predictions from the LSTM-OR models provides an approximate measure of uncertainty. We empirically show that when ESD (i.e., the uncertainty in estimation) is low, the corresponding error in estimation is also low, making ESD a useful uncertainty quantification metric.

Organization of the paper:
We provide an overview of related literature in Section 2. In Section 3, we briefly introduce deep LSTM networks as used to build our deep OR models. We provide details of the LSTM-OR and uncertainty quantification approaches in Sections 4 and 5, respectively. We provide experimental evaluation details and observations in Section 6, and finally conclude in Section 7.

RELATED WORK
Trajectory Similarity based RUL estimation: An important class of approaches for RUL estimation is based on trajectory similarity, e.g. Wang et al. (2008); Khelif et al. (2014); Lam et al. (2014); Malhotra, TV, et al. (2016); Gugulothu et al. (2017). These approaches compare the health index trajectory or trend of a test instance with the trajectories of failed train instances to estimate RUL using a distance metric such as Euclidean distance. Such approaches work well when trajectories are smooth and monotonic in nature, but are likely to fail in scenarios with noise or intermittent disturbances (e.g. spikes, operating mode changes, etc.), as the distance metric may not be robust to such scenarios (Gugulothu et al. (2017)).
Metric Regression based RUL estimation: Another class of approaches is based on metric regression. Unlike trajectory similarity based methods, which rely on comparison of trends, metric regression methods attempt to learn a function to directly map sensor data to RUL, e.g. Heimes (2008); Benkedjouh et al. (2013); Dong et al. (2014); Babu et al. (2016); Gugulothu et al. (2017); Zheng et al. (2017); Vishnu et al. (2018). Such methods can better deal with non-monotonic and noisy scenarios by learning to focus on the relevant underlying trends irrespective of noise. Within metric regression methods, a few consider non-temporal models such as Support Vector Regression for learning the mapping from the sensor values at a given time instance to RUL, e.g. Benkedjouh et al. (2013); Dong et al. (2014).
Temporal models for RUL estimation: Deep temporal models such as those based on RNNs (Heimes (2008); Malhotra, TV, et al. (2016); Gugulothu et al. (2017); Zheng et al. (2017)) or Convolutional Neural Networks (CNNs) (Babu et al. (2016)) can capture degradation trends better than non-temporal models, and have been shown to perform better. Moreover, these models can be trained in an end-to-end manner without requiring feature engineering. Despite these advantages, deep models are prone to overfitting in the often-encountered practical scenarios where the number of failed instances is small and most of the data is censored. Our approach based on ordinal regression provisions for such scenarios by using censored instances in addition to failed instances to obtain more robust models.
Ordinal Regression for Survival Analysis: Ordinal regression has been extensively used for applications such as age estimation from facial images (Chang et al. (2011); Yang et al. (2013); Niu et al. (2016); Liu et al. (2017)); however, these applications are restricted to non-temporal image data using Convolutional Neural Networks. Cheng et al. (2008); Luck et al. (2017) use feed-forward neural networks based ordinal regression for survival analysis. To the best of our knowledge, the proposed LSTM-OR approach is the first attempt to leverage ordinal regression based training using temporal LSTM networks for RUL estimation.
Deep Survival Analysis: A set of techniques for deep survival analysis have been proposed in the medical domain, e.g. Katzman et al. (2018); Luck et al. (2017). On similar lines, an approach to combine deep learning and survival analysis for asset health management has been proposed in Liao & Ahn (2016). However, it is not clear how such approaches can be adapted for RUL estimation applications, as they focus on estimating the survival probability at a given point in time and cannot provide RUL estimates. Further, Chapfuwa et al. (2018) propose an approach that leverages adversarial learning for time-to-event modeling in the health domain. In contrast, LSTM-OR is capable of providing RUL estimates using time series sensor data.
Uncertainty quantification in RUL estimation models: Uncertainty analysis in data-driven equipment health monitoring is an active area of research and an unsolved problem. The approaches described in Sankararaman & Goebel (2013), Sankararaman et al. (2013) use analytical algorithms, unlike sampling-based methods, to estimate the uncertainty in prognostics. They consider various sources of uncertainty, such as the loading and operating conditions of the system at hand, inaccurate sensor measurements, etc., to quantify their combined effect on RUL predictions. The task is formulated as an uncertainty propagation problem, where the various types of uncertainty are propagated through state space models until failure. Also, the future states of the system are estimated using the state space models and are used to arrive at an estimate of RUL. Unlike these approaches, we focus on estimating RUL as well as predictive uncertainty by using an ensemble of deep neural networks to model the time series of sensor data available up to a given point in time, without predicting the future states of the system. Our approach does not rely on assumptions such as those needed in a state-space model. Further, domain knowledge of the underlying dynamics of a system is not needed to quantify uncertainty, and therefore our approach is much simpler to adopt.
Uncertainty quantification for deep neural networks: Recently, Gal & Ghahramani (2016) proposed the use of dropout at inference time as a Bayesian approximation, which has been applied to RUL estimation. Further, Lakshminarayanan et al. (2017) proposed the use of an ensemble of neural networks for predictive uncertainty estimation and demonstrated its use in comparison to Bayesian methods. Similarly, we use an ensemble of LSTM networks to estimate the empirical uncertainty in RUL predictions.

BACKGROUND: DEEP LSTM NETWORKS
We use a variant of LSTMs (Hochreiter & Schmidhuber (1997)) as described in Zaremba et al. (2014) in the hidden layers of the neural network. Hereafter, we denote column vectors by bold small letters and matrices by bold capital letters. For a hidden layer with h LSTM units, the values of the input gate i_t, forget gate f_t, output gate o_t, hidden state z_t, and cell state c_t at time t are computed using the current input x_t, the previous hidden state z_{t−1}, and the cell state c_{t−1}, where i_t, f_t, o_t, z_t, and c_t are real-valued h-dimensional vectors.
Consider W_{n1,n2}: R^{n1} → R^{n2} to be an affine transform of the form z ↦ Wz + b for a matrix W and vector b of appropriate dimensions. In a multi-layered LSTM network with L layers and h units in each layer, the hidden state z_t^l at time t for the l-th hidden layer is obtained from the hidden state at t−1 for that layer, z_{t−1}^l, and the hidden state at t for the previous (l−1)-th hidden layer, z_t^{l−1}. The time series goes through the following transformations iteratively at the l-th hidden layer for t = 1 through T, where T is the length of the time series:

(i_t^l, f_t^l, o_t^l, g_t^l) = (σ, σ, σ, tanh) W_{2h,4h}( D(z_t^{l−1}), z_{t−1}^l )

where the cell state is given by c_t^l = f_t^l ⊙ c_{t−1}^l + i_t^l ⊙ g_t^l, and the hidden state by z_t^l = o_t^l ⊙ tanh(c_t^l), with ⊙ denoting element-wise multiplication. We use dropout for regularization (Pham et al. (2014)), applied only to the non-recurrent connections, ensuring information flow across time steps for any LSTM unit. The dropout operator D(·) randomly sets the dimensions of its argument to zero with probability equal to the dropout rate. The sigmoid (σ) and tanh activation functions are applied element-wise.
In a nutshell, this series of transformations for t = 1 ... T converts the input time series x = x_1 ... x_T of length T into a fixed-dimensional vector z_T^L ∈ R^h. We therefore represent the LSTM network by a function f_LSTM such that z_T^L = f_LSTM(x; W), where W represents all the parameters of the LSTM network.
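The transformations above can be sketched numerically. Below is a minimal single-layer NumPy illustration (function and variable names are ours, not from the paper): the four gate pre-activations are folded into one affine map of the concatenated input and previous hidden state, mirroring the joint W transform, and the final hidden state serves as the fixed-dimensional encoding.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(x, W, b, h):
    """Minimal single-layer LSTM sketch: maps a time series x (T x p)
    to a fixed h-dimensional vector z_T. W jointly maps the
    concatenation [x_t; z_{t-1}] to the 4h gate pre-activations."""
    T, p = x.shape
    z = np.zeros(h)          # hidden state z_0
    c = np.zeros(h)          # cell state c_0
    for t in range(T):
        a = W @ np.concatenate([x[t], z]) + b  # (4h,) pre-activations
        i = sigmoid(a[:h])           # input gate
        f = sigmoid(a[h:2 * h])      # forget gate
        o = sigmoid(a[2 * h:3 * h])  # output gate
        g = np.tanh(a[3 * h:])       # candidate cell update
        c = f * c + i * g            # cell state update
        z = o * np.tanh(c)           # hidden state update
    return z
```

Stacking L such layers (feeding each layer's hidden states as the next layer's inputs, with dropout on those non-recurrent connections) yields the encoder f_LSTM used throughout.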

Terminology
Consider a learning set D = {(x^i, r^i)}_{i=1}^n of n failed instances, where r^i is the target RUL in a given unit of measurement, e.g., number of cycles or operational hours. Hereafter, we omit the superscript i in this section for better readability and give all formulations for a single instance (unless stated otherwise).
We consider an upper bound r_u on the possible values of RUL since, in practice, it is not possible to predict too far ahead in the future; if r > r_u, we clip the value of r to r_u. The usually defined goal of RUL estimation via Metric Regression (MR) is to learn a mapping f_MR: X → [0, r_u]. With these definitions, we next describe the LSTM-based Ordinal Regression (LSTM-OR) approach as summarized in Figure 2(a), and then describe how we incorporate censored data into the LSTM-OR formulation. Instead of mapping an input time series to a real-valued number as in MR, we break the range [0, r_u] of RUL values into K intervals of length c = r_u/K each, where each interval is then treated as a discrete variable. The j-th interval corresponds to ((j−1)c, jc], and r is mapped to the k-th interval with k = ⌈r/c⌉, where ⌈·⌉ denotes the ceiling function. We consider K binary classification sub-problems for the K discrete variables (intervals): classifier C_j solves the binary classification problem of determining whether r ≤ jc. We train an LSTM network for the K binary classification tasks simultaneously by modeling them together as a multi-label classification problem: we obtain the multi-label target vector y = [y_1, ..., y_K] ∈ {0, 1}^K from r such that y_j = 1 if r ≤ jc, and y_j = 0 otherwise.
For example, consider a scenario where K = 5 and r maps to the third interval, i.e., k = 3. The target is then y = [0, 0, 1, 1, 1], as illustrated in Figure 3(a). Effectively, the goal of LSTM-OR is to learn a mapping f_OR: X → {0, 1}^K by minimizing the loss function L_OR given by:

L_OR = −(1/K) Σ_{j=1}^{K} [ y_j log ŷ_j + (1 − y_j) log(1 − ŷ_j) ]    (3)

where ŷ = σ(W_C z_T^L + b_C) is the estimate for the target y, W represents the parameters of the LSTM network, and W_C and b_C are the parameters of the layer that maps z_T^L to the output sigmoid layer.
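As a concrete plain-Python sketch of this target construction and the averaged binary cross-entropy loss (helper names are ours, not from the paper):

```python
import math

def ordinal_target(r, r_u=130, K=10):
    """Multi-label target for RUL r: y_j = 1 iff r <= j*c, c = r_u/K."""
    c = r_u / K
    r = min(r, r_u)                  # clip RUL at the upper bound r_u
    k = max(1, math.ceil(r / c))     # interval index that r falls into
    return [0] * (k - 1) + [1] * (K - k + 1)

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy averaged over the K classifier outputs."""
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for t, p in zip(y, y_hat)
    ) / len(y)
```

For the example in the text (K = 5, r in the third interval), `ordinal_target` returns [0, 0, 1, 1, 1].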

Using Censored Data for Training
For any censored instance, data is available only up to a time T prior to failure, and the failure time F is unknown (illustrated in Figure 3(b)). Therefore, the target RUL r is also unknown. However, at any time t_0 s.t. 1 ≤ t_0 < T, it is known that the RUL r > T − t_0, since the instance is operational at least until T. Considering x = x_1 ... x_{t_0} as the input time series, we next show how to assign labels to a few of the dimensions y_j of the target vector y. Suppose T − t_0 maps to the interval k′ = ⌈(T − t_0)/c⌉. Since T − t_0 < r, we have k′ ≤ k. As k is unknown (because r is unknown) and only k′ ≤ k is known, the target vector y can be obtained only partially:

y_j = 0 for j = 1, ..., k′ − 1; y_j unknown (masked) for j ≥ k′.    (4)

For all j ≥ k′, the corresponding binary classifier targets are masked, as shown in Figure 3(b), and the outputs from these classifiers are not included in the loss function for the instance. The loss function given by Equation 3 can thus be modified to include the censored instances in training as:

L_ORC = −(1/K′) Σ_{j=1}^{K′} [ y_j log ŷ_j + (1 − y_j) log(1 − ŷ_j) ]    (5)

where K′ = k′ − 1 for a censored instance and K′ = K for a failed instance.
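A sketch of the censored-label masking (helper names are ours; masked positions are represented as None and simply skipped in the loss, matching the averaging over the unmasked classifiers):

```python
import math

def censored_partial_target(min_rul, r_u=130, K=10):
    """Partial target for a censored instance with known lower bound
    min_rul = T - t0 on the RUL: y_j = 0 is known for j < k' (the RUL
    exceeds those intervals); positions j >= k' are masked (None)."""
    c = r_u / K
    k_prime = max(1, math.ceil(min_rul / c))
    return [0] * (k_prime - 1) + [None] * (K - k_prime + 1)

def masked_bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy averaged over the unmasked classifiers only."""
    terms = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        for t, p in zip(y, y_hat)
        if t is not None
    ]
    return sum(terms) / len(terms) if terms else 0.0
```

A failed instance uses the full target from the previous section, so the same loss routine covers both cases.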

Mapping OR estimates to RUL
Once trained, each of the K classifiers provides a probability ŷ_j that the RUL does not exceed the upper limit jc of the interval corresponding to the j-th classifier. We obtain the point estimate r̂ for r from ŷ for a test instance as follows (similar to Chang et al. (2011)):

r̂ = c Σ_{j=1}^{K} (1 − ŷ_j)    (6)

It is worth noting that, once learned, the LSTM-OR model can be used in an online manner for operational instances: at the current time instance t, the sensor data from the latest T time instances can be input to the model to obtain the RUL estimate r̂ at t.
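A minimal sketch of one such mapping (our reading, assuming ŷ_j approximates P(r ≤ jc) as in the target construction above; the function name is hypothetical): each term (1 − ŷ_j)·c contributes the expected mass of interval j, giving an expected-value style estimate.

```python
def rul_point_estimate(y_hat, r_u=130):
    """Point estimate of RUL from the K classifier outputs, assuming
    y_hat[j-1] approximates P(r <= j*c) with c = r_u / len(y_hat).
    Summing c * (1 - y_hat[j-1]) over j accumulates the expected
    number of intervals the RUL exceeds."""
    c = r_u / len(y_hat)
    return c * sum(1.0 - p for p in y_hat)
```

For the earlier example (K = 5, r_u = 130, ŷ ≈ [0, 0, 1, 1, 1]), this yields 2c = 52, i.e., an estimate at the boundary of the true interval (2c, 3c].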

PREDICTIVE UNCERTAINTY QUANTIFICATION USING ENSEMBLE OF LSTM-OR MODELS
Uncertainty quantification is very important in the case of RUL estimation, as the equipment and operations involved are often critical in nature, and reliable predictions close to (but, of course, prior to) failures can help avoid catastrophic failures by raising suitable alarms beforehand. Lack of sufficient training data, inherent noise in sensor readings, and uncertainty in the future usage and operation of equipment are a few sources of uncertainty in data-driven predictive models for RUL estimation. Quantifying uncertainty in RUL estimates can help ground engineers and operators arrive at more informed decisions compared to scenarios where only RUL estimates are available, without any metric indicating whether the model is certain about the estimate or not. In other words, uncertainty quantification of the RUL estimate enhances the reliability of data-driven models. This is even more relevant for deep neural network based estimation models due to their otherwise black-box nature.
An uncertainty metric can be considered reliable if: (i) for low uncertainty values, i.e., whenever the model is confident about its estimates, the corresponding errors in the RUL estimates are low, and for high uncertainty values the corresponding errors are high; (ii) it produces RUL estimates with low uncertainty when a failure is approaching, i.e., the model is able to estimate the RUL precisely and with a high degree of certainty close to failures.
To quantify the predictive uncertainty in the target vector estimate ŷ and the corresponding RUL estimate r̂, we train an ensemble of LSTM-OR models. We consider an ensemble learning approach similar to that introduced in Lakshminarayanan et al. (2017): each of the m models in the ensemble is trained on all the training data, using a different (random) initialization of the parameters (W, W_C, b_C) and a random shuffling of the training instances. The final RUL estimate of the ensemble is the simple average of the RUL estimates of the m models, and the empirical standard deviation (ESD) of these estimates is used as an approximation of the predictive uncertainty in RUL estimation. More specifically, as shown in Figure 2(b), we train m LSTM-OR models such that we have m RUL estimates r̂_i for any instance, i = 1, ..., m. We obtain the point estimate r̂ for r from the r̂_i as:

r̂ = (1/m) Σ_{i=1}^{m} r̂_i    (7)

The uncertainty û in terms of ESD is given by:

û_ESD = sqrt( (1/(m−1)) Σ_{i=1}^{m} (r̂_i − r̂)^2 )    (8)

We normalize the uncertainty values û_ESD using the minimum and maximum uncertainty values across all instances in a hold-out validation set through min-max normalization.
We also consider other measures of uncertainty quantification based on entropy (similar to Park & Simoff (2015)), as explained in Appendix A.1, but found ESD to be the most robust measure of uncertainty. We support this with experimental evaluation in Section 6.3.
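The ensemble combination and ESD computation amount to a few lines (a sketch with hypothetical names; we use the unbiased 1/(m−1) form of the standard deviation here, and the paper's exact normalization may differ):

```python
import math

def ensemble_rul_and_esd(estimates):
    """Combine m per-model RUL estimates: the ensemble point estimate
    is their simple average, and the empirical standard deviation (ESD)
    of the estimates serves as the predictive-uncertainty proxy."""
    m = len(estimates)
    r_hat = sum(estimates) / m
    var = sum((r - r_hat) ** 2 for r in estimates) / (m - 1)
    return r_hat, math.sqrt(var)
```

The returned ESD would then be min-max normalized against a hold-out validation set before thresholding, as described above.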

EXPERIMENTAL EVALUATION
We evaluate RUL estimation and uncertainty quantification approaches using the publicly available C-MAPSS aircraft turbofan engine benchmark datasets (Saxena & Goebel (2008)).
We provide an overview of the dataset in Section 6.1. We consider metric regression models and ordinal regression models trained only on failed instances as baseline models, and compare the following approaches for RUL estimation: (i) MR: LSTM-MR using failed instances only (as in Zheng et al. (2017); Heimes (2008); Gugulothu et al. (2017)), (ii) OR: LSTM-OR using failed instances only, with the loss as in Equation 3, (iii) ORC: LSTM-OR leveraging censored data along with failed instances, with the loss as in Equation 5, (iv) ORCE: a simple average ensemble of ORC models. We describe the RUL estimation approaches in Section 6.2. Further, to evaluate the uncertainty quantification approach described in Section 5, we study the relationship of the uncertainty estimates with error and ground-truth RUL in Section 6.3, while also introducing novel metrics to evaluate the efficacy of uncertainty estimates in the context of prognostics.

Dataset Description
We consider datasets FD001 and FD004 from the simulated turbofan engine datasets (Saxena & Goebel (2008)). For simulating the scenario with censored instances, a percentage p_c ∈ {0, 50, 70, 90} of the training and validation instances is randomly chosen, and the time series for each such instance is randomly truncated at one point prior to failure. We then treat these truncated instances as censored (currently operational) and their actual RUL values as unknown. The remaining (100 − p_c)% of the instances are considered failed.
Further, the time series of each instance thus obtained (censored and failed) is truncated at 20 random points in its life prior to failure, and the exact RUL r for failed instances and the minimum possible RUL T − t_0 for the censored instances (as in Section 4 and Figure 3) at the truncation points are used for training the models. The number of instances thus obtained for training and validation for p_c = 0 is given in Table 2. The test set remains the same as in the benchmark dataset across all scenarios (with no censored instances). The MR and OR approaches cannot utilize the censored instances, as their exact RUL targets are unknown, while ORC can utilize the lower bound on the RUL targets to obtain partial labels as per Equation 4.
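The censoring simulation described above can be sketched as follows (hypothetical names; each instance is a full run-to-failure time series, and a censored copy keeps only a random prefix, so only the lower bound T − t_0 on its RUL is known):

```python
import random

def simulate_censoring(instances, p_c, seed=0):
    """Mark a fraction p_c of run-to-failure time series as censored by
    truncating each at a random point before failure (RUL label is then
    unknown); the remaining instances are kept as failed."""
    rng = random.Random(seed)
    censored, failed = [], []
    for series in instances:
        if rng.random() < p_c:
            t = rng.randrange(1, len(series))  # random truncation point
            censored.append(series[:t])        # only a lower bound on RUL survives
        else:
            failed.append(series)
    return censored, failed
```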
An engine may operate in different operating conditions and may also have different failure modes at the end of its life. The number of operating conditions and failure modes for both datasets is given in Table 1. FD001 has only one operating condition, so we ignore the corresponding three sensors such that p = 21, whereas FD004 has six operating conditions determined by the three operating-condition variables. We map these six operating conditions to a 6-dimensional one-hot vector as in Zheng et al. (2017), such that p = 27.

RUL Estimation
In this section, we define the performance metrics used to evaluate our RUL estimation models, i.e., ORC and ORCE. Further, we discuss our experimental settings, followed by results and observations. We also compare our proposed RUL estimation models with existing RUL estimation models.

Performance Metrics for Evaluating RUL Estimation Models
There are several metrics proposed to evaluate the performance of prognostics models (Saxena, Celaya, et al. (2008)).
We measure the performance of our models in terms of the Timeliness Score (S) and Root Mean Squared Error (RMSE). For a test instance i, the error in estimation is given by e_i = r̂_i − r_i. The timeliness score for N test instances is given by:

S = Σ_{i=1}^{N} ( exp(γ |e_i|) − 1 )

where γ = 1/τ_1 if e_i < 0, else γ = 1/τ_2. Usually, τ_1 > τ_2, so that late predictions are penalized more than early predictions. We use τ_1 = 13 and τ_2 = 10 as proposed in Saxena, Goebel, et al. (2008). The lower the value of S, the better the performance. The root mean squared error is given by:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} e_i^2 )
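The two metrics can be sketched as (hypothetical function names; e_i = r̂_i − r_i, so negative errors are early predictions and non-negative errors are late):

```python
import math

def timeliness_score(errors, tau1=13.0, tau2=10.0):
    """Timeliness score S: asymmetric exponential penalty, with late
    predictions (e_i >= 0) using tau2 < tau1 and thus penalized more."""
    s = 0.0
    for e in errors:
        gamma = 1.0 / tau1 if e < 0 else 1.0 / tau2
        s += math.exp(gamma * abs(e)) - 1.0
    return s

def rmse(errors):
    """Root mean squared error over the per-instance errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

Note the asymmetry: a late error of +10 cycles and an early error of −13 cycles incur the same penalty e − 1 ≈ 1.718.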

Experimental Setup
We consider r_u = 130 cycles for all models, as used in Babu et al. (2016); Zheng et al. (2017). For OR and ORC, we consider K = 10, giving an interval length c = 13. For training the MR models, a normalized RUL in the range 0 to 1 (where 1 corresponds to a target RUL of 130 or more) is given as the target for each input. We use a maximum time series length of T = 360; for any instance with more than 360 cycles, we take the most recent 360 cycles. Also, we use standard z-normalization to normalize the input time series sensor-wise, using the mean and standard deviation of each sensor from the train set.
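The sensor-wise z-normalization step can be sketched as (NumPy, hypothetical names; statistics are computed from the train set only and applied to both splits to avoid leakage):

```python
import numpy as np

def z_normalize(train, test):
    """Sensor-wise (column-wise) z-normalization: subtract the train-set
    mean and divide by the train-set standard deviation per sensor."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant sensors
    return (train - mu) / sd, (test - mu) / sd
```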
The hyperparameters h (number of hidden units per layer), L (number of hidden layers), and the learning rate are chosen from the sets {50, 60, 70, 80, 90, 100}, {2, 3}, and {0.001, 0.005}, respectively. We use a dropout rate of 0.2 for regularization and a batch size of 32 during training. The models are trained for a maximum of 2000 iterations with early stopping. The best hyperparameters are obtained via grid search by minimizing the respective loss function on the validation set.
For ORCE, we consider an ensemble of m = 6 models (we considered up to 10 models in the ensemble and found m = 6 to work best across the scenarios considered). The models are trained with the best hyperparameters selected from the corresponding hyperparameter sets of ORC. While training the different models, we ensure random initialization of the neural network parameters and random shuffling of the training instances. To select m = 6 models from the 10 available models, we order the models in ascending order of their loss on the validation set and select the first 6.

Results and Observations
As summarized in Table 3, we observe the following. As the number of failed training instances (n_f) decreases, the performance of all models degrades (as expected). However, importantly, for scenarios with small n_f, ORCE significantly outperforms MR and OR. For example, as shown in Figure 4, with p_c = 90% (i.e., with n_f = 8 and 20 for FD001 and FD004, respectively), ORCE performs significantly better than MR, showing 19.5% and 6.8% improvement over MR in terms of RMSE for FD001 and FD004, respectively. The gains in terms of the timeliness score S are higher because of the exponential nature of S (refer to Section 6.2.1). It is evident from Figure 4 that ORCE performs better than ORC and MR in terms of both RMSE and S. The performance gap between ORCE and ORC increases significantly in terms of the timeliness score S for the FD004 dataset when p_c = 90%, as shown in Figure 4(d). Due to the small number of failed training instances (p_c = 90%), some models in the ensemble are not trained properly and yield high errors even for instances with low RUL r, which results in very high values of S. For ORC, the overall value of S tends to be high because we report the average of the timeliness scores of the m models in the ensemble. This is not the case for ORCE: since the instance-wise RUL estimates are obtained as the average of the m estimates from the m models in the ensemble, the performance of ORCE in terms of S is better than that of ORC.
While MR and OR have access to only a small number of failed instances n_f for training, ORCE and ORC have access to the n_f failed instances as well as partial labels from the n_c censored instances. Therefore, MR and OR models tend to overfit, while ORC and ORCE models are more robust.
We also provide a comparison with existing deep CNN-based (Babu et al. (2016)) and LSTM-based (Zheng et al. (2017)) MR approaches in Table 4. ORC (same as OR for p_c = 0%) performs comparably to the existing MR methods. More importantly, as noted above, ORC and ORCE may be advantageous and more suitable for practical scenarios with few failed training instances.

Uncertainty Quantification
We introduce the metrics used to evaluate the performance of the proposed ensemble-based uncertainty estimation approach. Using these metrics, we demonstrate the efficacy of the proposed approach from a practical point of view. We compare the proposed ESD (Equation 8) and two variants of entropy (as introduced in Appendix A.1) for uncertainty evaluation.

Performance Metrics for Evaluating Uncertainty Quantification Methods
We expect our model to be certain (have high certainty) when the RUL estimates are correct, and less certain (have low certainty) for highly erroneous RUL estimates. We consider an RUL estimate to be correct if the absolute error |r̂ − r| ≤ τ_e, and to be certain if the corresponding uncertainty estimate û ≤ τ_u. Also, for evaluating the performance of the uncertainty metrics, we restrict the target RUL r to a maximum of r_u = 130, because we train our models with a maximum target RUL of r_u and so r̂ cannot be greater than r_u. Even if the model confidently estimates r̂ close to r_u, a true value of r much greater than r_u would lead to a high error and would not allow a proper performance evaluation of the uncertainty metrics. Under the above considerations, we measure precision and recall to evaluate the performance of the uncertainty quantification approach as follows. Precision is the fraction of test instances with uncertainty below a threshold τ_u that also have error ≤ τ_e. Recall is the fraction of test instances with error ≤ τ_e that also have uncertainty ≤ τ_u. More specifically:

P = #(û ≤ τ_u ∧ |r̂ − r| ≤ τ_e) / #(û ≤ τ_u),  R = #(û ≤ τ_u ∧ |r̂ − r| ≤ τ_e) / #(|r̂ − r| ≤ τ_e)    (9)

where #(X) denotes the number of instances satisfying condition X.
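The precision and recall computation over the uncertainty and error thresholds can be sketched as (hypothetical function name):

```python
def uncertainty_precision_recall(errors, uncerts, tau_e=10.0, tau_u=0.2):
    """Precision: fraction of certain estimates (u <= tau_u) that are
    also correct (|e| <= tau_e). Recall: fraction of correct estimates
    that are also certain."""
    certain = [abs(e) <= tau_e for e, u in zip(errors, uncerts) if u <= tau_u]
    correct = [u <= tau_u for e, u in zip(errors, uncerts) if abs(e) <= tau_e]
    p = sum(certain) / len(certain) if certain else 0.0
    r = sum(correct) / len(correct) if correct else 0.0
    return p, r
```

Sweeping tau_u with tau_e fixed then traces out the precision-recall curves used in the comparisons below.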
Further, it is desirable to have very certain and correct estimates close to failure to avoid fatal consequences upon failure. To evaluate performance from this point of view, we analyze the relation of uncertainty with nearness to failure. It is desirable to have low error as well as low uncertainty when r is low. To evaluate this aspect, we study the variation in precision for different RUL thresholds τ_r, considering test instances with low ground-truth RULs. The modified precision P_l in this context is given by:

P_l = #(r ≤ τ_r ∧ û ≤ τ_u ∧ |r̂ − r| ≤ τ_e) / #(r ≤ τ_r ∧ û ≤ τ_u)    (10)

For given thresholds τ_r and τ_u, P_l quantifies the fraction of test instances with actual RUL r ≤ τ_r and uncertainty û ≤ τ_u that also have error ≤ τ_e. Comparing ESD vs. entropy (ENT) as the uncertainty metric: Precision and recall (as in Equation 9) are used to compare the two approaches for uncertainty estimation. Precision-recall curves are obtained by varying the threshold on uncertainty, 0.1 ≤ τ_u ≤ 1.5, while keeping τ_e = 10. We observe that for R ≥ 0.1, P is higher in the case of ESD for the FD001 dataset, as shown in Figure 5(a). Similar behavior is observed for the FD004 dataset for R ≥ 0.2, as shown in Figure 5(b). We further plot the F1 score (computed from the P and R of Equation 9) while varying τ_u, shown in Figures 5(c) and 5(d), which shows that ESD is a better uncertainty quantification metric than ENT. (We also analyze the instances for which ESD behaves unexpectedly, showing low uncertainty while having a high error in the RUL estimate; the observations are given in Appendix A.2.) Relation between uncertainty and error: For a reliable model, RUL estimates with high certainty must be accurate, i.e.
have low RUL estimation errors. To evaluate the performance of the uncertainty metric in this context, we consider instances with uncertainty û ≤ τ_u and compute the average error in RUL estimation for these instances. As shown in Figure 6(a), we observe that for low values of τ_u, the average error thus computed is also low, indicating that the model is more accurate when it is more certain. Further, as expected, we observe an increase in the average error with increasing τ_u, suggesting that the RUL estimates tend to be more erroneous when the model is uncertain.
Relation between uncertainty and actual RUL: To quantify the relationship between actual RUL and uncertainty, P_l is calculated as in Equation 10. P_l is computed for τ_r ranging from 10 to 130, keeping τ_u and τ_e fixed at 0.2 and 10, respectively. From a practical point of view, higher precision P_l at lower values of τ_r is desirable, so that instances approaching failure are handled correctly and confidently. We observe this trend in our case as well, as shown in Figure 6(b): for τ_r = 20, P_l = 0.917 on the FD001 dataset and P_l = 0.857 on FD004, i.e. the model is certain and accurate 91.7% and 85.7% of the time, respectively.
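The modified precision P_l can be sketched directly from its verbal definition; this is our own minimal implementation, assuming err holds absolute RUL estimation errors.

```python
import numpy as np

def precision_low_rul(err, unc, rul, tau_r=20.0, tau_u=0.2, tau_e=10.0):
    """P_l: among instances near failure (rul <= tau_r) that the model is
    confident about (unc <= tau_u), the fraction with error <= tau_e."""
    eligible = (rul <= tau_r) & (unc <= tau_u)
    if not eligible.any():
        return float("nan")
    return (err[eligible] <= tau_e).mean()
```

Sweeping τ_r from 10 to 130 with τ_u = 0.2 and τ_e = 10 reproduces the setup behind Figure 6(b).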

CONCLUSION AND DISCUSSION
In this work, we have proposed a novel approach for RUL estimation using deep ordinal regression based on multilayered LSTM neural networks. We have argued that the ordinal regression formulation is more robust than metric regression, as the former allows for incorporation of additional labeled data from censored instances. We found that leveraging censored instances significantly improves performance when the number of failed instances is small. In the future, it would be interesting to see if a semi-supervised approach (e.g. as in Yoon et al. (2017); Gugulothu, TV, et al. (2018)) with initial unsupervised pre-training of LSTMs using failed as well as censored instances can further improve the robustness of the models. Furthermore, an extension of the proposed approach to the commonly encountered non-stationary setting, using approaches similar to Saurav et al. (2018), can be considered. It is to be noted that although we have experimented with LSTMs for ordinal regression, our OR approach is generic enough to be useful with any neural network architecture, e.g. CNNs.
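To make the contrast with metric regression concrete, one common ordinal-regression target encoding can be sketched as follows. This illustrates the idea behind Figure 3; the per-cycle bin granularity and the use of -1 as a mask value are our illustrative assumptions, not necessarily the paper's exact scheme.

```python
import numpy as np

def or_target(rul, K=130, censored=False):
    """Binary target vector of length K, where entry k answers "is RUL > k?".

    For a failed instance the RUL is known exactly, so every entry is known.
    For a censored (still-operational) instance we only know RUL >= rul, so
    entries beyond the censoring point are unknown and marked -1, to be
    masked out of the loss; metric regression offers no such partial target.
    """
    t = np.zeros(K)
    k = min(int(rul), K)
    t[:k] = 1.0               # RUL certainly exceeds these thresholds
    if censored:
        t[k:] = -1.0          # unknown beyond the last observed cycle
    return t
```

A failed instance with RUL 5 yields a fully specified vector, while a censored one observed for 5 cycles yields the same prefix of ones followed by masked entries, which is what lets censored data contribute to training.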
Further, we have proposed a simple yet effective approach to quantify uncertainty in the RUL estimates using an average ensemble of the deep ordinal regression models. The proposed empirical standard deviation based metric provides accurate predictive uncertainty estimates: we observe low errors in RUL estimation for low uncertainty values. Moreover, the model is found to be accurate with high certainty when the remaining useful life is very low, i.e. when the instance is approaching failure. It would be interesting to see if the ensemble-based approach for uncertainty quantification can be extended to metric regression models as well, using uncertainty methods for regression as proposed in Lakshminarayanan et al. (2017).
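The ensemble aggregation and the two uncertainty metrics compared in this work can be sketched as follows. This is our own minimal sketch: the exact point-estimate and entropy formulas used in the paper may differ, and the per-threshold probability layout is an assumption.

```python
import numpy as np

def ensemble_rul_uncertainty(member_probs, bin_width=1.0):
    """member_probs: (M, K) array; member_probs[m, k] is model m's estimate
    of P(RUL > threshold k). Each member's RUL estimate is taken as the
    expected number of thresholds exceeded; the ensemble averages these.
    ESD is the empirical std. dev. of the member estimates; ENT is the mean
    binary entropy of the averaged per-threshold probabilities."""
    member_ruls = member_probs.sum(axis=1) * bin_width   # per-member estimates
    rul_hat = member_ruls.mean()                         # ensemble estimate
    esd = member_ruls.std()                              # ESD uncertainty
    p = member_probs.mean(axis=0).clip(1e-12, 1 - 1e-12)
    ent = float((-(p * np.log(p) + (1 - p) * np.log(1 - p))).mean())
    return rul_hat, esd, ent
```

Disagreement among members directly inflates ESD, which is why it tracks estimation error: members that have seen similar degradation patterns in training tend to agree, while extrapolation beyond the training distribution makes them diverge.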
Figure 1. Deep Ordinal Regression versus Deep Metric Regression.

Figure 2. Steps in LSTM-OR and Ensemble of LSTM-OR.

Figure 3. Target vector creation for failed versus censored instance.

Figure 4. Percentage gain of ORC and ORCE over MR with decreasing number of failed instances (n_f) in training.

Figure 5. Comparison of ESD and ENT as measures of uncertainty in terms of (a)-(b) Precision-Recall curves; and (c)-(d) F1 scores with varying τ_u. ESD is a more robust uncertainty metric than ENT.

Figure 6. Performance evaluation of ESD as an uncertainty metric, showing: (a) lower uncertainty values correspond to low RUL estimation errors; (b) uncertainty estimates are highly precise and correct close to failure, i.e. when RUL is low.
). The training sets (train FD001.txt and train FD004.txt) of the two datasets contain time series of readings for 24 sensors (21 sensors and 3 operating condition variables) of several instances (100 in FD001 and 249 in FD004) of a turbofan engine from the beginning of usage till end of life. The time series for the instances in the test sets (test FD001.txt and test FD004.txt) are pruned some time prior to failure, such that the instances are still operational and their RUL needs to be estimated. The actual RUL values for the test instances are available in RUL FD001.txt and RUL FD004.txt. We randomly sample 20% of the available training set instances, as given in Table 1, to create a validation set for hyperparameter selection.

Table 1. Number of train, validation, and test instances. Here, OC: number of operating conditions; FM: number of fault modes.

Table 2 .
Table 3. Number of truncated instances.

Comparison of various LSTM-based approaches considered, in terms of RMSE and Timeliness Score (S), for the FD001 and FD004 datasets. n_f and n_c denote the number of failed and censored instances in the training set, respectively.

Performance comparison of the proposed approach with existing approaches in terms of RMSE and Timeliness Score (S).
For the sake of brevity, we restrict the results and observations to the uncensored scenario, i.e. p_c = 0%. Similar results and observations for models corresponding to the censored scenarios are presented in Appendix A.2.