Predicting Remaining Useful Life using Time Series Embeddings based on Recurrent Neural Networks

We consider the problem of estimating the remaining useful life (RUL) of a system or a machine from sensor data. Many approaches for RUL estimation based on sensor data make assumptions about how machines degrade. Additionally, sensor data from machines is noisy and often suffers from missing values in many practical settings. We propose Embed-RUL: a novel approach for RUL estimation from sensor data that does not rely on any degradation-trend assumptions, is robust to noise, and handles missing values. Embed-RUL utilizes a sequence-to-sequence model based on Recurrent Neural Networks (RNNs) to generate embeddings for multivariate time series subsequences. The embeddings for normal and degraded machines tend to be different, and are therefore found to be useful for RUL estimation. We show that the embeddings capture the overall pattern in the time series while filtering out the noise, so that the embeddings of two machines with similar operational behavior are close to each other, even when their sensor readings have significant and varying levels of noise content. We perform experiments on a publicly available turbofan engine dataset and a proprietary real-world dataset, and demonstrate that Embed-RUL outperforms the previously reported state-of-the-art on several metrics.


INTRODUCTION
It is quite common in the current era of the 'Industrial Internet of Things' [9] for a large number of sensors to be installed for monitoring the operational behavior of machines. Consequently, there is considerable interest in exploiting data from such sensors for health monitoring tasks such as anomaly detection, fault detection, as well as prognostics, i.e., estimating remaining useful life (RUL) of machines in the field. 1 Copyright © 2017 Tata Consultancy Services Ltd.
We highlight some of the practical challenges in using data-driven approaches for health monitoring and RUL estimation, and propose an approach that can handle these challenges:
1) Health degradation trend: In complex machines with several components, it is difficult to build physics-based models for health degradation analysis. Many data-driven approaches assume a degradation trend, e.g., exponential degradation [5,8,34,39,45]. This is particularly useful in cases where there is no explicit measurable parameter of the health of a machine. Such an assumption may not hold in other scenarios, e.g., when a component in a machine is approaching failure, the symptoms in the sensor data may initially be intermittent and then grow over time in a non-exponential manner.
2) Noisy sensor readings: Sensor readings often suffer from varying levels of environmental noise, which entails the use of denoising techniques. The amount of noise may even vary across sensors.
3) Partial unavailability of sensor data: Sensor data may be partially unavailable due to several reasons such as network communication loss and damaged or faulty sensors.
4) Complex temporal dependencies between sensors: Multiple components interact with each other in a complex way, leading to complex dependencies between sensor readings. For example, a change in one sensor may lead to a change in another sensor after a delay of a few seconds or even hours. It is desirable to have an approach that can capture the complex operational behavior of machine(s) from sensor readings while accounting for temporal dependencies.
In this paper, we propose Embed-RUL: an approach for RUL estimation using Recurrent Neural Networks (RNNs) to address the above challenges. An RNN is used as an encoder to obtain a fixed-dimensional representation that serves as an embedding for multi-sensor time series data. The health of a machine at any point of time can be estimated by comparing an embedding computed using recent sensor history with representative embeddings computed for periods of normal behavior. Our approach for RUL estimation does not rely on degradation trend assumptions, can handle noise and missing values, and can capture complex temporal dependencies among the sensors. The key contributions of this work are:
• We show that time series embeddings or representations obtained using an RNN Encoder are useful for RUL estimation (refer Section 5.2).
• We show that embeddings are robust and perform well for the RUL estimation task even under noisy conditions, i.e., when sensor readings are noisy (refer Section 5.3).
• Our approach compares favorably to previous benchmarks for RUL estimation [24] on the turbofan engine dataset [38] as well as on a real-world pump dataset (refer Section 5.2).
The rest of the paper is organized as follows: We provide a review of related work in Section 2. Section 3 motivates our approach and briefly introduces existing RNN-based approaches for machine health monitoring and RUL estimation using sensor data. In Section 4 we explain our proposed approach for RUL estimation. We provide experimental details and observations in Section 5, and conclude in Section 6.

RELATED WORK
Data-driven approaches for RUL estimation: Several approaches for RUL estimation based on sensor data have been proposed. A review of these approaches can be found in [40]. [11,19] propose estimating RUL directly by calculating the similarity between the sensors without deriving any health estimates. Similarly, Support Vector Regression [18], RNNs [14], and Deep Convolutional Neural Networks [3] have been proposed to estimate the RUL directly by modeling the relations among the sensors without estimating the health of the machines. However, unlike Embed-RUL, none of these approaches focus on robust RUL estimation, and in particular, on robustness to noise.
Robust RUL Estimation: Wavelet filters have been proposed to handle noise for robust performance degradation assessment in [33]. In [16], an ensemble of models is used to ensure that predictions are robust. Our proposed approach handles noise in sensor readings by learning robust representations from sensor data via RNN Encoder-Decoder (RNN-ED) models.
Time series representation learning: Unsupervised representation learning for sequences using RNNs has been proposed for applications in various domains including text, video, speech, and time series (e.g., sensor data). Long Short Term Memory (LSTM) [15] based encoders trained using the encoder-decoder framework have been proposed to learn representations of video sequences [42]. Pre-trained LSTM encoders based on autoencoders are used to initialize networks for classification tasks and are shown to achieve improved performance [10] for text applications. A Gated Recurrent Unit (GRU) [7] based encoder named Timenet [25] has recently been proposed to obtain embeddings for time series from several domains. The embeddings are shown to be effective for time series classification tasks. Stacked denoising autoencoders have been used to learn hierarchical features from sensor data in [46]. These features are shown to be useful for anomaly detection. However, to the best of our knowledge, the proposed Embed-RUL is the first attempt at using RNN-based embeddings of multivariate sensor data for machine health monitoring, and more specifically, for RUL estimation.
Other Deep learning models for Machine Health Monitoring: Various architectures based on Restricted Boltzmann Machines, RNNs (discussed in Section 3.2) and Convolutional Neural Networks have been proposed for machine health monitoring in different contexts. Many of these architectures and applications of deep learning to machine health monitoring have been surveyed in [48]. An end-to-end convolutional selective autoencoder for early detection and monitoring of combustion instabilities in high-speed flame video frames was proposed in [2]. A combination of deep learning and survival analysis for asset health management has been proposed in [20] using sequential data by stacking an LSTM layer, a feed forward layer and a survival model layer to arrive at the asset failure probability. Deep belief networks and autoencoders have been used for health monitoring of aerospace and building systems in [36]. However, none of these approaches are proposed for RUL estimation. Predicting milling machine tool wear from sensor data has been proposed using deep LSTM networks in [47]. In [49], a convolutional bidirectional LSTM network along with fully connected layers at the top is shown to predict tool wear. The convolutional layer extracts robust local features while the LSTM layer encodes temporal information. These methods model the problem of degradation estimation in a supervised manner, unlike our approach of estimating machine health using embeddings generated using seq2seq models.

BACKGROUND
Many data-driven approaches attempt to estimate the health of a machine from sensor data in terms of a health index (HI) (e.g., [34,45]). The trend of HI over time, referred to as the HI curve, is then used to estimate the RUL by comparing it with the trends of failed instances. The HI curve for a test instance is compared with the HI curve of a failed (train) instance to estimate the RUL of the test instance, as shown in Figure 1. In general, the HI curve of the test instance is compared with the HI curves of several failed instances, and a weighted average of the RUL estimates obtained from the failed instances is used as the final RUL estimate (refer Section 4.3 for details).
In Section 3.1, we introduce a simple approach for HI estimation that maps the current sensor readings to HI. Next, we introduce existing HI estimation techniques that leverage RNNs to capture the temporal patterns in sensor readings, and provide a motivation for our approach in Section 3.2.

Degradation trend assumption based HI estimation
Consider a HI curve $H = [h_1, h_2, \ldots, h_T]$, where $0 \le h_t \le 1$, $t = 1, 2, \ldots, T$. When a machine is healthy, $h_t = 1$ and when a machine is near failure or about to fail, $h_t = 0$. The multi-sensor readings $x_t \in \mathbb{R}^n$ at time $t$ can be used to obtain an estimate $\hat{h}_t$ for the actual HI value $h_t$. One way of obtaining this mapping is via a linear regression model: $\hat{h}_t = f_\theta(x_t) = \theta^T x_t + \theta_0$, where $\theta \in \mathbb{R}^n$ and $\theta_0 \in \mathbb{R}$. The parameters $\theta$ and $\theta_0$ are estimated by minimizing $\sum_{t=1}^{T} (\hat{h}_t - h_t)^2$, where the target HI curve can be assumed to follow an exponential degradation trend (e.g., [45]).
Once the mapping is learned, the sensor readings at a time instant can be used to obtain HI. Such approaches have two shortcomings: i) they rely on an assumption about the degradation trend, ii) they do not take into account the temporal aspect of the sensor data. We show that the target HI curve for learning such a mapping (i.e., learning the parameters θ and θ 0 ) can be obtained using RNN models instead of relying on the exponential assumption (refer Section 5 for details).

RNNs for Machine Health Monitoring
RNNs, especially those based on LSTM units or GRUs, have been successfully used to achieve state-of-the-art results on sequence modeling tasks such as machine translation [7] and speech recognition [13]. Recently, deep RNNs have been shown to be useful for health monitoring from multi-sensor time series data [12,23,26]. The key idea behind using RNNs for health monitoring is to learn a temporal model of the system by capturing the complex temporal as well as instantaneous dependencies between sensor readings.
Figure 1: Example of RUL estimation using curve matching.
Autoencoders have been used to discover interesting structures in the data by means of regularization such as by adding constraints on the number of hidden units of the autoencoder [29], or by adding noise to the input and training the network to reconstruct a denoised version of the input [44].
The key idea behind such autoencoders is that the hidden representation obtained for an input retains the underlying important pattern(s) in the input and ignores the noise component.
RNN autoencoders have been shown to be useful for RUL estimation [24], in which the RNN-based model learns to capture the behavior of a machine by learning to reconstruct multivariate time series corresponding to normal behavior in an unsupervised manner. Since the network is trained only on time series corresponding to normal behavior, it is expected to reconstruct the normal behavior well and perform poorly while reconstructing the abnormal behavior. This results in a small reconstruction error for normal time series and a large reconstruction error for abnormal time series. The reconstruction error is then used as a proxy to estimate the health or degree of degradation, and in turn estimate the RUL of the machine. We refer to this reconstruction error based approach for RUL estimation as Recon-RUL.
We propose to learn robust fixed-dimensional representations for multi-sensor time series data via sequence-to-sequence [4,43] autoencoders based on RNNs. Here we briefly introduce multilayered RNNs based on GRUs that serve as building blocks of sequence-to-sequence autoencoders (refer Section 4 for details).

Multilayered RNN with Dropout.
We use Gated Recurrent Units [7] in the hidden layers of the sequence-to-sequence autoencoder. Dropout is used for regularization [32,41] and is applied only to the non-recurrent connections, ensuring information flow across timesteps. For a multilayered RNN with $L$ hidden layers, the hidden state $z_t^l$ at time $t$ for the $l$-th hidden layer is obtained from $z_{t-1}^l$ and $z_t^{l-1}$ as in Equation 1. The time series goes through these transformations iteratively for $t = 1$ through $T$, where $T$ is the length of the time series.
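The layer-wise update described above can be sketched in plain Python. This is a minimal sketch that assumes the standard GRU formulation (reset gate, update gate, candidate state) with bias terms omitted and inverted dropout applied only to the input arriving from the layer below; the exact form of Equation 1 may differ. The function names (`gru_step`, `encode`, `init_layer`) and the weight names are illustrative and do not come from the paper.

```python
import math
import random

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def dropout(v, rate, rng, train):
    # Inverted dropout on the non-recurrent input only (assumption).
    if not train or rate == 0.0:
        return list(v)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in v]

def gru_step(below, prev, params, rate=0.0, rng=None, train=False):
    """One GRU update for one layer: below = z_t^{l-1}, prev = z_{t-1}^l."""
    rng = rng or random.Random(0)
    d = dropout(below, rate, rng, train)
    r = sigmoid(matvec(params["W_r"], d + prev))   # reset gate
    u = sigmoid(matvec(params["W_u"], d + prev))   # update gate
    gated = [ri * pi for ri, pi in zip(r, prev)]
    cand = [math.tanh(a) for a in matvec(params["W_p"], d + gated)]
    # Interpolate between the previous state and the candidate state.
    return [(1 - ui) * pi + ui * ci for ui, pi, ci in zip(u, prev, cand)]

def init_layer(n_in, n_hidden, rng):
    """Hypothetical helper: small random weights, no biases."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden)]
                for _ in range(n_hidden)]
    return {"W_r": mat(), "W_u": mat(), "W_p": mat()}

def encode(series, layers, rate=0.0, train=False, seed=0):
    """Run stacked GRU layers over a window for t = 1..T and return the
    concatenation of the final hidden states of all layers."""
    rng = random.Random(seed)
    states = [[0.0] * len(p["W_r"]) for p in layers]
    for x_t in series:
        below = x_t
        for l, p in enumerate(layers):
            states[l] = gru_step(below, states[l], p, rate, rng, train)
            below = states[l]
    return [a for s in states for a in s]
```

With two layers of 4 and 2 units, `encode` returns a vector of length 6, i.e. one value per recurrent unit in the stack.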

RUL ESTIMATION USING EMBEDDINGS
We consider a scenario where sensor readings over the operational life of one or multiple instances of a machine or a component are available. We denote the set of instances by $I$. For an instance $i \in I$, we consider a multi-sensor time series $x^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_{T^{(i)}}^{(i)})$ of length $T^{(i)}$, where each point $x_t^{(i)} \in \mathbb{R}^n$ is an n-dimensional vector corresponding to the readings of the $n$ sensors at time $t$. For a failed instance $i$, the length $T^{(i)}$ corresponds to the total operational life (from start to end of life), while for a currently operating instance the length $T^{(i)}$ corresponds to the elapsed operational life till the latest available sensor reading.
Typically, if $T^{(i)}$ is large, we divide the time series into windows (subsequences) of fixed length $w$. We denote a time series window from time $t_1$ to $t_2$ for instance $i$ by $x^{(i)}(t_1, t_2)$. A fixed-dimensional representation or embedding for each such window is obtained using an RNN Encoder that is trained in an unsupervised manner using RNN-ED. We train RNN-ED using time series subsequences from the entire operational life of machines (including normal as well as faulty operations). We use the embedding for a window to estimate the health of the instance at the end of that window. The RNN Encoder is likely to retain the important characteristics of machine behavior in the embeddings, and therefore discriminate between embeddings of windows corresponding to degraded behavior and those of normal behavior. We describe how these embeddings are obtained in Section 4.1, and then describe how health index curves and RUL estimates can be obtained using the embeddings in Sections 4.2 and 4.3, respectively. Figure 2 provides an overview of the steps involved in the proposed approach for RUL estimation.
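The windowing step can be illustrated with a short sketch. It assumes windows that slide by one time step, which the text does not state explicitly; the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(series, w):
    """Split a multi-sensor series (list of per-time-step vectors) into
    fixed-length windows. Returns (t, window) pairs, where t is the 1-based
    index of the last point in the window, i.e. the time at which the
    health of the instance would be estimated from that window."""
    # Stride of 1 is an assumption; a longer stride would give fewer windows.
    return [(t, series[t - w:t]) for t in range(w, len(series) + 1)]

# Example: 5 time steps of a single sensor, window length 3.
series = [[1.0], [2.0], [3.0], [4.0], [5.0]]
windows = sliding_windows(series, 3)
```

A series shorter than `w` yields no windows, so no embedding (and hence no health estimate) is available until `w` readings have been observed.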

Obtaining Embeddings using RNN Encoder-Decoder
We briefly introduce RNN Encoder-Decoder (RNN-ED) networks based on the sequence-to-sequence (seq2seq) learning framework. In general, a seq2seq model consists of a pair of multilayered RNNs trained together: an encoder RNN and a decoder RNN. Figure 3 shows the workings of an encoder-decoder pair for a sample time series $\{x_1, x_2, x_3\}$. Given an input time series $x^{(i)}(t - w + 1, t)$, the encoder RNN iterates through the points in the time series to compute the final hidden state $z_t^{(i)}$, given by the concatenation of the hidden state vectors from all the layers in the encoder, s.t. $z_t^{(i)} = [z_{t,1}^{(i)}, \ldots, z_{t,L}^{(i)}]$, where $z_{t,l}^{(i)}$ is the hidden state vector for the $l$-th layer of the encoder. The total number of recurrent units in the encoder is given by $c = \sum_{l=1}^{L} c_l$, where $c_l$ is the number of units in the $l$-th hidden layer, s.t. $z_t^{(i)} \in \mathbb{R}^c$.

Figure 2: An overview of inference using Embed-RUL. The input time series is divided into windows. Each window is passed through an RNN Encoder to obtain its embedding. The embedding $z_t$ at time $t$ is compared with the embeddings in the set $Z_{norm}$ of normal embeddings to obtain the health estimate $h_t$ ($t = 1, \ldots, T$). The HI curve is then compared with the HI curves of failed train instances in the set $H$ to obtain the RUL estimate $\hat{R}$.

Figure 3: RNN Encoder-Decoder for a toy time series.
The decoder RNN has the same network structure as the encoder, uses the hidden state $z_t^{(i)}$ as its initial hidden state, and iteratively (for $w$ steps) goes through the transformations in Equation 1 (followed by a linear output layer) to reconstruct the input time series. The overall process can be thought of as a non-linear mapping of the input multivariate time series to a fixed-dimensional vector representation (embedding) via an encoder function $f_{enc}$, followed by another non-linear mapping of the fixed-dimensional vector to a multivariate time series via a decoder function $f_{dec}$:
$$z_t^{(i)} = f_{enc}\left(x^{(i)}(t - w + 1, t)\right), \qquad x'^{(i)}(t - w + 1, t) = f_{dec}\left(z_t^{(i)}\right)$$
The reconstruction error at any point $t'$ in $(t - w + 1), \ldots, t$ is $e_{t'}^{(i)} = x_{t'}^{(i)} - x'^{(i)}_{t'}$. The overall reconstruction error for the input time series window $x^{(i)}(t - w + 1, t)$ is given by $e_t^{(i)} = \sum_{t'=t-w+1}^{t} \|e_{t'}^{(i)}\|_2$. The RNN-ED is trained to minimize the loss function given by the squared reconstruction error: $\mathcal{L} = \sum_{i \in I} \sum_{t=w}^{T^{(i)}} \left(e_t^{(i)}\right)^2$.
Typically, along with the final hidden state, an additional input is given to the decoder RNN at each time step. This input is the output of the decoder RNN at the previous time step, as used in [24]. We, however, do not give any such additional inputs to the decoder along with the final hidden state of the encoder. This ensures that the final hidden state of the encoder retains all the information required to reconstruct the time series back via the decoder RNN.
This approach of learning robust embeddings or representations for time series has been shown to be useful for time series classification in [25]. Figure 4 shows a typical example of input and output from RNN-ED, where the smoothed reconstruction suggests that the embeddings capture the necessary pattern in the input and remove noise.

Handling missing values.
In real-world data, the sensor readings tend to be intermittently missing. We include masking and delta vectors as additional inputs to the RNN-ED at each time instant (as in [6]). The masking vector helps to identify the sensors that are missing at time $t$, and the delta vector indicates the time elapsed till $t$ from the most recent non-missing values for sensors in the past. For simplicity, we omit the superscript $(i)$ denoting an instance of the machine from the notation of the masking and delta vectors defined below.
• Masking vector ($m_t$) denotes the missing sensors at time $t$, and $m_t \in \{0, 1\}^n$, where $n$ is the number of sensors. The $j$-th element of vector $m_t$ is given by:
$$m_t^j = \begin{cases} 0, & \text{if } x_t^j \text{ is missing} \\ 1, & \text{otherwise} \end{cases}$$
where $j = 1, \ldots, n$, and $x_t^j$ denotes the $j$-th element of vector $x_t$. When $m_t^j = 0$, we set $x_t^j$ to 0 or to the average value for the $j$-th sensor (we use 0 for the experiments in Section 5).
• Delta vector ($\delta_t$) indicates the time elapsed till $t$ from the most recent non-missing values for the sensors in the past, and $\delta_t \in \mathbb{R}^n$. The $j$-th element of vector $\delta_t$ is given by:
$$\delta_t^j = \begin{cases} \mathfrak{t}_t - \mathfrak{t}_{t-1} + \delta_{t-1}^j, & \text{if } m_{t-1}^j = 0,\ t > 1 \\ \mathfrak{t}_t - \mathfrak{t}_{t-1}, & \text{if } m_{t-1}^j = 1,\ t > 1 \\ 0, & \text{if } t = 1 \end{cases}$$
where $j = 1, \ldots, n$, $\mathfrak{t}_t \in \mathbb{R}$ is the time elapsed from start when the $t$-th reading is available, and $\mathfrak{t}_1 = 0$.
It is to be noted that the sensor readings may not be available at regular time intervals. Therefore, the sequence of readings is indexed by time $t = 1, 2, \ldots, T$, while the actual timestamps are denoted by $\mathfrak{t}_1, \mathfrak{t}_2, \ldots, \mathfrak{t}_T$. The masking and delta vectors are given as additional inputs to the RNN-ED but are not reconstructed, s.t. only the actual sensors are reconstructed. Therefore, the modified input at time $t$ is the concatenation $\hat{x}_t^{(i)} = [x_t^{(i)}, m_t^{(i)}, \delta_t^{(i)}]$, while the corresponding target to be reconstructed is $x_t^{(i)}$. The loss function ($\mathcal{L}$) of the RNN-ED is also modified accordingly, so that the model is not penalized for reconstructing the missing sensors incorrectly. The contribution of a time series subsequence $x^{(i)}(t - w + 1, t)$ to the loss function is thus given by $e_t^{(i)} = \sum_{t'=t-w+1}^{t} \left\| m_{t'}^{(i)} \odot \left(x_{t'}^{(i)} - x'^{(i)}_{t'}\right) \right\|_2$, where $x'^{(i)}_{t'}$ is the reconstruction of $x_{t'}^{(i)}$ and $\odot$ denotes element-wise multiplication. In effect, the network focuses on reconstructing the available sensor readings only.
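The masking and delta vectors can be computed in a single pass over the readings. The sketch below is illustrative only: it assumes missing values are encoded as `None`, fills them with 0 as in the experiments, and uses a per-step L2 norm for the masked error; the function names `mask_and_delta` and `masked_window_error` are hypothetical.

```python
import math

def mask_and_delta(readings, timestamps):
    """readings: list of length-n lists, with None marking a missing sensor.
    timestamps: elapsed time of each reading (the first one is 0).
    Returns (filled readings, masking vectors, delta vectors)."""
    n = len(readings[0])
    filled, masks, deltas = [], [], []
    for t, x_t in enumerate(readings):
        m_t = [0 if v is None else 1 for v in x_t]
        if t == 0:
            d_t = [0.0] * n
        else:
            gap = timestamps[t] - timestamps[t - 1]
            # Accumulate elapsed time while the sensor stays unobserved.
            d_t = [gap + (deltas[t - 1][j] if masks[t - 1][j] == 0 else 0.0)
                   for j in range(n)]
        masks.append(m_t)
        deltas.append(d_t)
        filled.append([0.0 if v is None else float(v) for v in x_t])
    return filled, masks, deltas

def masked_window_error(targets, recons, masks):
    """Reconstruction error of one window, ignoring missing sensors."""
    total = 0.0
    for x, x_rec, m in zip(targets, recons, masks):
        total += math.sqrt(sum((mj * (a - b)) ** 2
                               for a, b, mj in zip(x, x_rec, m)))
    return total

readings = [[1.0, None], [None, None], [3.0, 5.0]]
filled, masks, deltas = mask_and_delta(readings, [0, 1, 4])
```

In this example the second sensor is never observed before the third reading, so its delta grows with the elapsed time since the start.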

Obtaining HI Curves using Embeddings
Here we describe how the embeddings of time series subsequences are utilized to estimate the health of machines. Since the RNN Encoder captures the important patterns in the input time series subsequences, the embeddings thus obtained can be used to differentiate between normal and degraded regions in the data. We maintain a set of embeddings, $Z_{norm}$, corresponding to the time series subsequences from the normal behavior of all the instances in $I$. As a machine operates, its health degrades over time and the corresponding subsequence embeddings tend to be different from those in $Z_{norm}$. So, we estimate the HI for a subsequence as follows:
$$h_t^{(i)} = \min_{z \in Z_{norm}} \left\| z_t^{(i)} - z \right\|_2 \qquad (5)$$
The HI curve for an instance $i$ obtained from the HI estimates at each time is denoted by $h^{(i)} = [h_1^{(i)}, h_2^{(i)}, \ldots, h_{T^{(i)}}^{(i)}]$. Like the set of normal embeddings $Z_{norm}$, we also maintain a set $H$ containing the HI curves of all instances in $I$.
It is to be noted that HI values are usually assumed to lie between 0 and 1, where 0 means very poor health and 1 means perfect normal health (as shown in Figure 1). The HI as defined in Equation 5 follows the inverse definition, i.e., it is low for normal health and high for poor health (as shown in Figure 9). This can be easily transformed to adhere to the standard range of 0-1 through a suitable normalization/scaling procedure if required, as used in [24].
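As a small illustration of the embedding-distance health index, the sketch below computes the distance from each window embedding to its nearest neighbour in a set of normal embeddings. The min-max rescaling in `normalize_inverted` is only one possible choice of normalization and is an assumption here; all function names are hypothetical.

```python
import math

def health_index(z_t, z_norm):
    """Distance from embedding z_t to the closest normal embedding.
    Low values indicate normal behavior, high values indicate degradation."""
    return min(math.dist(z_t, z) for z in z_norm)

def hi_curve(embeddings, z_norm):
    """HI estimate for each window embedding of one instance, in time order."""
    return [health_index(z, z_norm) for z in embeddings]

def normalize_inverted(curve):
    """Map an embedding-distance curve to [0, 1] with 1 = healthy.
    Min-max scaling is an assumed choice, not prescribed by the text."""
    lo, hi = min(curve), max(curve)
    if hi == lo:
        return [1.0] * len(curve)
    return [1.0 - (h - lo) / (hi - lo) for h in curve]

z_norm = [[0.0, 0.0], [1.0, 0.0]]
curve = hi_curve([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]], z_norm)
scaled = normalize_inverted(curve)
```

The raw curve rises as the embeddings move away from the normal set, while the rescaled curve falls from 1 towards 0, matching the convention of Figure 1.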

RUL Estimation using HI Curves
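We use the same approach for estimating RUL from the HI curve as in [24]. We present it here for the sake of completeness. To estimate the RUL for a test instance $i^*$, its HI curve $h^{(i^*)}$ is compared with the HI curves in $H$. The initial health of a train instance and a test instance need not be the same. We therefore allow for a time-lag $t_D$ in comparing the HI curve of the test instance and the train instance. The similarity between the HI curves of the test instance $i^*$ and a train instance $i \in I$ for a time-lag $t_D$ is given by:
$$s(i^*, i, t_D) = \exp\left(-\frac{1}{\lambda\, T^{(i^*)}} \sum_{t=1}^{T^{(i^*)}} \left(h_t^{(i^*)} - h_{t+t_D}^{(i)}\right)^2\right), \qquad t_D \in \{0, 1, \ldots, \tau\}$$
Here, $\tau$ is the maximum allowed time-lag, and $\lambda > 0$ controls the notion of similarity: a small value of $\lambda$ would imply a large difference in $s$ even when the difference in HI curves is small. The RUL estimate for $i^*$ based on the HI curve for $i$ and time-lag $t_D$ is given by $\hat{R}^{(i^*)}(i, t_D) = T^{(i)} - T^{(i^*)} - t_D$. A weighted average of the RUL estimates obtained using all combinations of $i$ and $t_D$ is used as the final estimate $\hat{R}^{(i^*)}$, and is given by:
$$\hat{R}^{(i^*)} = \frac{\sum s(i^*, i, t_D)\, \hat{R}^{(i^*)}(i, t_D)}{\sum s(i^*, i, t_D)}$$
where the summation is over only those combinations of $i$ and $t_D$ which satisfy $s(i^*, i, t_D) \ge \alpha \cdot s_{max}$, where $s_{max} = \max_{i \in I,\, t_D \in \{0, \ldots, \tau\}} s(i^*, i, t_D)$ and $0 \le \alpha \le 1$.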
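The curve-matching procedure can be sketched as follows. This is a minimal sketch under stated assumptions: the test curve is aligned against the train curve shifted by the lag, lags that would run past the end of a train curve are skipped, and candidates are kept when their similarity is at least a fraction `alpha` of the best similarity. The function names `similarity` and `estimate_rul` are hypothetical.

```python
import math

def similarity(test_hi, train_hi, lag, lam):
    """Similarity between a test HI curve and a train HI curve at a lag.
    Returns None when the shifted test curve does not fit in the train curve."""
    T = len(test_hi)
    if lag + T > len(train_hi):
        return None
    d2 = sum((test_hi[t] - train_hi[t + lag]) ** 2 for t in range(T)) / T
    return math.exp(-d2 / lam)

def estimate_rul(test_hi, train_curves, tau, lam, alpha):
    """Similarity-weighted average of per-(instance, lag) RUL estimates."""
    candidates = []
    for train_hi in train_curves:
        for lag in range(tau + 1):
            s = similarity(test_hi, train_hi, lag, lam)
            if s is None:
                continue
            # Remaining life of the train instance after the aligned segment.
            rul = len(train_hi) - len(test_hi) - lag
            candidates.append((s, rul))
    if not candidates:
        return None
    s_max = max(s for s, _ in candidates)
    if s_max == 0.0:
        return None  # all similarities underflowed; no usable match
    kept = [(s, r) for s, r in candidates if s >= alpha * s_max]
    return sum(s * r for s, r in kept) / sum(s for s, _ in kept)

train = [1.0, 0.9, 0.8, 0.6, 0.4, 0.2, 0.0]   # one failed instance, 7 steps
test = [0.9, 0.8, 0.6]                         # matches train at lag 1
best_only = estimate_rul(test, [train], tau=3, lam=0.01, alpha=1.0)
averaged = estimate_rul(test, [train], tau=3, lam=0.01, alpha=0.0)
```

With `alpha=1.0` only the best-matching alignment contributes, while `alpha=0.0` averages over every admissible alignment, so the threshold trades off selectivity against smoothing.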

EXPERIMENTAL EVALUATION
We evaluate our proposed approach for RUL estimation on two datasets: i) the publicly available C-MAPSS Turbofan Engine dataset [38], ii) a proprietary real-world pump dataset. We use the TensorFlow [1] library for implementing the various RNN models. We present the details of the datasets in Section 5.1. In Section 5.2, we show that the results for embedding distance based approaches for RUL estimation compare favorably to the previously reported results using reconstruction error based approaches [24] on the engine dataset, as well as on the real-world pump dataset. Further, we evaluate the robustness of the embedding distance and reconstruction error based approaches by measuring the effect of additive random Gaussian noise in the sensor readings on RUL estimation in Section 5.3.

Engine dataset.
We use the first dataset from the four simulated turbofan engine datasets from the NASA Ames Prognostics Data Repository [38]. This dataset contains time series of readings for 24 sensors for 100 train instances (train_FD001.txt) of a turbofan engine from the beginning of usage till end of life. There are 100 test instances for which the time series are pruned some time prior to failure, s.t. the instances are currently operational and their RUL needs to be estimated (test_FD001.txt). The actual RULs for the test instances are available in RUL_FD001.txt. Noticeably, each engine instance has a different initial degree of wear such that the initial HI of each instance is likely to be different (implying potential usefulness of τ as introduced in Section 4.3).
We randomly select 80 train instances to train the models. The remaining 20 instances from the train set are used as a validation set to select the parameters. The trajectories for these 20 engines are randomly truncated at five different locations to obtain five different instances from each instance for the RUL estimation task. We use Principal Components Analysis (PCA) [17] to reduce the dimensionality of the data and select the number of principal components (p) to be used based on the validation set.

Pump dataset.
This dataset contains hourly sensor readings for 38 pumps that have reached end of life and 24 pumps that are currently operational. It contains readings over a period of 2.5 years, with each pump having 7 sensors installed on it. The 38 failed instances are randomly split into training, validation and test sets with 70%, 15%, and 15% of the instances, respectively. The 24 operational instances are added to the training and validation sets only for obtaining the RNN-ED model (they are not part of the set H as their actual RUL is not known). The data is notably sparse, with over 45% missing values across sensors. Also, for most pumps the sensor readings are not available from the date of installation but only a few months (average 3.5 months) after the installation date. Depending on the time elapsed, the health degradation level when sensor data is first available for each pump varies significantly. The total operational life of the pumps varies from a minimum of 57 days to a maximum of 726 days.
We downsample the time series data from the original one reading per hour to one reading per day. To do this, we use the following four statistics for each sensor over a day as derived sensors: minimum, maximum, average, and standard deviation, such that there are 28 (=7 × 4) derived sensors for each day. Further, using the derived sensors also helps take care of missing values, which reduce from 45% for the hourly sampling rate data to 33% for the daily sampling rate data. We use masking and delta vectors as additional inputs in this case to train RNN-ED models as described in Section 4.1.1, s.t. the final input dimension is 42 (28 for derived sensors, and 7 each for masking and delta vectors). Unlike the engine dataset where RUL is estimated only at the last available reading for each test instance, here we estimate RUL on every third day of operation for each test instance.
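The daily aggregation can be illustrated with a short sketch. It assumes that missing hourly readings are encoded as `None`, that the population standard deviation is used (the text does not say which variant), and that one mask bit is kept per original sensor, which is one reading of the stated input dimension of 42. The names `daily_features` and `derive_daily_sensors` are hypothetical.

```python
import statistics

def daily_features(hourly):
    """hourly: readings of one sensor over one day, None = missing.
    Returns (min, max, mean, std), or None if the whole day is missing."""
    vals = [v for v in hourly if v is not None]
    if not vals:
        return None
    return (min(vals), max(vals),
            statistics.fmean(vals), statistics.pstdev(vals))

def derive_daily_sensors(day):
    """day: dict mapping sensor name -> list of hourly readings.
    Returns the derived sensor values (4 per sensor, 0-filled when missing)
    and a mask with one entry per original sensor."""
    features, mask = [], []
    for name in sorted(day):
        stats = daily_features(day[name])
        mask.append(0 if stats is None else 1)
        features.extend(stats if stats is not None else (0.0,) * 4)
    return features, mask

day = {"a": [1.0, None, 3.0], "b": [None, None, None]}
features, mask = derive_daily_sensors(day)
```

A derived value is missing only when every hourly reading of that sensor is missing on that day, which is why the fraction of missing values drops after downsampling.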
A description of the performance metrics used for evaluation (taken from [37]) is provided in Appendix A. The hyper-parameters of our model to be tuned are: number of principal components (p), number of hidden layers for RNN-ED (L), number of units in a hidden layer l (c_l) (we use the same number of units in each hidden layer), dropout rate (d), window length (w), maximum allowed time-lag (τ), similarity threshold (α), maximum predicted RUL (R_max), and parameter (λ). The window length (w) can be tuned as a hyper-parameter, but in practice domain knowledge based selection of the window length may be effective.

Embeddings for RUL Estimation
We follow a similar evaluation protocol as used in [24]. To the best of our knowledge, the reconstruction error based model, LR-ED$_2$, reported the best performance for RUL estimation on the engine dataset in terms of timeliness score (refer Appendix B). We compare variants of the embedding distance based approach and the reconstruction error based approach. We refer to the HI curve obtained using the proposed embedding distance based approach as HI$_e$ (refer Section 4.2), and the HI curve obtained using the reconstruction error based approach in [24] as HI$_r$. Here, we refer to the reconstruction error based LSTM-ED, LR-ED$_1$ and LR-ED$_2$ models reported in [24] as Recon-RUL, Recon-LR$_1$, and Recon-LR$_2$, respectively. We compare the following models based on RNNs for the RUL estimation task:
• Embed-RUL vs. Recon-RUL: We compare the RUL estimation performance of Embed-RUL, which uses HI$_e$ curves, and Recon-RUL, which uses HI$_r$ curves.
• Linear Regression models: We learn a linear regression model (as described in Section 3.1) using normalized health index curves HI$_e$ as target and call it Embed-LR$_1$. Embed-LR$_2$ is obtained using squared normalized HI$_e$ as target for the linear regression model. Similarly, Recon-LR$_1$ and Recon-LR$_2$ are obtained based on HI$_r$.
• RNN Regression model: An RNN-based regression model (RNN-Reg.) is directly used to predict RUL (similar to [14]).
5.2.1 Performance on Engine dataset.
We use $\tau_1 = 13$, $\tau_2 = 10$ as proposed in [39] for this dataset (refer Equations 8-11 in Appendix A). The parameters are obtained using grid search to minimize the timeliness score S (refer Equation 8) on the validation set. The parameters obtained for the best model (Embed-LR$_1$) are p = 2, L = 1, c_l = 55, d = 0.2, w = 30, τ = 30, α = 0.95, R_max = 120, and λ = 0.005.
Table 2 shows the performance in terms of various metrics on this dataset. We observe that each variant of the embedding distance based approach performs better than the corresponding variant of the reconstruction error based approach in terms of timeliness score S. Figure 8(a) shows the distribution of errors for the Embed-RUL and Recon-RUL models, and Figure 8(b) shows the distribution of errors for the best linear regression models of the embedding distance (Embed-LR$_1$) and reconstruction error (Recon-LR$_2$) based approaches. The error ranges for reconstruction error based models are more spread out (e.g., -70 to +50 for Recon-RUL) than those of the corresponding embedding distance based models (e.g., -60 to +30 for Embed-RUL), suggesting the robustness of the embedding distance based models. [...] shows the actual RULs and the RUL estimates from Embed-LR$_1$ and Recon-LR$_2$.

5.2.2 Performance on Pump dataset.
The parameters are obtained using grid search to minimize the MSE for RUL estimation on the validation set. The parameters for the best model (Embed-RUL) are L = 1, c_l = 390, d = 0.3, w = 30, τ = 70, α = 0.8, R_max = 150, and λ = 10. The MSE and MAE performance metrics for the RUL estimation task are given in Table 3. The embedding distance based Embed-RUL model performs significantly better than any of the other approaches. It is ≈ 35% better than the second best model (Recon-RUL). The linear regression (LR) based approaches perform significantly worse than the raw embedding distance or reconstruction error based approaches for HI estimation, indicating that the temporal aspect of the sensor readings is very important in this case. Figure 7 shows the actual and estimated RUL values for the pumps with best and worst performance in terms of MSE for the Embed-RUL model.
Qualitative Analysis of Embeddings.
We analyze the embeddings given by the RNN Encoder for the Embed-RUL models. The original dimensions of the embeddings for Embed-RUL for the engine and pump datasets are 55 and 390, respectively. We use t-SNE [21] to map the embeddings to 2-D space. Figure 5 shows the 2-D scatter plot for the embeddings at the first 25% (normal behavior) and last 25% (degraded behavior) points in the life of all test instances. We observe that the RNN Encoder tends to give different embeddings for windows corresponding to normal and degraded behaviors. The scatter plots indicate that normal windows are close to each other and far from degraded windows, and vice versa. Note: Since t-SNE does non-linear dimensionality reduction, the actual distances between normal and degraded windows may not be reflected in these plots.

Robustness of Embeddings to Noise
We evaluate the robustness of Embed-RUL and Recon-RUL for RUL estimation by adding Gaussian noise to the sensor readings. The sensor reading $x_t^{(i^*)}$ for a test instance $i^*$ at any time $t$ is corrupted with additive Gaussian noise to obtain a noisy version $\tilde{x}_t^{(i^*)} \sim \mathcal{N}(x_t^{(i^*)}, \sigma^2 I)$. Table 1 shows the effect of noise on performance for both the engine and pump datasets. For both datasets, the standard deviation of the MSE values over different noise levels is much lower for Embed-RUL compared to Recon-RUL. This suggests that embedding distance based models are more robust to noise compared to reconstruction error based models. Also, for the engine dataset, we observe similar behavior in terms of timeliness score (S): 819±41 for Embed-RUL and 1189±110 for Recon-RUL. Figure 9 depicts a sample scenario showing the health index generated from noisy sensor data. The vertical bar corresponds to 1-sigma deviation in the estimate. The reconstruction error and embedding distance increase over time, indicating gradual degradation. While the reconstruction error based HI varies significantly with varying noise levels, the embedding distance based HI is fairly robust to noise. This suggests that the reconstruction error varies significantly with changes in noise level, impacting HI estimates, while the distance between embeddings does not change much, leading to robust HI estimates.

DISCUSSION
We have proposed an approach for health monitoring via health index estimation and remaining useful life (RUL) estimation. The proposed approach is capable of dealing with several of the practical challenges in data-driven RUL estimation, including noisy sensor readings, missing data, and lack of prior knowledge about degradation trends. The RNN Encoder-Decoder (RNN-ED) is trained in an unsupervised manner to learn fixed-dimensional representations or embeddings to capture machine behavior. The health of a machine is then estimated by comparing the recent embedding with the existing set of embeddings corresponding to normal behavior. We found that our approach using RNN-ED based embedding distances performs better than the previously known best approach using RNN-ED based reconstruction error on the engine dataset. The proposed approach also gives better results on the real-world pump dataset. We have also shown that embedding distance based RUL estimates are robust to noise.
For a test instance $i^*$, the error between the estimated RUL ($\hat{R}^{(i^*)}$) and the actual RUL ($R^{(i^*)}$) is $\Delta^{(i^*)} = \hat{R}^{(i^*)} - R^{(i^*)}$. The timeliness score $S$ used to measure the performance of a model is given by:
$$S = \sum_{i^*=1}^{N} \left(\exp\left(\gamma \cdot |\Delta^{(i^*)}|\right) - 1\right) \qquad (8)$$
where $\gamma = 1/\tau_1$ if $\Delta^{(i^*)} < 0$, else $\gamma = 1/\tau_2$, and $N$ is the total number of test instances. Usually, $\tau_1 > \tau_2$ such that late predictions are penalized more compared to early predictions. The lower the value of $S$, the better is the performance.
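The asymmetry of the timeliness score is easy to see in a small example. The sketch below uses the values τ1 = 13 and τ2 = 10 quoted for the engine dataset as defaults; the function name `timeliness_score` is hypothetical.

```python
import math

def timeliness_score(estimated, actual, tau1=13.0, tau2=10.0):
    """Sum of exponential penalties on RUL errors.
    Early predictions (estimate < actual) use 1/tau1, late ones use 1/tau2."""
    total = 0.0
    for r_hat, r in zip(estimated, actual):
        delta = r_hat - r
        gamma = 1.0 / tau1 if delta < 0 else 1.0 / tau2
        total += math.exp(gamma * abs(delta)) - 1.0
    return total

early = timeliness_score([90], [100])   # underestimates RUL by 10 cycles
late = timeliness_score([110], [100])   # overestimates RUL by 10 cycles
```

For the same absolute error, the late prediction receives the larger penalty, reflecting that overestimating the remaining life is the costlier mistake.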

B BENCHMARKS ON TURBOFAN ENGINE DATASET
We provide a comparison of some approaches for RUL estimation on the engine dataset (test_FD001.txt) below: