Recurrent Neural Networks for Online Remaining Useful Life Estimation in Ion Mill Etching System

We describe the approach, submitted as part of the 2018 PHM Data Challenge, for estimating time-to-failure or Remaining Useful Life (RUL) of Ion Mill Etching Systems in an online fashion using data from multiple sensors. RUL estimation from multi-sensor data can be considered as learning a regression function that maps a multivariate time series to a real-valued number, i.e. the RUL. We use a deep Recurrent Neural Network (RNN) to learn the metric regression function from multivariate time series. We highlight practical aspects of the RUL estimation problem in this data challenge such as i) multiple operating conditions, ii) lack of knowledge of exact onset of failure or degradation, iii) different operational behavior across tools in terms of range of values of parameters, etc. We describe our solution in the context of these challenges. Importantly, multiple modes of failure are possible in an ion mill etching system; therefore, it is desirable to estimate the RUL with respect to each of the failure modes. The data challenge considers three such failure modes and requires estimating RULs with respect to each one, implying learning three metric regression functions, one corresponding to each failure mode. We propose a simple yet effective extension to existing methods of RUL estimation using RNN based regression to learn a single deep RNN model that can simultaneously estimate RULs corresponding to all three failure modes. Our best model is an ensemble of two such RNN models and achieves a score of $1.91 \times 10^7$ on the final validation set.


INTRODUCTION
With the advent of the Industrial Internet of Things (IIOT) (Xu et al., 2014), large amounts of temporal sensor data are available in (near) real-time, leading to an increasing interest in remote monitoring of equipment. Typically, a large number of sensors are installed across various components and subcomponents of a complex system. This makes manual monitoring of the system extremely challenging. Data-driven approaches can help operators monitor the sensor data and generate suitable alerts, along with potential diagnostics, when a system malfunctions. Building data-driven or machine learning based models for fault detection and prognostics (remaining useful life estimation) from sensor data can help in real-time monitoring of equipment, avoid catastrophic failures, enable condition-based maintenance, and support key engineering decisions, e.g. to improve future manufacturing processes.
Recently, deep Recurrent Neural Networks (RNNs) based on gated units such as Long Short Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) have been successfully used for modeling sequential data. It has been shown that RNNs can model the temporal (sequential) aspect of the sensor data as well as capture the inter-sensor dependencies (Malhotra et al., 2015). RNNs have been used to model behavior of machines based on multi-sensor time series with applications to anomaly and fault detection (Malhotra et al., 2015; Malhotra, Ramakrishnan, et al., 2016; Yadav et al., 2016; Filonov et al., 2016), Remaining Useful Life (RUL) estimation (Malhotra, TV, et al., 2016; Gugulothu et al., 2017; TV et al., 2018), and diagnostics (TV et al., 2017; Gugulothu et al., 2018).
Several approaches for RUL estimation using RNNs have been proposed in the past for various types of equipment, e.g. turbofan engines (Heimes, 2008; Malhotra, TV, et al., 2016; Gugulothu et al., 2017), milling machines (Malhotra, TV, et al., 2016), etc. These approaches can be categorised into two types: supervised and semi-supervised. Supervised approaches model RUL estimation as a metric regression problem where RUL is considered to be a real-valued number and a metric regression function, modeled via a (deep) RNN, is learned to map the time series of sensor data to RUL. Examples of this approach include (Heimes, 2008; Zheng et al., 2017; TV et al., 2018). Semi-supervised approaches first learn a deep RNN based model of normal behavior, which is then used to obtain a health index trend of any instance of a machine. The health index trend of a test instance is compared to that of the historical failed train instances to obtain an estimate of RUL via curve matching (Wang et al., 2008; Malhotra, TV, et al., 2016; Gugulothu et al., 2017). In this work, we adapt and modify the supervised metric regression approaches using deep RNNs. Specifically, we build a metric regression model using deep LSTM networks (LSTM-MR) in a multi-task setting for RUL estimation.
The rest of the paper is organized as follows: In Section 2 we present some related work and provide details of the data challenge in Section 3. We describe briefly the deep LSTM Networks in Section 4 and give details of our approach and experiments in Section 5 and Section 6 respectively, and finally conclude in Section 7.

RELATED WORK
An important class of approaches for RUL estimation is based on trajectory similarity, e.g. Wang et al. (2008); Khelif et al. (2014); Lam et al. (2014); Malhotra, TV, et al. (2016); Gugulothu et al. (2017). These approaches compare the health index trajectory or trend of a test instance with the trajectories of failed train instances to estimate RUL using a distance metric such as Euclidean distance. Such approaches work well when trajectories are smooth and monotonic in nature but are likely to fail when there is noise or intermittent disturbances (e.g. spikes, operating mode changes, etc.), as the distance metric may not be robust to such scenarios (Gugulothu et al., 2017).
Another class of approaches is based on metric regression. Unlike trajectory similarity based methods, which rely on comparison of trends, metric regression methods attempt to learn a function to directly map sensor data to RUL, e.g. Heimes (2008); Benkedjouh et al. (2013); Dong et al. (2014); Babu et al. (2016); Gugulothu et al. (2017); Zheng et al. (2017). Such methods can better deal with non-monotonic and noisy scenarios by learning to focus on the relevant underlying trends irrespective of noise. Within metric regression methods, a few methods consider non-temporal models such as Support Vector Regression for learning the mapping from values of sensors at a given time instance to RUL, e.g. Benkedjouh et al. (2013); Dong et al. (2014). Deep temporal models such as those based on RNNs (Heimes, 2008; Malhotra, TV, et al., 2016; Gugulothu et al., 2017; Zheng et al., 2017) or Convolutional Neural Networks (CNNs) (Babu et al., 2016) can capture the degradation trends better than non-temporal models, and have been reported to perform better. Moreover, these models can be trained in an end-to-end manner without requiring feature engineering.

DATA CHALLENGE DESCRIPTION
The data challenge focuses on predicting time-to-failure (for each of three types of fault) at specific times of an ion mill etching tool.

Ion Mill Etching System
An ion mill etching tool is shown in Figure 1(a). The process of ion mill etching typically consists of the following steps:
• Inserting a wafer into the mill.
• Configuring wafer settings (rotation speed, angles, beam current / voltages, etc.).
• Processing the wafer for a set amount of time.
• Repeating the 2nd or 3rd step for different steps of the recipe.
• Removing the wafer from the mill.
An ion source generates ions that are accelerated through an electric field using a series of grids set at specific voltages. This creates an ion beam that travels and eventually strikes the wafer surface. Material is removed from the wafer when ions hit the wafer surface. The wafer is placed on a rotating fixture that can be tilted at different angles facing the incoming ion beam. The wafer can be shielded from the ion beam until ready for milling operation to commence using a shutter mechanism as shown in Figure 1(b). A Particle Beam Neutralizer (PBN) control system influences the ion beam shape / ion distribution as it travels to the wafer surface. The wafer is cooled by a helium / water system called flowcool. The cooling system passes helium gas behind the wafer at a specified flow rate. The helium gas is indirectly cooled by a water system. The wafer and fixture o-ring separates the flowcool gas from the ion mill vacuum chamber.

Objectives
The objective is to build a model from time-series sensor data collected from various ion mill etching tools operating under various conditions and settings. The goal is to examine the fault behavior of an ion mill etching tool used in a wafer manufacturing process. Many different failure mechanisms such as leaks between flowcool and ion mill chambers, electric grid wear, ion chamber wear, etc. can be present in this system. Predicting the time of these failures can help in condition-based maintenance and in scheduling downtimes of the ion mills for maintenance operations. The problem consists of diagnosing failures (i.e. detecting and identifying them) and determining the time remaining until the next failure (i.e. predicting remaining useful life). Time-to-failure for the following three different failure modes is of interest:
• F1: Flowcool Pressure Dropped Below Limit (FCP Low.

Dataset details
The training data corresponds to 20 tools. The testing data corresponds to a subset containing 5 tools out of the 20 tools. The data consists of 24 columns and 3 fault types. The time period for testing data is after the training data such that there is no overlapping time period. The columns S1-S24 contain sensor data and other process information for tool ids, arranged by timestamp. The various parameters are listed in Table 1. There are 5 categorical variables, 14 numeric operating condition related variables and 5 numeric parameters obtained through sensors installed on the system. The timestamp and type of failure are available for the 20 tools in the training dataset. The data has been anonymized, so the units of measurement for various parameters are not provided.
It is to be noted that the time of failure is actually the time when the operator shuts down the machine for maintenance rather than the time when the actual fault is observed. The actual start of the failure may occur much earlier than the provided failure time. The train folder contains the training data to be used for modeling purposes. The test folder contains the test data that is to be used to generate submissions. The times at which faults occur are found in the train/train faults folder. The number of data points and faults for each tool id are listed in Table 2. Examples of time-to-failure values are provided in the train/train tff folder. There are 'null' (NaN) values where faults do not occur within a specified time horizon. The 20 .csv files under the train folder represent the sensor data that are used as predictors. Each of these files represents a separate ion milling tool. The sensor-wise statistics are provided in Table 3. From Table 3, we can see that the mean and standard deviation of all sensors are very close to 0 and 1 respectively, which indicates that the provided data is Z-normalized.

Scoring functions
The functions for computing the Original Score ($S_1$) and Secondary Score ($S_2$), used during the testing (Phase-1) and validation (Phase-2) phases respectively, are provided in Table 4. In this data challenge, lower scores indicate better performance. A secondary score is used in the validation portion of the contest. The secondary score is similar in nature to the original score. However, the penalty is more severe for false positives and false negatives. The final score ($S$) is the average of the original and secondary scores, and is computed as follows:

$$S = \frac{S_1 + S_2}{2}$$

Challenges
We first highlight a few aspects and challenges in formulating an approach for the 2018 PHM data challenge for RUL estimation of the ion mill etching system, and then describe our approach that can potentially deal with these challenges:
1. Dealing with multiple fault types leading to failures with missing data prior to reported time of failure: The failures considered correspond to three types of faults. Data prior to failures, which is critical for estimating RULs (data points close to failure), is sparse. Average, minimum and maximum missing points before failure shutdowns are provided in Table 5. This issue is further magnified when we consider failure types independently, leading to very few failure instances with sensor data available close to failure time.
2. The nature of evolution of faults over time is not known, i.e. the nature of machine health degradation trends is not known. It may be possible that one or more faults are instantaneous in nature. The time at which the machine is shut down due to a particular type of fault is given. However, the time taken to respond after the observation of symptoms can vary and is not known. For example, it can be seen from Figure 2(a) that there is a sudden drop

Consider $W_{n_1,n_2} : \mathbb{R}^{n_1} \to \mathbb{R}^{n_2}$ to be an affine transform of the form $z \mapsto Wz + b$ for matrix $W$ and vector $b$ of appropriate dimensions. In the case of a multi-layered LSTM network with $L$ layers and $h$ units in each layer, the hidden state $z_t^l$ at time $t$ for the $l$-th hidden layer is obtained from the hidden state at $t-1$ for that layer, $z_{t-1}^l$, and the hidden state at $t$ for the previous $(l-1)$-th hidden layer, $z_t^{l-1}$. The time series goes through a set of transformations iteratively at the $l$-th hidden layer for $t = 1$ through $T$, where $T$ is the length of the time series. In these transformations, the cell state $c_t^l$ is given by $c_t^l = f_t^l c_{t-1}^l + i_t^l g_t^l$, and the hidden state $z_t^l$ is given by $z_t^l = o_t^l \tanh(c_t^l)$. We use dropout for regularization (Pham et al., 2014), which is applied only to the non-recurrent connections, ensuring information flow across time-steps for any LSTM unit. The dropout operator $D(\cdot)$ randomly sets the dimensions of its argument to zero with probability equal to a dropout rate, and $z_t^0$ is the input $x_t$ at time $t$. The sigmoid ($\sigma$) and $\tanh$ activation functions are applied element-wise.
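The gate computations that precede the cell-state update are not spelled out in the text above. A plausible form, following the LSTM variant of Zaremba et al. (2014) and using only the symbols defined in this section, is sketched below. It is a reconstruction under that assumption rather than a confirmed expression; in particular, the input dimension $2h$ of the affine map is assumed, and for the first layer it would be $m + h$ when the input has $m \neq h$ dimensions.

```latex
\begin{equation}
\begin{pmatrix} i_t^l \\ f_t^l \\ o_t^l \\ g_t^l \end{pmatrix}
=
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
W_{2h,4h}
\begin{pmatrix} D(z_t^{l-1}) \\ z_{t-1}^l \end{pmatrix}
\end{equation}
```

Under this form, dropout $D(\cdot)$ acts only on the input arriving from the layer below, while the recurrent input $z_{t-1}^l$ is passed through unchanged, which matches the statement that dropout is applied only to non-recurrent connections.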
In a nutshell, this series of transformations for $t = 1 \ldots T$ converts the input time series $x_{1 \ldots T}$ of length $T$ to a fixed-dimensional vector $z_T^L \in \mathbb{R}^h$. We, therefore, represent the LSTM network by a function $f_{LSTM}$ such that $z_T^L = f_{LSTM}(x; \mathbf{W})$, where $\mathbf{W}$ represents all the parameters of the LSTM network.
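To make the mapping from a time series to a fixed-dimensional vector concrete, the following is a minimal NumPy sketch of a forward pass through a stacked LSTM. It assumes the gate layout of Zaremba et al. (2014) (input, forget, output gates and a candidate $g$); the names `init_params` and `f_lstm`, the weight initialisation, and the zero initial states are illustrative choices, not details taken from the text.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_params(m, h, L, rng):
    """One affine map (W, b) per layer; layer 1 sees m + h inputs, later layers 2h."""
    params = []
    for l in range(L):
        n_in = (m if l == 0 else h) + h
        W = rng.normal(0.0, 0.1, size=(4 * h, n_in))  # illustrative initialisation
        b = np.zeros(4 * h)
        params.append((W, b))
    return params


def f_lstm(x, params, dropout_rate=0.0, rng=None):
    """Map a series x of shape (T, m) to the final top-layer hidden state, shape (h,)."""
    h = params[0][1].shape[0] // 4
    z = [np.zeros(h) for _ in params]  # hidden states z_{t-1}^l (assumed zero at t = 0)
    c = [np.zeros(h) for _ in params]  # cell states c_{t-1}^l (assumed zero at t = 0)
    for x_t in x:
        below = x_t  # z_t^0 is the input x_t
        for l, (W, b) in enumerate(params):
            if dropout_rate > 0.0 and rng is not None:
                # D(.): zero each dimension of the non-recurrent input with prob. dropout_rate
                below = below * (rng.random(below.shape) >= dropout_rate)
            a = W @ np.concatenate([below, z[l]]) + b
            i = sigmoid(a[:h])
            f = sigmoid(a[h:2 * h])
            o = sigmoid(a[2 * h:3 * h])
            g = np.tanh(a[3 * h:])
            c[l] = f * c[l] + i * g        # cell state update
            z[l] = o * np.tanh(c[l])       # hidden state update
            below = z[l]                   # feeds the next layer at the same time step
    return z[-1]
```

With `dropout_rate=0.0` (the inference setting) the function is deterministic, and its output depends on the whole input window through the recurrent states.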

REMAINING USEFUL LIFE ESTIMATION USING MULTI-TASK LSTM-MR
In this section we describe how we formulate the RUL estimation problem. We intend to learn a single mapping function to estimate RUL values for all the fault types simultaneously, and so we model it as a multi-task problem of metric regression tasks. Each metric regression task performs RUL estimation for one of the fault types, and the LSTM-MR network is trained to perform all the tasks at once.

RUL Estimation Problem Formulation
Consider a training set $D = \{x_i^{1 \ldots T}, r_i\}_{i=1}^{N}$ containing multivariate real-valued time series $x_i^{1 \ldots T} = \{x_i^1, x_i^2, \ldots, x_i^T\}$, where $x_i^t \in \mathbb{R}^m$, and the RUL vector corresponding to the last timestep of the time series, $r_i = \{r_i^1, r_i^2, \ldots, r_i^C\}$. Here $m$ is the number of sensors (input parameters), $T$ is the length of the time series, and $r_i^j \in \mathbb{R}$ with $j = 1, \ldots, C$, where $C$ is the number of faults, such that there is one RUL value corresponding to each fault type. In general, the $N$ instances are obtained from data across tools by suitable windowing and normalization as described in detail in Section 6. The goal is to learn a mapping function $f_{MR}$ that maps any given multivariate time series $x_i^{1 \ldots T}$ to a vector of RUL estimates $\hat{r}_i$ corresponding to the ground truth $r_i$. In the following subsection, we describe simple pre-processing done on the time series before inputting it to the model.

Pre-processing
In pre-processing, first, since we have to deal with time series of large lengths, we perform standard downsampling by a factor of $d_1$. The downsampled time series is indexed by $k = 1, 2, 3, \ldots, T_1$, where $T_1 = T/d_1$. Here $\tilde{x}_i^k$ is an $m$-dimensional vector representing the value at the $k$-th timestep of the downsampled time series of length $T_1$. In the second step, we further reduce the time series length by a Time series to Vector (T2V) operation. In T2V, we decrease the sequence length by a factor of $d_2$ but, unlike in the usual downsampling, also increase the dimensionality of the resultant time series by the same factor $d_2$. This yields a shorter sequence while retaining all the information in the time series. Also, we assume that for a small $d_2$, we do not lose any significant temporal information.
Formally, T2V maps the downsampled time series to a time series $y_i^j$ with $j = 1, 2, 3, \ldots, T_2$ and $T_2 = T_1/d_2$. Here $y_i^j$ is an $m \cdot d_2$-dimensional vector representing the value at the $j$-th timestep of the resultant time series of length $T_2$. Effectively, through the two pre-processing steps, a time series window of length $T$ is converted to a time series of length $T/(d_1 \cdot d_2)$, making it computationally more efficient to be processed by the LSTM-MR network.
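The two pre-processing steps can be sketched as follows. The sketch rests on two assumptions that the text does not confirm: downsampling keeps every $d_1$-th point (the text says only "standard downsampling"), and T2V concatenates $d_2$ consecutive downsampled vectors into one vector. The function names are illustrative.

```python
import numpy as np


def downsample(x, d1):
    """Keep every d1-th timestep of x with shape (T, m). Assumed decimation scheme."""
    return x[d1 - 1::d1]


def t2v(x, d2):
    """Time series to Vector: stack d2 consecutive m-dim vectors into one (m * d2)-dim vector."""
    T1, m = x.shape
    T2 = T1 // d2                      # any trailing remainder is dropped
    return x[:T2 * d2].reshape(T2, d2 * m)


def preprocess(x, d1, d2):
    """Window of shape (T, m) -> series of shape (T / (d1 * d2), m * d2)."""
    return t2v(downsample(x, d1), d2)
```

For a window of length $T = 2000$ with $m = 9$ inputs, $d_1 = 2$ and $d_2 = 10$, `preprocess` returns an array of shape `(100, 90)`: the sequence seen by the LSTM is 20 times shorter, while every downsampled value is still present in the input.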
We impose an upper bound $r_u$ on any RUL value $r_i^j$ as, in practice, it is not possible to provide meaningful estimates of RUL too early in the life of a tool, e.g. if the tool is in perfect health and there are no symptoms of degradation. So if any $r_i^j \geq r_u$, it is clipped to $r_u$. We describe our training and inference procedures using $f_{MR}$ in the next subsection.

Model training
We train an LSTM-MR network to estimate the RUL vector $\hat{r}_i$, given an input time series $y_i^{1 \ldots T_2}$. At the output of the network we have $C$ linear units estimating the remaining useful life values for the $C$ fault types. The network is trained using the standard MSE loss function $L(r_i, \hat{r}_i)$.
Here $\mathbf{W}$ represents all the parameters of the LSTM network, and $W_O$ and $b_O$ represent the weights and biases of the output linear layer, which maps the LSTM network's output to RUL estimates. The total loss function can be given by $\frac{1}{N} \sum_{i=1}^{N} L(r_i, \hat{r}_i)$. We minimize this loss using stochastic gradient descent and the standard backpropagation through time for RNNs.
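One form of the output layer and per-instance loss that is consistent with this description is sketched below. Averaging over the $C$ outputs (rather than summing) is an assumption; the text specifies only that a standard MSE loss is applied to $C$ linear output units with parameters $W_O$ and $b_O$.

```latex
\begin{align}
\hat{r}_i &= W_O \, f_{LSTM}\left(y_i^{1 \ldots T_2}; \mathbf{W}\right) + b_O \\
L(r_i, \hat{r}_i) &= \frac{1}{C} \sum_{j=1}^{C} \left( r_i^j - \hat{r}_i^j \right)^2
\end{align}
```

Because all $C$ outputs share the same LSTM parameters $\mathbf{W}$, the gradient of every fault-specific error term flows into one shared representation, which is what makes this a multi-task formulation.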
From equation 2 and the pre-processing described in Section 5.2, we can concisely represent the entire process as a single function $f_{MR}$. Once trained, $f_{MR}$ can be used in an online fashion. For a current time instance $t$, the latest $T$ points (corresponding to time instances $t - T + 1, \ldots, t$) can be input to $f_{MR}$ to estimate the RUL values for the $C$ fault types. The process of training and inference using $f_{MR}$ is shown in Figure 4.
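The online use of $f_{MR}$ can be sketched with a fixed-length buffer holding the latest $T$ points. The class name `OnlineRULEstimator` and the convention of returning `None` until $T$ points have been observed are assumptions made for illustration; `f_mr` stands for any trained callable that maps an array of shape `(T, m)` to $C$ RUL estimates.

```python
from collections import deque

import numpy as np


class OnlineRULEstimator:
    """Feeds the latest T sensor readings to a trained model f_mr at each time step."""

    def __init__(self, f_mr, T):
        self.f_mr = f_mr                # hypothetical trained model: (T, m) array -> (C,) array
        self.buffer = deque(maxlen=T)   # oldest reading is dropped automatically

    def update(self, x_t):
        """Add the reading at the current time; return RUL estimates once T points exist."""
        self.buffer.append(np.asarray(x_t, dtype=float))
        if len(self.buffer) < self.buffer.maxlen:
            return None                 # not enough history yet (assumed convention)
        window = np.stack(list(self.buffer))   # readings for times t - T + 1, ..., t
        return self.f_mr(window)
```

Each call to `update` costs one forward pass over a window of length $T$, so the estimate can be refreshed at every new reading without retraining.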

EXPERIMENTAL EVALUATION
In this section, we present our experimental setup which includes the details of pre-processing and choosing values for various parameters.We later present the results of our LSTM-MR approach.

Experimental Setup
In the training set, we are provided with data for 20 tools. According to the functional description of the apparatus, we infer that the etching process is carried out only when the parameter fixture shutter position takes a value of 1. So, for all the tools, we consider only those points for which fixture shutter position is equal to 1, and use only the 9 sensors marked in Table 1 for model building. We then split the data of each tool into overlapping windows of length 2000 ($T = 2000$) points with an overlap of 500 points. Using the time-to-failure values given for the tools in the train set, we obtain the target RUL vector $r_i$ for each of the windows. We use a value of $r_u = 500$. We had around 0.1M windows, of which only around 150 had at least one of $r_i^1, r_i^2, \ldots, r_i^C$ less than $r_u$. To handle this imbalance in the target RUL values during training, we retain all the windows for which at least one of $r_i^1, r_i^2, \ldots, r_i^C$ is less than $r_u$ and randomly sample from the remaining set of windows to form our complete training set. For training, we divide the clipped target RUL values by $r_u$ to normalize them, such that the target RUL values lie in the $[0, 1]$ range. After estimating $\hat{r}_i$, RUL values on the original scale are obtained by multiplying with $r_u$. According to the dataset we have three fault types and hence $C = 3$. For the two steps described in Section 5.2, we use $d_1 = 2$ and $d_2 = 10$ to pre-process the time series before inputting them to the model.
For the LSTM network, we choose the number of units $h$ from the set $\{50, 100, 150, 200\}$ and use $L = 2$ layers. We use early stopping with a maximum of 2500 iterations of training with a batch size of 32. We also use dropout (Zaremba et al., 2014) with a value of 0.4 over the feedforward connections for regularization, and use the Adam optimizer (Kingma & Ba, 2014) for optimizing the weights of the networks with an initial learning rate of 0.005 for all our experiments. We chose the best architecture (by varying the number of hidden units $h$) as the one with minimum RMSE on a labeled hold-out set (taken from the training set of 20 tools). We present the results of our approach in the next subsection.
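The windowing, target normalization and rebalancing described above can be sketched as follows. Several details are assumptions not stated in the text: NaN time-to-failure values are treated as "at least $r_u$", the fraction of far-from-failure windows that is kept (`keep_fraction`) is a hypothetical parameter, and each series is assumed to contain at least $T$ points. The function names are illustrative.

```python
import numpy as np


def make_windows(series, ttf, T=2000, overlap=500):
    """series: (n, m) sensor data; ttf: (n, C) time-to-failure, NaN if no fault in horizon.
    Returns windows of shape (N, T, m) and the targets (N, C) at each window's last timestep.
    Assumes n >= T so that at least one window exists."""
    stride = T - overlap
    ends = range(T, len(series) + 1, stride)
    X = np.stack([series[e - T:e] for e in ends])
    R = np.stack([ttf[e - 1] for e in ends])
    return X, R


def normalize_targets(R, r_u=500.0):
    """Clip RUL targets to r_u and scale to [0, 1]. NaN -> r_u is an assumption."""
    R = np.where(np.isnan(R), r_u, R)
    return np.minimum(R, r_u) / r_u


def rebalance(X, Y, keep_fraction, rng):
    """Keep every window with some normalized target below 1; subsample the rest."""
    near = (Y < 1.0).any(axis=1)
    far_idx = np.flatnonzero(~near)
    n_keep = int(round(keep_fraction * len(far_idx)))
    kept_far = rng.choice(far_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([np.flatnonzero(near), kept_far]))
    return X[idx], Y[idx]
```

Predictions on the normalized scale are mapped back by multiplying with `r_u`, mirroring the division in `normalize_targets`.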

Results
We have evaluated the performance of our LSTM-MR approach using $f_{MR}$ on the data sets provided in the test (phase-1) and validation (phase-2) phases of the challenge. We restrict our RUL estimations between 100 and 150 and any RUL estimate falling below 100 or above 150

Along with LSTM-MR, we also present results for LSTM-MR-Ensemble, which is an ensemble of the two best models, say $M_1$ and $M_2$, picked on the basis of a hold-out set. For LSTM-MR-Ensemble, if either $M_1$ or $M_2$ estimates NaN, we consider the final RUL estimate to be NaN; otherwise, we take the maximum of the estimates from $M_1$ and $M_2$.
From Table 8, we can see that in both the test and validation phases, LSTM-MR-Ensemble performs better than LSTM-MR. It can be noted from Table 4 that, when the ground truth is actually a number, $S_2$ for estimating the RUL to be a number is mostly higher compared to estimating it to be NaN. This is the reason for the significant difference, in terms of $S_2$, between LSTM-MR-Ensemble and LSTM-MR on the validation phase data set. Also, being an ensemble, LSTM-MR-Ensemble estimates more NaN values than LSTM-MR (as denoted by $P_{test}$ and $P_{validation}$ in Table 8) and hence produces fewer false positives.
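The ensemble rule can be sketched in a few lines. The `ensemble` function follows the rule as stated (NaN if either model outputs NaN, otherwise the maximum). The `restrict_range` function is an assumption: the text does not say how estimates outside the range 100 to 150 are handled, and reporting them as NaN is one plausible reading. Both function names are illustrative.

```python
import numpy as np


def restrict_range(r_hat, low=100.0, high=150.0):
    """ASSUMED handling: estimates outside [low, high] are reported as NaN."""
    r_hat = np.asarray(r_hat, dtype=float)
    out = r_hat.copy()
    out[(r_hat < low) | (r_hat > high)] = np.nan
    return out


def ensemble(r1, r2):
    """NaN if either model estimates NaN; otherwise the larger of the two estimates."""
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    return np.maximum(r1, r2)  # np.maximum propagates NaN from either argument
```

Because a NaN from either model forces a NaN in the ensemble, the ensemble can only produce the same number of numeric estimates as each individual model or fewer, which is consistent with the observation that it makes fewer non-NaN predictions.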

CONCLUSION
We have described the approach used for the 2018 PHM Data Challenge. We have highlighted the challenges in the supervised learning based approach, such as missing data close to failures, noisy labels for failures due to lack of knowledge of onset of failure, learning to estimate very large RUL values (very early prediction of failure), capturing temporal dependencies from very long multivariate time series, dealing with multiple operating conditions, etc., and described our solution in the context of these challenges. Our approach leverages deep recurrent neural networks to learn a supervised model for remaining useful life estimation for three types of failure modes in an ion mill etching system. It uses data from all tool ids and failure modes to learn one general-purpose model for all tools and failure modes. Importantly, the proposed approach is able to estimate RUL in an online fashion. We also found that an ensemble of RNN models significantly improves the results. In future, it will be interesting to explore whether a semi-supervised approach (e.g. as in Malhotra, Ramakrishnan, et al. (2016); Malhotra, TV, et al. (2016)) can be used to mitigate the lack of knowledge of the time of onset of failure by finding the change points, i.e. the time instances when the health index starts decreasing or the first symptoms of failure appear, so that the RNN model can then be used to estimate RUL only after the change points.

Figure 1. Ion Mill Etching System and Process Overview.
Figure 2. Sample time series plots with fault signatures. Image best viewed in color.

Figure 4. Approach Overview with Training and Inference Phases. Here, T2V: Time series to Vector operation.

Figure 5. True positives to False positives ratio with respect to estimated RUL ranges.

Table 1. Sensors used in model building.

Table 2. Number of data points and faults for each tool id. Columns: Tool-ID, Total Points (/$10^6$), F1, F2, F3.

3. Lack of knowledge of exact time of onset of failures in the training set: Despite having access to a large number of failure instances, the onset of failure is very challenging to identify as there is a large variance in the time between onset of a failure and the shutdown of a machine. This can be seen, for example, in Figure 2(b), where there is a gradual increase in Flowcool Pressure as we move towards failure.
4. Extremely large range of possible RUL values: It is well-known that it is difficult to estimate the remaining useful life unless there is at least one symptom of an approaching failure or an onset of failure. In this dataset, the average time between two failures is $60.82 \times 10^4$, with the maximum time between two failures being as large as $22.26 \times 10^6$. It is therefore challenging to model such a large variance in the RUL values. The possible range of RUL values for each tool id is listed in Table 6.
5. Sequence of faults in a given tool: Several shutdowns caused by different types of faults are reported for each tool. Therefore, it is difficult to model the normal operational behavior of a tool, as a tool may be normal with respect to one fault type but may be depicting abnormal behavior or symptoms with respect to another fault type.
6. Multiple operating conditions: There are various parameters such as fixture shutter position, stage, recipe, and recipe steps that determine the operating condition. Each parameter can have a large number of values (one at a time). The number of possible combinations of all parameters' values (all operating conditions) is large. The percentage of data points with respect to shutter position is listed in Table 7.

BACKGROUND: DEEP LSTM NETWORKS
We use a variant of LSTMs as described in Zaremba et al. (2014) in the hidden layers of the neural network. Intuitively, an LSTM unit maintains a cell state using an input gate, a forget gate, and an output gate: at a given time step, the input gate decides what should be added to the cell state, the forget gate decides what should be removed from the cell state, and the output gate decides what part of the cell state should be given as an output from the LSTM unit. Hereafter, we denote column vectors by bold small letters and matrices by bold capital letters. For a hidden layer with $h$ LSTM units, the values for the input gate $i_t$, forget gate $f_t$, output gate $o_t$, hidden state $z_t$, and cell state $c_t$ at time $t$ are computed using the current input $x_t$, the previous hidden state $z_{t-1}$, and the cell state $c_{t-1}$, where $i_t$, $f_t$, $o_t$, $z_t$, and $c_t$ are real-valued $h$-dimensional vectors such that $z_t = f(x_t, z_{t-1}, c_{t-1})$ as given by Equations 1.

Table 4. Scoring functions used in test and validation phases. Here, $R$: ground truth TTF, $\hat{R}$: submission TTF.

Table 7. Percentage of points with respect to shutter position.

Table 8. Performance of LSTM-MR and LSTM-MR-Ensemble in terms of $S_1$ and $S_2$ on test and validation phase data sets. $P_{test}$ and $P_{validation}$ denote the number of non-NaN estimations made on the test and validation phase data sets respectively. $S_1$, $S_2$ and $S$ are scores as described in Section 3.4. Lower scores indicate better performance.
Columns: Approach, $P_{test}$, $P_{validation}$, $S_1$ on test, $S_1$ on validation, $S_2$ on validation, $S$ on validation.