Remaining Useful Life Prediction using Gaussian Process Regression Model

This paper evaluates the merits of a multi-variable Gaussian process regression (GPR) model for remaining useful life (RUL) estimation. The paper presents an optimization method that trains the GPR model to find the best combination of kernel type and hyper-parameters. Furthermore, the paper evaluates the performance of the GPR model for small training datasets and for missing input features. A comparison is made to the multi-layer perceptron (MLP) neural network, which forms the basis of deep learning models. To illustrate model performance, an air filter clogging RUL dataset is used. The performance results show that both the GPR and MLP models have similar sensitivity to training set size, but the GPR additionally quantifies the prediction uncertainty. Empirically, the MLP is more robust to a test set with a missing input, while the data suggest that the GPR performs better when the training data also excluded the same input.


INTRODUCTION
Remaining useful life (RUL) of a component is a metric that defines the expected duration over which a component or system can function in accordance with its specifications. Typically, the RUL is estimated in units of time or numbers of cycles until failure. Estimation of RUL is a vital aspect of predictive maintenance and prognostic health management (PHM). Additionally, RUL estimations can also be applied to the analysis of reliability, efficiency, productivity and safety (Hu, Miao, Si, Pan, & Zio, 2022). Due to the wide range of applications, RUL estimation is gaining interest in many different domains of engineering.
As multiple sensors are becoming a commodity on many components and systems, the volume of available sensor data is growing (Aircraft Sensors Market: 2021-28: Industry size, growth report, 2021). This enables the expansion of data-driven techniques in PHM applications. To this end, many researchers have proposed data-driven PHM solutions in recent years (Huang, Di, Jin, & Lee, 2017). In particular, deep learning techniques are attracting significant attention, especially when dealing with highly non-linear and complex features (Y. Wang, Zhao, & Addepalli, 2020). Increasingly, even if the failure mode is only marginally observable, enough data can create the relevant reference signal.
Machine learning methods like deep learning offer advantages over traditional physics-based feature engineering, especially when future behavior can be expected to follow previously observed behavior. Yet, these techniques have some undesirable properties. There are two main challenges with the neural networks (NNs) that are inherent to deep learning. The first challenge is data availability. Deep learning models require large datasets to avoid the problem of under-fitting (Pei, 2021). While operational data is generally abundant, failures are relatively rare. If the RUL differs significantly based on the operating conditions (e.g., pressure, temperature), then the NN will inevitably underfit due to the small dataset. Furthermore, the only way to understand the nature of the underfitting is to acquire more data. The second problem with NNs is that they provide a "black-box" solution. In other words, the functioning mechanism is not transparent and cannot be interpreted easily (Chen et al., 2021). The ambiguous model interpretation may raise confidence issues within the PHM community, especially when dealing with counter-intuitive outcomes. This is especially true for predictions that cannot be explained by the physical knowledge of the system. Some of these concerns may be addressed by using Gaussian process regression (GPR) models. GPR models have the ability to learn a wide range of complex non-linear functions by tuning a small number of hyper-parameters. Since the models have only a few parameters to optimize, they tend to generalize better on small datasets while also determining the confidence bounds of any estimate. This poses a significant advantage over NN models, which contain many parameters and require large datasets to train. Compared to traditional physics-based engineering models, the GPR has the advantage of being able to model complex functions.
Typically, feature engineering models are parametric and do not adapt as well to complex non-linear relationships.
To illustrate the benefits of the GPR model, the RUL of dust filters is examined (Mauthe, Hagmeyer, & Zeiler, 2021). These types of filters are utilized in a variety of industries such as automotive, aviation, and facility management. Dust filters fail once there is sufficient clogging due to the accumulation of dust particles. Since these filters are typically expendable components that require frequent replacement (Eker, Camci, & Jennions, 2013), RUL prediction can improve maintenance efficiency, reduce delays due to failures, and improve system reliability.
This paper is organized as follows. Section 2 begins with a literature review of RUL estimation models. This is followed by an overview of the GPR and MLP models in Section 3. Section 4 covers the dataset used to illustrate the performance of the models as well as the model training approach. In Section 5, the performance metrics for the two models are presented and discussed. Finally, Section 6 summarizes the work and highlights the main points.

LITERATURE REVIEW
This section provides an overview of the GPR models studied in the context of RUL prediction. There are several papers that explore RUL prediction with GPR models. The simplest approaches use a single input parameter, derived either as a single feature or as a combination of contributing features into a single Health Indicator (HI). More complex methods employ tiered GPR models with multiple inputs.
A case study on the RUL of lithium batteries proposes a fuzzy evaluation-Gaussian process regression (FE-GPR) method that combines fuzzy logic with a GPR model. The FE-GPR uses the battery capacity as the single input feature to estimate the RUL (Kang et al., 2020). Another study on the RUL prediction of Electric Energy Metering Equipment (EEME) used a single health index input based on environmental stress (N. Li et al., 2021). In (Liu & Chen, 2019), a novel method that combines HIs with multiple GPR models is presented for forecasting the RUL. The authors present a rather complex framework in which three separate single-feature GPRs first generate HI predictions, which are then used as inputs to a multiple-input-feature GPR model that predicts the RUL. A similar approach is presented in a few other works, where the short-term state of health (SOH) is first estimated using a GPR model. The SOH is then used as an input to another GPR model that predicts the long-term RUL (Jia et al., 2020), (X. Li, Wang, & Yan, 2019), and (X. Li, Yuan, & Wang, 2020). Another interesting multi-step framework first uses a Gaussian process classification (GPC) model to determine whether a component is healthy or degraded based on HIs. Then, another multiple-input GPR model is used to predict the RUL (Benker, Bliznyuk, & Zaeh, 2021). Finally, in a RUL study on lubricating oil, a multiple-output GPR (MO-GPR) model is used that correlates historical degradation trends with current degradation trends (Tanwar & Raghavan, 2020).
The main contribution of this work is the optimization of the GPR model over multiple kernel types for a range of hyper-parameter values. The optimization yields that the best kernel for the analyzed dataset is the Matérn32. The second contribution is the comparison of the GPR model with the MLP NN. The comparison is conducted across various training dataset sizes and when input features are missing.

Gaussian Process Regression
A Gaussian Process (GP) is a generalization of the multivariate Gaussian probability distribution. Whereas a probability distribution describes random variables which are scalars or vectors, the GP as a stochastic process governs the properties of a function (Rasmussen & Williams, 2005). GPR refers to a statistics-based methodology where Gaussian processes are used as prior models for the Bayesian regression functions that are fitted to the observed data (Särkkä, 2019). Given a set of observed data points, there is an infinite number of possible functions that can fit these points. In GPR, the GPs conduct regression by defining a distribution over this infinite number of functions. The GPR then uses the distribution as prior knowledge to make predictions and provide an analysis of the uncertainties. Given an input-output training dataset D = (x_n, y_n), the GPR model is defined as

y_n = f(x_n) + ϵ_n, (1)

where f is drawn from a Gaussian process with mean µ and covariance K, f(x_n) ∼ GP(µ, K), ϵ_n is assumed to be independent, identically distributed Gaussian noise added to the system, and n refers to the n-th observation (Damianou & Lawrence, 2013).
Given training inputs X = [x_1, x_2, ..., x_n] and the corresponding observations y = [y_1, y_2, ..., y_n], the joint distribution of the observed values y and the function values y_* at the test points X_* is defined as (J. Wang, 2021):

[y, y_*]ᵀ ∼ N(0, [[K + σ_n^2 I, K_*], [K_*ᵀ, K_**]]), (2)

where K = K(X, X), K_* = K(X, X_*), K_** = K(X_*, X_*), and K + σ_n^2 I is the variance of (1). If there are n training points and n_* test points, then K_* = K(X, X_*) denotes the n × n_* matrix of the covariances evaluated at all pairs of training and test points; the analogous definitions apply for K and K_**. In other words, the covariance matrix K consists of the covariance function entries K_ij = k(x_i, x_j). The posterior predictive distribution is then defined by the mean and variance functions of the posterior. The mean ȳ_* and variance V[y_*] are evaluated as (J. Wang, 2021):

ȳ_* = K_*ᵀ (K + σ_n^2 I)^{-1} y, (3)
V[y_*] = K_** − K_*ᵀ (K + σ_n^2 I)^{-1} K_*. (4)

As may be observed, the GPR is entirely defined by its mean and covariance (kernel) functions. Without any prior knowledge, the mean is generally assumed to be zero and, if necessary, shifted accordingly after the model is trained; the adjustment of the mean is therefore trivial. The main focus of the GPR design is selecting an appropriate kernel and optimizing its hyper-parameters. The kernel function describes the correlation of the Gaussian process. The most common choice of kernel is the squared exponential (SE), also known as the radial basis function (RBF). Although this kernel performs well in many instances, other kernels may perform better for some datasets. Therefore, the GPR is optimized over five different kernel types to find the kernel that best fits the sample dataset.
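The posterior mean and variance in (3) and (4) can be sketched in a few lines of NumPy. This is an illustrative implementation only, assuming a squared exponential kernel; the function names and default hyper-parameter values are ours, not the paper's:

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared exponential (RBF) covariance between the rows of A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq_dists / length**2)

def gpr_predict(X, y, X_star, sigma_n=0.1):
    """Posterior mean (3) and variance (4) of a zero-mean GP at test points X_star."""
    K = rbf_kernel(X, X) + sigma_n**2 * np.eye(len(X))   # K + sigma_n^2 I
    K_s = rbf_kernel(X, X_star)                          # K(X, X*)
    K_ss = rbf_kernel(X_star, X_star)                    # K(X*, X*)
    K_inv = np.linalg.inv(K)                             # fine for a small sketch
    mean = K_s.T @ K_inv @ y                             # posterior mean, Eq. (3)
    var = np.diag(K_ss - K_s.T @ K_inv @ K_s)            # posterior variance, Eq. (4)
    return mean, var
```

In practice a Cholesky factorization would replace the explicit inverse for numerical stability, but the matrix algebra mirrors (3) and (4) directly.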
The kernels are listed in Table 1, where σ is the standard deviation of the signal, l is the characteristic length scale, α is the positive-valued scale-mixture parameter, and r = |x_i − x_j| is the distance between two input points.
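For reference, the Matérn32 kernel that the optimization later selects can be written out directly; this is a sketch under the standard Matérn 3/2 definition, with our own function and parameter names:

```python
import numpy as np

def matern32(A, B, sigma_f=1.0, length=1.0):
    """Matern 3/2 covariance: sigma^2 * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l)."""
    # Pairwise Euclidean distances r between the rows of A and B
    r = np.sqrt(np.maximum(
        np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T, 0.0))
    s = np.sqrt(3.0) * r / length
    return sigma_f**2 * (1.0 + s) * np.exp(-s)
```

Unlike the infinitely smooth SE kernel, the Matérn 3/2 prior is only once differentiable, which often suits real sensor data better.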

Multilayer Perceptron
The MLP is a type of NN that is created by stacking multiple perceptrons (Géron, 2019). The architecture of the MLP consists of an input layer, one or more hidden layers, and an output layer, as illustrated in Figure 1. The number of perceptrons in the input layer is equal to the number of input parameters. The hidden layers can contain any number of perceptrons and any number of layers. Finally, the number of perceptrons in the output layer is equal to the number of output parameters.

Dataset

To illustrate the performance of the RUL predictions, the analysis is performed on the filter clogging dataset (Mauthe et al., 2021). Almost all industries depend on some form of filtration, and filter clogging tends to be one of the main reasons for filter degradation and removal (Sutherland & Chase, 2011). Therefore, this dataset is a good example of a real-world scenario.

Figure 2. Dataset Preparation
This dataset was generated in a laboratory environment using an air filtration test bench. The dataset records 1) the pressure drop across the filter, 2) the amount of dust introduced, 3) the size of the dust particles (three different sizes), and 4) the flow rate across the filter. These four features are recorded at a regular sampling rate of 10 Hz. The feature measurements and the RUL at a given time instance represent one input-output training pair, as illustrated in Figure 2. The total dataset contains M = 50 filters (components), which are split into N testing and M − N training filters.
Since one of the goals is to test the performance for varying training dataset sizes, the dataset is split into 3 groups, each with a training set representing 20%, 50%, and 80% (i.e., 10 filters, 25 filters, and 40 filters) of the overall dataset. Each group contains 5 randomly selected batches used for cross-validation. For 20% training, the first 10 components in the dataset were selected for training in batch 1 while the remaining 40 were used for validation. For batch 2, the second 10 components were selected for training, and so on. For 80% training, the testing and training sets were reversed from the 20% training batches. Finally, for 50% training, the training sets were selected as follows: the first 25 components (batch 1), the second 25 components (batch 2), components 14-38 (batch 3), components 1-13 and 39-50 (batch 4), and components 6-30 (batch 5).
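A hypothetical sketch of the batch construction, assuming a contiguous sliding window over the filter indices (which matches the 20% scheme described above; the exact 50% batches differ and the helper name is ours):

```python
import numpy as np

def make_batches(n_filters=50, train_frac=0.2, n_batches=5):
    """Split filter indices into (train, test) batches by sliding a
    contiguous window of size train_frac * n_filters, wrapping at the end."""
    n_train = int(n_filters * train_frac)
    ids = np.arange(n_filters)
    batches = []
    for b in range(n_batches):
        start = (b * n_train) % n_filters
        # Wrap-around contiguous window of training components
        train = np.take(ids, np.arange(start, start + n_train), mode='wrap')
        test = np.setdiff1d(ids, train)      # remaining components for validation
        batches.append((train, test))
    return batches
```

For the 80% group, the same helper can be used with train_frac=0.8, which effectively swaps the roles of the 20% splits.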

GPR Model
The GPR model consists of an offline phase, during which the kernel functions and their hyper-parameters are optimized, and an online phase, during which the RUL prediction is computed using (3). The confidence interval is obtained from the variance in (4) as illustrated in Figure 3. Referring to (1), the input vector x contains the four features at a given time instance, while the output y is the RUL of that component.
The optimization over multiple kernels and multiple hyper-parameter values is shown in Figure 4. The kernels from Table 1 are listed on the Kernel Function axis. Furthermore, the Sigma axis represents the standard deviation σ_n of the noise in (1). Finally, the vertical axis represents the objective cost function, which is minimized as part of the optimization. In this case, the log likelihood function defined in (6) should be maximized, so the objective cost function is the negative of the log likelihood function (6). The optimization is complete when either the objective cost function converges or the model reaches its maximum of 15 iterations.
The objective function converges on the Matérn32 kernel as the estimated optimal solution. The squared exponential, which is used as the default kernel in many GPR models, yields a sub-optimal solution compared to the Matérn32. It is possible to separate the length-scale parameter for each predictor; this is known as automatic relevance determination (ARD). The purpose of ARD is to regularize the solution space using a parameterized, data-dependent prior distribution that effectively prunes away redundant or superfluous features (Neal, 2012). The kernels are tested with and without ARD, and the results show that for this dataset the ARD does not improve the performance.
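The offline optimization can be illustrated as a grid search that minimizes the negative log marginal likelihood over kernel types and hyper-parameter values. This is our own minimal sketch with two of the five kernels, not the paper's implementation:

```python
import numpy as np

def rbf(A, B, length):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d / length**2)

def matern32(A, B, length):
    r = np.sqrt(np.maximum(
        np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T, 0.0))
    s = np.sqrt(3.0) * r / length
    return (1.0 + s) * np.exp(-s)

def neg_log_lik(K, y, sigma_n):
    """Negative log marginal likelihood of y under a zero-mean GP."""
    Kn = K + sigma_n**2 * np.eye(len(y))
    L = np.linalg.cholesky(Kn)                 # stable factorization of K + sigma^2 I
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2 * np.pi))

def select_model(X, y, kernels, sigmas, lengths):
    """Grid search: return (best NLL, kernel name, sigma_n, length)."""
    best = None
    for name, k in kernels.items():
        for s in sigmas:
            for l in lengths:
                nll = neg_log_lik(k(X, X, l), y, s)
                if best is None or nll < best[0]:
                    best = (nll, name, s, l)
    return best
```

A gradient-based optimizer over the same objective (as the 15-iteration procedure in the paper presumably uses) would replace the inner loops, but the cost function is identical.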

MLP Training
Referring to Figure 1, the input layer has 4 neurons (one for each feature). There are 3 hidden layers of sizes 150, 100, and 25. Finally, the output layer has 1 neuron which represents the RUL estimate. The MLP network is designed to use a 'relu' activation function, an 'adam' optimizer, a learning rate of 0.0001, and a maximum of 300 iterations.
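Assuming a scikit-learn implementation (the paper does not name the library, so this is a sketch of the described configuration rather than the authors' code):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def build_mlp():
    """MLP as described: 4 inputs, hidden layers (150, 100, 25), 1 output,
    relu activation, adam optimizer, learning rate 1e-4, at most 300 iterations."""
    return MLPRegressor(hidden_layer_sizes=(150, 100, 25),
                        activation='relu',
                        solver='adam',
                        learning_rate_init=1e-4,
                        max_iter=300,
                        random_state=0)
```

The input and output layer sizes are implied by the shapes of the training arrays passed to `fit`, so only the hidden layers are specified explicitly.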

Performance Metrics
The prediction is evaluated using three common performance metrics for RUL: 1) the mean absolute error (MAE), 2) the mean absolute percent error (MAPE), and 3) the root mean square error (RMSE) (Liu & Chen, 2019). The metrics are defined as

MAE = (1/n) Σ_{i=1}^{n} |y_i − y*_i|, (8)
MAPE = (100/n) Σ_{i=1}^{n} |(y_i − y*_i)/y_i|, (9)
RMSE = √((1/n) Σ_{i=1}^{n} (y_i − y*_i)²), (10)

where y_i is the true RUL, y*_i is the predicted RUL, and n is the total number of test samples.
Each of the three performance metrics provides a different insight into the results. The MAE gives less weight to outliers, which means that the metric is less sensitive to them. This also means that it may not adequately reflect the performance when dealing with large error values. On the other hand, the RMSE is heavily influenced by large errors. Furthermore, both the MAE and RMSE are absolute errors specific to the scale of the data. These metrics do not provide any insight into the performance of data at different scales. For such applications it is better to use the MAPE, which is the absolute error normalized by the data. The MAPE generates a metric that can be used to compare results across different scales. However, the MAPE does have drawbacks. True-value data points that are equal to zero have to be excluded from the dataset to avoid dividing by zero. In addition, errors at small values of y_i will have a large bearing on the result.
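The three metrics, including the zero-RUL exclusion for the MAPE discussed above, can be computed as follows (a small sketch; the function name is ours):

```python
import numpy as np

def rul_metrics(y_true, y_pred):
    """Return (MAE, MAPE in percent, RMSE); zero true RULs are excluded from MAPE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err**2))
    nz = y_true != 0                              # avoid division by zero in MAPE
    mape = 100.0 * np.mean(np.abs(err[nz] / y_true[nz]))
    return mae, mape, rmse
```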

Performance Results
The three performance metrics for each batch, as well as the averages and standard deviations (STDEV) over all five batches, are presented in Tables 2-4. The MAE and RMSE are expressed in seconds, while the MAPE is a relative error expressed as a percentage. The average result in all instances, except for the MAPE at 50% training data, shows that the GPR outperforms the MLP model. Referring to Table 3, the largest error comes from batch 1, with a GPR MAPE of 121%. Careful analysis of the data points reveals that the GPR had higher error at small RUL values, which are penalized more heavily due to the normalization. Furthermore, the MAPE for the GPR in batch 1 is significantly higher than the MAPE for the other batches in the same training category (column), thus skewing the average upward. On average, both models perform better when the training dataset is larger, as is expected of data-driven models. However, as the training dataset sizes are reduced, the variation among the different batches increases. Some batches perform poorly while other batches have performances comparable to models with larger training datasets. For instance, referring to the 20% training dataset in Tables 2 and 4, the MAE and RMSE for batch 1 are much higher than for the other four batches. The training dataset for batch 1 primarily contains samples of filters that are operating in environments with large dust particles. Therefore, the model does not learn the correlation of the dust size to the RUL, and performs poorly when presented with a dust size other than the one in the training set. Consequently, the model cannot be expected to perform well on filters that are routinely subjected to smaller dust particles. On the other hand, for batches where the training dataset contains a variety of conditions, the performance improves significantly. Furthermore, the advantage of the GPR over the MLP is that, in conditions where variety exists, the GPR can train a better model with a smaller dataset.
For instance, batch 4 in the 20% training dataset contains a variety of operating conditions. The performance of the GPR for this batch is roughly two times better than that of the MLP. The results show that, on average, both models improve by similar amounts when presented with more training data (last two columns of Tables 6 and 7). Yet, there are differences between the model improvement numbers when examined on a per-batch basis. This variation among the batches is best measured by the standard deviation, which improves for the GPR model as the dataset size increases. However, the STDEV for the MLP does not necessarily improve for larger datasets. In fact, for the MLP model, it is actually smaller for the 50% than for the 80% training dataset.

Predictive Variance
As the focus of this paper is on small datasets, this section analyzes the average predictive variance for the batches trained with the 20% dataset. Using the variance in (4), the 95% confidence interval (CI) can be computed as

CI = ȳ_* ± 1.96 √V[y_*]. (11)

An example of an estimated RUL with the 95% CI is illustrated in Figure 5. This example uses the GPR model trained with 20% of the dataset from batch 1. The predicted RUL in the figure represents a sample component from the testing dataset. In this example, the true value is indicated with the solid straight line while the dashed line is the predicted RUL. The shaded gray region is the 95% CI computed from (11).
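The 95% interval and the fraction of test samples it covers can be computed directly from the posterior mean and variance (an illustrative sketch; the helper names are ours):

```python
import numpy as np

def confidence_interval(mean, var, z=1.96):
    """95% CI around the GPR posterior mean: mean +/- z * sqrt(var)."""
    half = z * np.sqrt(var)
    return mean - half, mean + half

def coverage(y_true, lower, upper):
    """Fraction of true values falling inside the interval."""
    return np.mean((y_true >= lower) & (y_true <= upper))
```

For a well-calibrated model evaluated on data resembling its training distribution, the coverage should be close to 0.95; values well below that are the symptom discussed next.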
Figure 5. Example of a RUL prediction for a sample component.

Table 5 shows the average variance for the 20% training dataset and the percentage of testing samples that are within the 95% CI. As may be observed from the table, the percentage of samples within the 95% CI is less than 95%. If the training dataset were properly balanced with a variety of filter operating conditions, the third column in Table 5 would be 95% or higher. However, none of the five batches contains the right variety of conditions to properly map the entire solution space. This emphasizes the need to carefully select the samples in the training dataset. The largest variance, in batch 4, means that this batch has the biggest variety of input samples, which results in the lowest GPR prediction error in Tables 2-4. This shows that batch 4 is the best batch to train on; the MLP does not provide this information.

Missing Input
Another measure of performance is robustness to missing or reduced input data. In many practical applications, the inputs available to train a model may be limited; accurately detecting particle size over time, for example, may not be feasible. Accordingly, both the GPR and MLP models were evaluated when 1) the dust particle size is contained in the training set but not in the testing set, and 2) the dust particle size is missing from both the training and testing sets. The relative degradation from a baseline performance indicates how sensitive each model type is to missing data. Tables 6 and 7 summarize the results. When the 4-input model expects the particle size but instead receives a constant '0' value, the MLP model degrades the least across the 5 batches. Conversely, when the model is trained with three inputs, omitting the particle size, the GPR model degrades the least compared to the four-input case across the 5 batches.

Table 6. Comparison of the GPR model trained with 4 features (baseline) to 1) the dust size feature missing during testing, 2) the dust size missing during training and testing, 3) the improved performance when the training dataset is increased to 50%, and 4) the improved performance when the training dataset is increased to 80%.
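The two missing-input scenarios can be simulated with simple preprocessing helpers (a sketch; the function names are ours): replacing the feature column with a constant '0' at test time versus removing it entirely before training:

```python
import numpy as np

def zero_feature(X, idx):
    """Scenario 1: simulate a missing sensor at test time by feeding a
    constant 0 in place of the feature column."""
    X = np.array(X, dtype=float, copy=True)
    X[:, idx] = 0.0
    return X

def drop_feature(X, idx):
    """Scenario 2: remove the column entirely, so the model is trained
    (and tested) without that input."""
    return np.delete(np.asarray(X, dtype=float), idx, axis=1)
```

Comparing model error on `zero_feature(X_test, dust_idx)` against a model fit on `drop_feature(X_train, dust_idx)` reproduces the two degradation measurements in Tables 6 and 7.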

CONCLUSION
This work proposed a GPR model to estimate the RUL. The primary focus was on the design of the GPR model, which was trained to find the optimal kernel type and hyper-parameters. The performance of the GPR model was compared to the MLP NN using three different performance metrics. Furthermore, the performance of the models was evaluated for multiple training dataset sizes and for missing data. The results showed that the training dataset size affects both the GPR and MLP similarly, but each model performs differently in cases of missing data. Where the data was expected but not provided, the MLP is empirically less affected; when the model is trained without the missing data as an input, the GPR is less degraded.
In general, the GPR has advantages in that it provides both the estimate and its confidence distribution and requires less tuning effort. Lastly, the analysis revealed the importance of a quality, well-balanced training dataset.