Automated Hyper-parameter Tuning for Machine Learning Models in Machine Health Prognostics

Recent studies have revealed the success of data-driven machine health monitoring, which motivates the use of machine learning models in machine health prognostic tasks. While the machine learning approach to health monitoring is gaining importance, the construction of machine learning models is often impeded by the difficulty of choosing the underlying hyper-parameter configuration (HP-config), which governs the construction of the machine learning model. While an effective choice of HP-config can be achieved with human effort, such an effort is often time-consuming and requires domain knowledge. In this paper, we consider the use of Bayesian optimization algorithms, which automate an effective choice of HP-config by solving the associated hyper-parameter optimization problem. Numerical experiments on the data from the PHM 2016 Data Challenge demonstrate the salience of the proposed automatic framework, and exhibit improvement over the default HP-configs in standard machine learning packages as well as HP-configs chosen by a human agent.


INTRODUCTION
With the prevalence of machine learning and the availability of low-cost sensors, data-driven machine health monitoring is gaining importance in modern manufacturing systems. While popular machine learning models, such as deep neural networks and random forests, give rise to highly accurate predictive models, the success of these models hinges on judicious choices of their underlying hyper-parameter configurations (HP-configs), which govern the construction of these models.
For example, in the case of deep neural networks, an HP-config corresponds to the choice of network architecture, such as the number of layers and the number of neurons in each layer, and the choice of stochastic gradient descent algorithms for training a network, such as Adagrad or ADAM. After the network architecture and the training algorithm are chosen, the decision maker optimizes the network weights in order to minimize the prediction error on a training dataset. While the decision maker desires to minimize the prediction error, the error crucially depends on the choice of the HP-config. Unfortunately, the size of the hyper-parameter space for a machine learning model is often too big for a brute-force search for a competent HP-config. The identification of a competent HP-config is often based on the experience of the decision maker, and the identification can be daunting and time-consuming for a new machine prognostic task.

(Cheung et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
We develop an automatic framework that identifies competent hyper-parameter configurations without conducting a brute-force search on the underlying hyper-parameter spaces. Our framework is based on Bayesian optimization algorithms, which allow quick convergence to a competent hyper-parameter configuration. A Bayesian optimization algorithm involves the construction of a stochastic surrogate function for the performance of every hyper-parameter configuration. We implement our automatic framework on the PHM 2016 Data Challenge task (PHM, 2016) for fine-tuning popular machine learning models such as the random forest regression model and multi-layer perceptrons. We witness a reduction in prediction errors under our automatic framework, in comparison to the machine learning models constructed using the default HP-configs postulated by practitioners in standard machine learning packages (Pedregosa et al., 2011).

Literature Review
There has been a line of research studying prognostic health modeling with machine learning technology. Baptista et al. (Baptista et al., 2016) leveraged Support Vector Machines (SVMs) to estimate the lifetime for maintenance in aeronautics. Experimental results show that the proposed method can give better estimates than the traditional autoregressive moving average (ARMA) method. Liu et al. (Liu, Zuo, & Qin, 2016) also leveraged SVMs to perform health state assessment and remaining useful lifetime (RUL) prediction modeling for rolling element bearings. The health states are classified into multiple classes, and an individual regression model is built separately for each class to predict the RUL. Deutsch et al. (Deutsch & He, 2016) leveraged the Restricted Boltzmann Machine (RBM) to model vibration data for predicting the RUL of bearings. The feature vectors are manually designed, with the root mean square function capturing the bearings' degradation over time. However, their deep learning approach achieved lower accuracy than the particle filter based approach. One way to improve the accuracy is to use a stacked RBM structure with hyper-parameter tuning. Chen et al. (Chen, Lucas, Lee, & Buehner, 2015) leveraged a neural network to predict gear failure. Similarly, Yang et al. (Yang et al., 2016) leveraged neural networks and ensembles of multiple models to predict the remaining useful lifetime of electrical machines.
Another line of research leveraged deep learning for fault diagnosis and prognostics. He et al. (He, He, & Bechhoefer, 2016) proposed a deep learning approach for feature extraction, called the large memory storage retrieval neural network (LAMSTAR), to perform bearing fault diagnosis. Babu et al. (Babu, Zhao, & Li, 2016) adopted the Convolutional Neural Network (CNN) for RUL estimation in prognostics. The proposed deep architecture is claimed to learn high-level features efficiently from the low-level raw sensor signals, which can result in higher accuracy for RUL estimation. Gugulothu et al. (Gugulothu et al., 2017) proposed to use Recurrent Neural Networks (RNNs) for predicting the remaining useful life of engines and pumps.
However, none of these previous works considered the hyper-parameter tuning of the machine learning algorithms used for training the models.
In recent years, automated hyper-parameter tuning has been gaining importance in the machine learning and artificial intelligence literature. A basic way to automatically tune hyper-parameters is the Random Search algorithm (Bergstra & Bengio, 2012), while more sophisticated Bayesian optimization algorithms such as SMAC (Hutter, Hoos, & Leyton-Brown, 2011) and GP (Srinivas, Krause, Kakade, & Seeger, 2010) have also been proposed in the literature. For a survey on hyper-parameter tuning and its applications, the readers are welcome to consult (Shahriari, Swersky, Wang, Adams, & de Freitas, 2016).

Organization and Notations.
In Section 2, we introduce the concept of hyper-parameters in constructing machine learning models. In Section 3, we define the hyper-parameter optimization problem, and outline Bayesian optimization algorithms and the Random Search algorithm, which solve the optimization problem to near-optimality efficiently. In Section 4, we consider the PHM 2016 Data Challenge (PHM, 2016), and evaluate the effectiveness of the algorithms in automatically searching for a good hyper-parameter configuration. Finally, we conclude in Section 5. Throughout the manuscript, we denote R as the set of real numbers, and R≥0 as the set of nonnegative real numbers.

HYPER-PARAMETERS FOR MACHINE LEARNING
In this Section, we define the notion of a hyper-parameter configuration for a machine learning (ML) model, and provide illustrative examples with well-known ML models. Then, we define the notion of out-of-sample validation error, which paves the way to defining the hyper-parameter optimization problem.

Machine Learning Models and Hyper-parameters
Consider a supervised machine learning (ML) task involving a training dataset {(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, where X_i^tr ∈ X ⊆ R^D is the feature vector of the ith training sample, y_i^tr ∈ Y ⊆ R is the label of the ith training sample, and N_tr is the number of training samples. In the context of machine health monitoring, the feature vector X could be the time series of sensor data measured on a machine part over a time interval, and the label y could be the corresponding remaining useful life. The goal of an ML task is to construct an ML model M : X → Y, so that the output M(X) accurately predicts the corresponding label y. The ML model M belongs to a parameterized class of ML models, and the construction is to identify an effective choice of the parameters that makes the ML model M accurate.
The construction procedure A of an ML model M crucially depends on the training dataset {(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, as well as the underlying HP-config c. An HP-config c governs the way M is constructed. For example, an HP-config c determines the pre-processing procedure on the training dataset, and the algorithmic details in determining M, which often involve solving an optimization problem. Altogether, we express the procedure as a function on both the training dataset and the HP-config:

    M = A({(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, c).

To provide a more concrete discussion, we illustrate the HP-configs involved in linear regression models and multi-layer perceptron models.
Linear regression: A linear regression model M_LR is parameterized by a vector θ ∈ R^D. For a given feature vector X ∈ X ⊆ R^D, the model postulates a linear prediction, i.e. M_LR(X) = θ^T X, which is the dot product between X and θ.¹ With a given training dataset {(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, the vector of parameters θ is the optimal solution of the following empirical risk minimization problem:

    min_{θ ∈ R^D}  Σ_{i=1}^{N_tr} (θ^T X_i^tr − y_i^tr)² + κ ||θ||_p.    (1)

The regularization term κ ||θ||_p, where κ ∈ R≥0, serves to stabilize the optimal solution θ. For example, the choice of p = 1 corresponds to the LASSO regularization, and the choice of p = 2 corresponds to the Euclidean regularization.
The HP-config for constructing M_LR is c_LR = (κ, p), and the construction procedure A_LR({(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, c_LR) involves solving the optimization problem (1) for the parameter vector θ.
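As an illustration of how the HP-config c_LR = (κ, p) governs the construction, the following sketch (our own illustration, not part of the original study) maps c_LR onto Scikit-Learn estimators: up to scaling conventions in the objective, κ plays the role of the `alpha` regularization weight, p = 1 selects LASSO, and p = 2 selects ridge (Euclidean) regularization.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def build_linear_model(X_tr, y_tr, hp_config):
    """Construct a regularized linear model from HP-config c_LR = (kappa, p).

    p = 1 -> LASSO regularization, p = 2 -> Euclidean (ridge) regularization;
    kappa maps onto scikit-learn's `alpha` regularization weight
    (up to scaling conventions in the empirical risk).
    """
    kappa, p = hp_config
    model = Lasso(alpha=kappa) if p == 1 else Ridge(alpha=kappa)
    model.fit(X_tr, y_tr)
    return model

# Hypothetical training data for illustration only.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 5))
y_tr = X_tr @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + 0.01 * rng.normal(size=100)
model = build_linear_model(X_tr, y_tr, hp_config=(0.1, 2))
print(model.coef_.shape)  # (5,)
```

Changing the HP-config, e.g. to (0.1, 1), swaps the regularization and hence the entire construction procedure, without touching the training data.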
Multi-layer perceptrons: A multi-layer perceptron (MLP) model M_MLP is a feed-forward neural network, which inputs a feature vector X and outputs a prediction M_MLP(X) for the label y. The model M_MLP is parameterized by the edge weights in the network. Each internal node in the network carries an activation function, which is a uni-variate nonlinear function that takes as input a linearly weighted sum of the preceding layer's outputs, and returns a scalar value for the next layer.
Compared to the case of M_LR, the construction of the ML model M_MLP involves a more complicated HP-config c_MLP. The HP-config c_MLP consists of HPs in the following two categories. The first category concerns the architecture of the network, such as the number of layers, the number of nodes in each layer, and the type of activation function in each layer. The second category concerns the training procedure of the MLP model. It is well known that an MLP is trained by the backward propagation algorithm A_MLP, a stochastic gradient descent algorithm that incrementally adjusts the network weights in a series of iterations. Relevant HPs include the mini-batch size, the learning rate, etc. Finally, while the discussions above focus on the HPs regarding the algorithmic details in determining an ML model, an HP-config could also contain hyper-parameters regarding the pre-processing of the training dataset, as shown in our study on the PHM 2016 prognostic challenge in Section 4.
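To make the two categories of HPs concrete, one possible encoding of c_MLP (the dictionary keys and values below are our own illustration) uses Scikit-Learn's `MLPRegressor`, whose constructor arguments cover both the architecture HPs and the training HPs:

```python
from sklearn.neural_network import MLPRegressor

def build_mlp(hp_config):
    """Instantiate an MLP from an HP-config dict covering both categories:
    architecture HPs (layer sizes, activation function) and training HPs
    (mini-batch size, learning rate)."""
    return MLPRegressor(
        hidden_layer_sizes=tuple(hp_config["hidden"]),  # architecture HP
        activation=hp_config["activation"],             # architecture HP
        batch_size=hp_config["batch_size"],             # training HP
        learning_rate_init=hp_config["lr"],             # training HP
        max_iter=200,
        random_state=0,
    )

# A hypothetical HP-config c_MLP for illustration only.
mlp = build_mlp({"hidden": [64, 32], "activation": "relu",
                 "batch_size": 16, "lr": 0.005})
print(mlp.hidden_layer_sizes)  # (64, 32)
```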

Out-of-Sample Validation Error
As exemplified in the preceding examples, the choice of an HP-config has a profound impact on the resulting machine learning model. More precisely, an HP-config c (together with the training data) determines the resulting machine learning model M, whose prediction accuracy we wish to evaluate on a validation dataset. We evaluate the effectiveness of an HP-config c by considering its out-of-sample validation error, which is denoted as OOSV(c).

(¹ In the linear regression model, θ is appended by a y-intercept θ_0, and each feature vector X is appended by 1.)
To define OOSV(c), we first denote ℓ : Y × Y → R≥0 as the prediction error function, where ℓ(M(X), y) is the prediction error of M(X) on the true label y. For example, for ℓ(y, y') = |y − y'|, the error ℓ(M(X), y) is the absolute error of the prediction M(X) on the label y.
Suppose that we are given a training dataset {(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, as well as a validation dataset {(X_i^va, y_i^va)}_{i=1}^{N_va} that is separate from the training dataset. For a given HP-config c, the corresponding out-of-sample validation error OOSV(c) is equal to

    OOSV(c) = (1/N_va) Σ_{i=1}^{N_va} ℓ(M(X_i^va), y_i^va),  where  M = A({(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, c).

Essentially, OOSV(c) evaluates the performance of the resulting ML model M, which is constructed with the training dataset {(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, on another set of unseen validation data {(X_i^va, y_i^va)}_{i=1}^{N_va}. To identify an effective HP-config c, we would like to identify a c such that OOSV(c) is small by using Bayesian optimization algorithms, which are introduced and motivated in the next Section.
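The definition of OOSV(c) translates directly into code. The sketch below is our own illustration (the estimator and HP names are assumptions): it uses a random forest as the construction procedure A, and the absolute error as the prediction error function ℓ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def oosv(hp_config, X_tr, y_tr, X_va, y_va):
    """Out-of-sample validation error OOSV(c): construct M = A(training data, c),
    then report the mean prediction error l(y, y') = |y - y'| on the
    held-out validation dataset."""
    model = RandomForestRegressor(**hp_config, random_state=0)  # A(data, c)
    model.fit(X_tr, y_tr)
    return float(np.mean(np.abs(model.predict(X_va) - y_va)))

# Hypothetical data, split into training and validation sets.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2.0 * X[:, 1]
err = oosv({"n_estimators": 50, "max_depth": 5},
           X[:200], y[:200], X[200:], y[200:])
print(err >= 0.0)  # True
```

Different HP-configs c yield different validation errors, and it is this mapping c ↦ OOSV(c) that the hyper-parameter optimization problem minimizes.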

BAYESIAN OPTIMIZATION FOR HYPER-PARAMETER OPTIMIZATION
In this Section, we define the hyper-parameter optimization problem and illustrate its intractability. Then, we outline Bayesian optimization algorithms for solving the hyper-parameter optimization problem to near-optimality efficiently, and highlight two prominent Bayesian optimization algorithms, SMAC and GP. Finally, we also describe the Random Search algorithm, which is a simple but effective algorithm for identifying a good HP-config.
Suppose that we are given a fixed training dataset {(X_i^tr, y_i^tr)}_{i=1}^{N_tr} and validation dataset {(X_i^va, y_i^va)}_{i=1}^{N_va}, and we restrict our search of an HP-config c to the search space C. The hyper-parameter optimization problem is defined as follows:

    min_{c ∈ C} OOSV(c).    (2)

Given an optimal solution c* to the optimization problem (2), we can then perform A({(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, c*), which returns an ML model M*, our desired ML model that has a low out-of-sample error (in particular, a minimum prediction error on the validation dataset {(X_i^va, y_i^va)}_{i=1}^{N_va}). Unfortunately, the optimization problem (2) is usually intractable. Indeed, the search space is typically large. In addition, the function OOSV is typically neither convex nor monotonic in c. Moreover, the evaluation of OOSV can be computationally expensive. For example, in the case of MLPs, the evaluation requires training deep neural networks on the training data, which can be time consuming when the number of layers is large or when the training dataset is large. Consequently, it is infeasible to conduct a brute-force search and evaluate OOSV on every HP-config c in order to minimize OOSV. This motivates the design of Bayesian optimization algorithms, which converge to a near-optimal HP-config c with a small number of evaluations of OOSV.
To illustrate the idea of Bayesian optimization, we first think about how a human agent tunes the HP-config for constructing an ML model. Typically, the human agent starts with a few randomly chosen HP-configs to get a sense of how different HP-configs perform. After that, the human agent is able to gain some intuition about the effects of varying different hyper-parameters. He/she is then able to narrow down the search space C in his/her subsequent search. For example, in training MLPs, the human agent could discover certain characteristics about learning rates: it could be that the training process always diverges when the learning rate is higher than 0.3, but always converges when the learning rate is lower than 0.1. Based on this insight, he/she can focus on trying various HP-configs with learning rates at most 0.1. While such an effort often leads to a competent choice of HP-config, the process of extracting such insights is often laborious and requires domain knowledge. Is it possible to automate such an HP-config optimization process, and save the laborious effort by humans?

Bayesian optimization algorithms provide an answer to the question above. A Bayesian optimization algorithm is an online algorithm, which involves evaluating different HP-configs in iterations. At iteration t, a Bayesian optimization algorithm determines the HP-config c_t ∈ C for evaluation, based on the previously tested HP-configs c_1, ..., c_{t−1} and their respective evaluations OOSV(c_1), ..., OOSV(c_{t−1}). Such an online decision procedure mirrors the sequential nature of the human agent, who tries to optimize the choice of HP-config by a series of trials and errors on different HP-configs, as previously discussed.
We provide the pseudo-code for a typical Bayesian optimization algorithm in Algorithm 1. Essentially, the online selection of HP-configs c_1, ..., c_T ∈ C is guided by the functions OOSV_1, ..., OOSV_T, which serve to approximate the intractable function OOSV. In the end, a Bayesian optimization algorithm returns an HP-config c*_T, which has the best evaluated value of OOSV among the evaluated HP-configs c_1, ..., c_T. To fully specify a Bayesian optimization algorithm, we need to specify the construction of OOSV_t : C → R, which is an approximation function to the function OOSV : C → R. The construction is based on {(c_s, OOSV(c_s))}_{s=1}^{t−1}. The approximation function OOSV_t serves to crystallize the insights gained in experimenting with c_1, ..., c_{t−1}, similar to how a human agent extracts insights from his/her experience with different HP-configs, as previously discussed. By putting forth a Bayesian optimization algorithm, we automate the HP-config optimization procedure. The construction of OOSV_t is based on certain statistical techniques, hence the name Bayesian optimization algorithms.

Algorithm 1 Algorithmic framework of Bayesian optimization, with T evaluations of OOSV
1: Let OOSV_1 : C → R≥0 be a random function.
2: for t = 2, ..., T do
3:   Construct OOSV_t based on {(c_s, OOSV(c_s))}_{s=1}^{t−1}.
4:   Compute HP-config c_t, which solves min_{c ∈ C} OOSV_t(c).
5:   Evaluate OOSV at c_t, which returns OOSV(c_t).
6: end for

Algorithm 2 Random Search, with T evaluations of OOSV
1: for t = 1, ..., T do
2:   Sample HP-config c_t uniformly at random from C.
3:   Evaluate OOSV at c_t, which returns OOSV(c_t).
4: end for
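To make the framework of Algorithm 1 concrete, here is a minimal, simplified sketch (our own illustration, not the SMAC or GP implementation used in the experiments): the surrogate OOSV_t is a random forest fit on the evaluated pairs, and the next HP-config is chosen by minimizing the surrogate over a pool of random candidates, omitting the acquisition/exploration machinery of the real algorithms.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def bayes_opt(oosv, sample_config, T=30, n_candidates=200, seed=0):
    """Sketch of Algorithm 1: maintain a surrogate OOSV_t fit on the past
    evaluations {(c_s, OOSV(c_s))}, pick the candidate HP-config minimizing
    the surrogate, evaluate it, and finally return the best evaluated config.
    A random-forest surrogate makes this a simplified, SMAC-flavored variant."""
    rng = np.random.default_rng(seed)
    configs = [sample_config(rng)]          # c_1 chosen at random
    scores = [oosv(configs[0])]             # OOSV(c_1)
    for _ in range(1, T):
        surrogate = RandomForestRegressor(n_estimators=20, random_state=0)
        surrogate.fit(np.array(configs), np.array(scores))   # build OOSV_t
        candidates = np.array([sample_config(rng) for _ in range(n_candidates)])
        c_next = candidates[np.argmin(surrogate.predict(candidates))]
        configs.append(list(c_next))
        scores.append(oosv(c_next))         # evaluate OOSV(c_t)
    best = int(np.argmin(scores))
    return configs[best], scores[best]      # c*_T and OOSV(c*_T)

# Toy OOSV for illustration: a quadratic bowl over a 2-D continuous HP space.
toy_oosv = lambda c: (c[0] - 0.3) ** 2 + (c[1] + 0.5) ** 2
sample = lambda rng: list(rng.uniform(-1.0, 1.0, size=2))
c_best, v_best = bayes_opt(toy_oosv, sample, T=30)
print(v_best >= 0.0)  # True
```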
There are two prominent ways to construct an approximation function OOSV_t, giving rise to the following two prominent Bayesian optimization algorithms:
1. SMAC, proposed by (Hutter et al., 2011), which constructs the approximation function OOSV_t by random forest regression on {(c_s, OOSV(c_s))}_{s=1}^{t−1};
2. GP, proposed by (Srinivas et al., 2010), which constructs the approximation function OOSV_t by combining Gaussian Process regression on {(c_s, OOSV(c_s))}_{s=1}^{t−1} with optimistic exploration.
More details about SMAC and GP can be found in the survey (Shahriari et al., 2016). Finally, apart from Bayesian optimization algorithms, the Random Search algorithm is also a way to compute an efficient HP-config under the intractability of the hyper-parameter optimization problem (2), cf. (Bergstra & Bengio, 2012). Essentially, the Random Search algorithm with T evaluations of OOSV simply involves evaluating OOSV on T randomly chosen HP-configs. Then, the algorithm returns the HP-config with the smallest value evaluated under OOSV. The Random Search algorithm is stated in Algorithm 2. While the Random Search algorithm is conceptually simple, it is often inferior to Bayesian optimization algorithms in identifying a competent HP-config, as illustrated by the numerical experiments in the next Section.
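Algorithm 2 is short enough to state directly in code. The sketch below is our own illustration: it evaluates OOSV on T uniformly sampled HP-configs and keeps the best.

```python
import numpy as np

def random_search(oosv, sample_config, T=30, seed=0):
    """Algorithm 2: evaluate OOSV on T HP-configs sampled uniformly from C
    and return the config with the smallest evaluated value."""
    rng = np.random.default_rng(seed)
    best_c, best_v = None, float("inf")
    for _ in range(T):
        c = sample_config(rng)   # sample c_t uniformly at random from C
        v = oosv(c)              # evaluate OOSV(c_t)
        if v < best_v:
            best_c, best_v = c, v
    return best_c, best_v

# Toy OOSV for illustration: a quadratic bowl over a 2-D continuous HP space.
toy_oosv = lambda c: (c[0] - 0.3) ** 2 + (c[1] + 0.5) ** 2
sample = lambda rng: rng.uniform(-1.0, 1.0, size=2)
c_best, v_best = random_search(toy_oosv, sample, T=50)
print(v_best >= 0.0)  # True
```

Unlike Algorithm 1, no information from earlier evaluations guides later samples, which is why Random Search typically needs more evaluations to reach a comparably good HP-config.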

HYPER-PARAMETER TUNING FOR MACHINE HEALTH PROGNOSTIC TASKS WITH PHM 2016 DATASETS
In this Section, we demonstrate the effectiveness of Bayesian optimization for automating the HP-config optimization of the machine health prognostic task for the PHM 2016 dataset (PHM, 2016). We first describe the ML task and the dataset involved in Section 4.1. Then, in Section 4.2, we describe our construction procedure A_PHM for constructing an ML model M_PHM for the PHM 2016 Data Challenge task. We also highlight the HP-config c_PHM involved in the construction. Then, in Section 4.3, we define the hyper-parameter optimization problem for the task, and provide the search space C_PHM for the optimization. Finally, we present the numerical results in Section 4.4.

Prognostic Task in the PHM 2016 Data Challenge
A brief description of the task. The PHM 2016 data challenge task involves the investigation of a wafer Chemical-Mechanical Planarization (CMP) tool that removes material from the surface of the wafer through a polishing process. The goal of the task is to predict the average polishing removal rate, based on the sensor data collected during the polishing process. The challenge task (PHM, 2016) provides a collection of time-series-label pairs, where each pair records the sensor data collected during a polishing process.
Data description. The time-series-label pair for the ith polishing process is denoted as (X_i, y_i). The time series data is expressed by the matrix X_i ∈ R^{26×T_i}, where T_i is the number of time steps, and there are 26 sensor readings at each time step.² Different processes could have different time lengths, i.e. T_i could vary with i. The 26 entries include sensor readings such as the chamber pressure and the usage measure of the dresser in the wafer CMP tool. For more details, please consult (PHM, 2016). Finally, the label y_i ∈ R≥0 is the average polishing removal rate in the ith polishing process.
The collection of time-series-label pairs is organized in three datasets: the training dataset {(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, the test dataset {(X_i^te, y_i^te)}_{i=1}^{N_te}, and the validation dataset {(X_i^va, y_i^va)}_{i=1}^{N_va}, which contain N_tr = 1977, N_te = 424, and N_va = 424 time-series-label pairs, respectively. Typically, T_i is roughly equal to 300 in each dataset, but the quantity T_i could vary from pair to pair. The label y_i in a time-series-label pair typically lies in the range [40, 200].
Objective. The objective of the PHM data challenge task is to construct a machine learning model M_PHM, so that the mean squared error on the test dataset {(X_i^te, y_i^te)}_{i=1}^{N_te} is minimized.

(² The time unit for a time step is not given in the task.)

Description of Our ML Model Construction Procedure A PHM
In this subsection, we define our construction A_PHM of the desired machine learning model M_PHM, and explain the HP-config c_PHM involved in the construction. By the definition of an ML model construction, we have

    M_RF = A_PHM({(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, c_PHM).    (4)

Before describing A_PHM, we first provide a high-level view of the HP-config c_PHM. The HP-config c_PHM consists of hyper-parameters for pre-processing the time-series-label pairs, as well as hyper-parameters for tuning the random forest regression model used in the construction. In the Appendix, we replace the random forest regression in A_PHM by a multi-layer perceptron model. The construction procedure A_PHM is described in the pseudo-code in Algorithm 3. The procedure A_PHM mainly consists of two steps. First, it involves a pre-processing step called block decomposition, which transforms each time-series-label pair (X_i^tr, y_i^tr) into a number of block-augmented-label pairs {(X̃_{i,j}^tr, ỹ_{i,j}^tr)}_{j=1}^{c_num-block}. The block decomposition step serves to extract succinct feature vectors from the time-series-label pairs. Second, it involves building a random forest regression model M_RF, where this ML model inputs a block and outputs a prediction on the augmented label. The procedure A_PHM finally outputs M_RF.
The pre-processing step BLOCKDEC is detailed in Algorithm 4. Essentially, BLOCKDEC extracts a number of blocks {X̃_{i,j}^tr}_{j=1}^{c_num-block} from the time series data X_i^tr, and attaches an augmented label ỹ_{i,j}^tr to each block X̃_{i,j}^tr. In the algorithm, the notation X[:, a : b] denotes the sub-matrix of X formed by the columns a, a+1, ..., b of X. Essentially, the c_num-block blocks are sub-matrices with 26 rows, where every two consecutive blocks are t_sep time steps apart. The quantity t_sep is chosen such that the blocks are as spread out as possible. Each block serves as a "snapshot" of the time series of sensor data, and each augmented label serves as a proxy for the removal rate during the snapshot. We define the augmented label as ỹ_{i,j}^tr = y_i^tr; that is, the removal rates across different blocks are constant. We believe that, with more information about the wafer CMP tool, we can enrich the augmented labels.

The procedure A_PHM thus inputs the training dataset and an HP-config c_PHM, and returns a random forest regression model M_RF, which can be used to predict the average removal rate by Algorithm 5. The out-of-sample validation error OOSV_PHM(c_PHM), which is on the validation dataset {(X_i^va, y_i^va)}_{i=1}^{N_va}, is defined as follows. First, we compute the random forest regression model M_RF according to equation (4). Then, for each time series data X_i^va, we compute the prediction ŷ_i^va according to Algorithm 5. Finally, we have

    OOSV_PHM(c_PHM) = (1/N_va) Σ_{i=1}^{N_va} (ŷ_i^va − y_i^va)²,    (5)

which is the out-of-sample mean squared error on the validation dataset. This serves to indicate the mean squared error on the test dataset, which is

    Test_PHM(c_PHM) = (1/N_te) Σ_{i=1}^{N_te} (ŷ_i^te − y_i^te)².
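To illustrate the block decomposition described above, here is a hypothetical reading of BLOCKDEC. The function name, the parameter names `num_blocks` and `block_len`, and the exact spacing rule are our assumptions based on the description; the official Algorithm 4 may differ in its details.

```python
import numpy as np

def block_decompose(X, y, num_blocks, block_len):
    """Illustrative BLOCKDEC sketch: cut `num_blocks` sub-matrices
    X[:, a : a + block_len] out of a 26 x T time series, spacing consecutive
    blocks t_sep steps apart so they are as spread out as possible, and
    attach the augmented label y_tilde = y to every block (constant removal
    rate across blocks, as in the text)."""
    T = X.shape[1]
    t_sep = max((T - block_len) // max(num_blocks - 1, 1), 1)
    blocks, labels = [], []
    for j in range(num_blocks):
        a = min(j * t_sep, T - block_len)   # keep the block inside the series
        blocks.append(X[:, a:a + block_len])
        labels.append(y)                    # augmented label y_tilde = y
    return blocks, labels

# One hypothetical polishing process: 26 sensors over ~300 time steps.
X = np.random.default_rng(2).normal(size=(26, 300))
blocks, labels = block_decompose(X, y=120.0, num_blocks=5, block_len=20)
print(len(blocks), blocks[0].shape)  # 5 (26, 20)
```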
To identify a competent HP-config, we solve the hyper-parameter optimization problem

    min_{c ∈ C_PHM} OOSV_PHM(c),    (6)

where the function OOSV_PHM is defined in equation (5). The HP-config search space C_PHM is displayed in Table 1. The search space C_PHM is the Cartesian product of the search ranges shown in Table 1; an HP-config can be retrieved by taking one element of the search range in each row. Note that the first two HPs are for the pre-processing procedure BLOCKDEC, and the last four HPs are for the random forest regression model M_RF. For more details about these HPs, please consult the documentation of Scikit-Learn (Pedregosa et al., 2011), the Python package for ML used in our experimentation. It is interesting to note that the search space C_PHM contains a very large number of HP-configs altogether. Such a large space precludes any possibility of a brute-force search for the HP-config that minimizes OOSV_PHM. In the next Section, we solve the HP-config optimization problem (6) to near-optimality by the Bayesian optimization algorithms SMAC and GP, as well as the Random Search algorithm. For each of these algorithms, we perform 200 function evaluations. We compare the performance of these algorithms with the default HP-config defined by a human agent.

Numerical Results
In this Section, we provide the experimental results on hyper-parameter optimization for the PHM 2016 Data Challenge task. From the results, we see that Bayesian optimization algorithms are able to identify better HP-configs than the default HP-config set by a human agent, which can be found under the "Human" column in Table 2. Here, the default hyper-parameters for pre-processing the data are chosen based on a few trials and errors. The default hyper-parameters for the random forest regression model follow (Pedregosa et al., 2011).
First, we consider Figure 1, which provides a macroscopic view of these algorithms by comparing their out-of-sample validation errors, and also compares these errors with the validation error under the default HP-config. For each of these algorithms, we plot the best-so-far out-of-sample validation errors, which are defined as follows. First, recall that the sequence of HP-configs experimented by an algorithm is denoted as c_1, ..., c_T. Now, for each t ∈ {1, ..., 200}, we first identify c*_t ∈ {c_1, c_2, ..., c_t}, for which OOSV_PHM(c*_t) = min_{s ∈ {1,...,t}} OOSV_PHM(c_s). Thus, OOSV_PHM(c*_t) is the best out-of-sample validation error achieved by the HP-configs in {c_1, c_2, ..., c_t}. Altogether, the sequence of best-so-far out-of-sample validation errors under an algorithm is OOSV_PHM(c*_1), OOSV_PHM(c*_2), ..., OOSV_PHM(c*_200). Clearly, for each algorithm, the sequence is non-increasing, and the corresponding plotted curve of the best-so-far out-of-sample errors in Figure 1 drops once the algorithm identifies a better HP-config than those experimented in previous iterations.
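The best-so-far sequence {OOSV_PHM(c*_t)} described above is simply a running minimum of the per-iteration errors; with hypothetical error values (our own, for illustration) it can be computed as:

```python
import numpy as np

# Best-so-far curve: OOSV(c*_t) = min_{s <= t} OOSV(c_s), i.e. a running
# minimum over the sequence of evaluated validation errors.
errors = np.array([5.2, 4.8, 6.1, 4.8, 3.9, 4.4])   # hypothetical OOSV(c_t)
best_so_far = np.minimum.accumulate(errors)
print(best_so_far.tolist())  # [5.2, 4.8, 4.8, 4.8, 3.9, 3.9]
```

The resulting sequence is non-increasing by construction, which is exactly the shape of the curves plotted in Figure 1.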
Figure 1 shows that the algorithms GP, SMAC and Random Search are able to identify better HP-configs (in terms of out-of-sample errors) as each of these algorithms experiments with more HP-configs. We see that all three algorithms identify better HP-configs than the default (in terms of out-of-sample errors) at the end of 200 function evaluations. Moreover, the Bayesian algorithms GP and SMAC, which conduct experimentation on HP-configs in a more principled manner than the Random Search algorithm, are able to identify better HP-configs than the Random Search algorithm.
To shed light on the process of hyper-parameter optimization, we then proceed to Figure 2. For each of the algorithms, the sequence of best-so-far out-of-sample validation errors {OOSV_PHM(c*_t)}_{t=1}^{200} and the sequence of out-of-sample validation errors {OOSV_PHM(c_t)}_{t=1}^{200} are plotted in solid lines and dots, respectively.
We observe that Algorithm GP is the most stable, in the sense that the sequence of out-of-sample validation errors (plotted in dots) stays close to the sequence of best-so-far out-of-sample validation errors (plotted in a line). Algorithm SMAC also manifests such stability. Nevertheless, the Random Search algorithm is far less stable: its sequence of out-of-sample validation errors lies far above the sequence of best-so-far out-of-sample validation errors. This signifies that Random Search could converge to a competent HP-config slowly, in contrast to the Bayesian optimization algorithms.
Next, in Figure 3, we compare the out-of-sample errors on the validation set (computed by evaluating OOSV_PHM) with the errors on the testing set. For each of the algorithms GP, SMAC and Random Search, we plot the sequence {OOSV_PHM(c*_t)}_{t=1}^{200} showing the validation errors, and the sequence {Test_PHM(c*_t)}_{t=1}^{200} showing the testing errors. The plots in Figure 3 show that the constructed ML models generalize well, in the sense that the trend of the testing errors follows the trend of the validation errors.
In Figure 4, we compare the testing errors of the different HP-config optimization algorithms against the baseline given by the default HP-config. It is demonstrated that all three algorithms achieve performance superior to the human baseline, signifying the value of HP-config optimization, which automates the process of finding a competent HP-config. Finally, Table 2 provides the competent HP-configs identified by GP, SMAC and Random Search, as well as the default HP-config used by a human agent. The default HPs for the random forest regression model follow (Pedregosa et al., 2011).

CONCLUDING REMARKS
In conclusion, we have introduced hyper-parameter optimization and its application to automatically finding a good hyper-parameter configuration in machine learning tasks for machine health prognostics. We evaluated hyper-parameter optimization algorithms on the PHM 2016 Data Challenge, with promising results. A future direction is to seek a way to automate the process of feature selection and feature engineering using similar ideas.

APPENDIX 6. HYPER-PARAMETER OPTIMIZATION FOR THE PHM 2016 CHALLENGE TASK WITH MULTI-LAYER PERCEPTRONS
In this Section, we conduct hyper-parameter optimization with the algorithms GP, SMAC and Random Search in a similar way to Section 4. The only difference is that we use multi-layer perceptrons (MLPs) instead of random forests. Since the experimental set-up and the results are similar to those in Section 4, we only elaborate on the main difference, which is the HP-config space.
Here, we consider a two-layer perceptron model. First, we describe the hyper-parameters concerning the network architecture. The internal layers are layers 1 and 2, while layer 3 is the output layer. For each i = 1, 2, we use Hidden_i to denote the number of nodes in layer i, Activation_i to denote the type of activation function used in the layer, and Dropout_i to denote the dropout rate in the layer. Second, we describe the hyper-parameters for training the MLP by the backward propagation (BP) algorithm. Batch size is the number of samples fed into the BP algorithm in every iteration. Learning rate, Decay and Momentum concern the rate at which the BP algorithm absorbs the information carried by each mini-batch. For more information about training neural networks, please consult (Chollet et al., 2015). The search space for the HP-config optimization task is provided in Table 3:

    Activation 1    {"selu", "relu", "tanh", "sigmoid"}
    Dropout 1       {1., 0.9, 0.8}
    Hidden 2        {16, 32, 64, 128, 256, 512}
    Activation 2    {"selu", "relu", "tanh", "sigmoid"}
    Dropout 2       {1., 0.9, 0.8}
    Activation 3    {"selu", "relu", "tanh", "sigmoid"}
    Batch size      {4, 8, 16, 24, 32}
    Learning rate   {0.05, 0.01, 0.008, 0.005, 0.001}
    Decay           {0., 1e-6, 1e-5, 1e-4}
    Momentum        {0.9, 0.8, 0.7, 0.6, 0.}

Finally, the numerical results for hyper-parameter optimization for improving MLP models can be found in Figures 5, 6, 7 and 8. In general, the discussions for Figures 5, 6, 7 and 8 parallel those for Figures 1, 2, 3 and 4. In addition, the HP-configs identified by the algorithms GP, SMAC and Random Search, as well as the default HP-config chosen by a human agent, are provided in Table 4. The default HP-config is identified through trial and error by a human agent. Interestingly, the dropout rates should always be kept at 1.

Figure 1. OOSV_PHM with the best HP-config identified so far.

Figure 4. Mean squared error and mean absolute error on the testing dataset.

Figure 5. Validation errors with the best HP-config so far.
Algorithm 3 Procedure A_PHM for the PHM 2016 Data Challenge
1: Input data: training data {(X_i^tr, y_i^tr)}_{i=1}^{N_tr}, where X_i^tr ∈ R^{26×T_i}, y_i^tr ∈ R.
2: Input HP-config: HP-config c = (c_pre, c_RF).
3: for i ∈ {1, ..., N_tr} do
4:   Compute the block-augmented-label pairs {(X̃_{i,j}^tr, ỹ_{i,j}^tr)}_{j=1}^{c_num-block} = BLOCKDEC(X_i^tr, y_i^tr, c_pre).
5: end for
6: Collect all pairs into the dataset D = {(X̃_{i,j}^tr, ỹ_{i,j}^tr)}_{i,j}.
7: Construct a random forest regression model M_RF on D: M_RF = A_RF(D, c_RF).
8: Return the random forest regression model M_RF.

Finally, we note that the ML model M_RF output by the procedure A_PHM only provides predictions for a block, but not for a time series of sensor data X in general. Nevertheless, we can readily use BLOCKDEC in conjunction with M_RF to predict the removal rate for a time series of sensor data, as illustrated in Algorithm 5. Altogether, we have defined the construction procedure A_PHM in Algorithm 3: it inputs the training dataset {(X_i^tr, y_i^tr)}_{i=1}^{N_tr} and an HP-config c_PHM, and returns the random forest regression model M_RF.

Table 1. Table of HP-config search space C_PHM for A_PHM.

Table 2. Table of HP-configs c*_T identified through automated HP-config tuning, and a comparison with an HP-config chosen by a human agent.

Table 3. Table of hyper-parameter search space for predicting the average removal rate with MLP models.
Table 4. Table of HP-configs identified through automated HP-config tuning, and a comparison with an HP-config chosen by a human agent. (sigm = sigmoid)