A Probabilistic Machine Learning Approach to Detect Industrial Plant Faults

Fault detection in industrial plants is an active research area as more and more sensor data are collected throughout the industrial process. Automatic data-driven approaches are widely needed and seen as a promising area of investment. This paper proposes an effective machine learning algorithm to predict industrial plant faults based on classification methods such as penalized logistic regression, random forest and gradient boosted trees. A fault's start time and end time are predicted sequentially in two steps by formulating the original prediction problems as classification problems. The algorithms described in this paper won first place in the Prognostics and Health Management Society 2015 Data Challenge.


Introduction
Fault detection in industrial plants is an active research topic as more and more sensor data are being collected throughout the industrial process, and standard systems based on univariate statistical process control lack power in these more complex settings. Early detection of faults can help avoid system shut-downs and component failures, or even catastrophes [Korbicz et al., 2012].
Many machine learning algorithms used in pattern classification are now being utilized in fault detection. Dimension reduction techniques, such as principal component analysis, partial least squares, and Fisher's discriminant analysis, have been applied to detect faults in chemical processes [Chiang et al., 2000, Chiang et al., 2004, Yin et al., 2012]. Support vector machines and artificial neural networks are also widely used methods for fault detection; they have been applied to gearbox failure detection [Samanta, 2004] and chemical process fault diagnosis [Wang and Yu, 2005]. K-nearest neighbor and fuzzy logic are two other powerful methods that have been used to detect faults in semiconductor manufacturing processes [He and Wang, 2007] and mechanical systems [Korbicz et al., 2012]. Tree-based algorithms such as random forest and gradient boosted trees are useful machine learning algorithms in situations where one expects nonlinear and interactive effects between covariates. They have been applied to fault detection in aircraft systems [Lee et al., 2014].
This year's Prognostics and Health Management (PHM) Society data challenge focused on plant fault detection. We try many of the above machine learning techniques and ultimately use a combination of several in our final detection strategy, described herein. The rest of the paper is organized as follows. Section 2 discusses the data challenge problem. Section 3 introduces the relevant methodologies and our algorithm. Finally, Section 4 concludes the paper and discusses future work.

Problem Statement
The objective of this year's challenge is to design an algorithm to predict plant faults. Correct prediction involves predicting the type of fault (one of five), as well as the start and end time of each fault, within one hour.
Three datasets are given, training, test, and validation; they contain information on 33, 15, and 15 plants, respectively. For each plant three files are provided: plant-#a.csv, plant-#b.csv, plant-#c.csv, where # is the plant id. File (a) contains time series readings of 4 sensors (S1-S4) and 4 reference signals (R1-R4) from each plant component. The number of components (N m) varies by plant; data on Sj and Rj for the ith component are denoted mi Sj and mi Rj, respectively, where i ∈ {1, . . . , N m} and j ∈ {1, 2, 3, 4}. File (b) contains time series data for cumulative energy consumed (E1) and instantaneous power (E2) from a fixed number of zones within a given plant. Each plant zone covers one or more of the plant components, and the number of zones (N n) varies by plant. The notation ni Ej is used to represent the reading of Ej for the ith zone. File (c) contains plant fault events, each characterized by a start time, an end time, and a failure type. Data are given on 6 different fault types (F 1 − F 6), but only faults 1-5 are to be predicted. The training dataset has complete fault event data, and is used to train the model. The test dataset has complete fault event data for the first half of the sample, but approximately 50% of the events in the second half of the data have been randomly removed. The boundary between the first and second half of the data is given, and referred to as the boundary time. Our goal is to predict the deleted fault events. The validation dataset is similar in structure to the test dataset.
Each team participating in the contest is permitted to submit their predictions of the missing faults in the test data (fault type, and start and end time) once each week to assess their prediction performance and use the score as feedback to improve their model. The final team rank is determined by the score of a submission of predictions based on the validation dataset. The data can be downloaded from the NASA Ames Prognostics Data Repository [Rosca et al., 2015].

Data Description and Preprocessing
We began our analysis by first studying the data to garner any information that would be useful in predicting the faults. Not only do the number of both zones and components vary by plant, but the proportion of each fault type (P F i) varies quite dramatically. To illustrate, Table 1 summarizes the data for the first five plants in both the training and test datasets.
Note that F 3 (fault type 3) never occurs in plants 2, 3, 5 and 42, and F 5 never occurs in plants 2, 3, 5, 41, 42 and 46. We also notice that S3, R1, R2, R3, R4 appear to be categorical variables, and S1, S2, S4 appear to be continuous variables. The number of unique levels of all categorical variables for the same sample plants is summarized in Table 2. Given the above differences across plants and variables, we built a separate model for each plant. The sampling interval for the data provided was theoretically 15 minutes; however, some logging delays resulted in irregular intervals. To preprocess the data, we rounded all timestamps to obtain regular 15-minute gaps, and then combined all three files. We define new variables TTF F k, k = 1, . . . , 6, to represent the time to failure of fault type k. A negative value, −i, means the next fault is i intervals in the future (1 interval is 15 minutes), and a positive value, i, means the current fault started i intervals ago and has not yet ended. We define E3 as the first-order difference of E1, i.e., E3(t) = E1(t) − E1(t − 1). E3 measures the energy consumed in the most recent 15 minutes and, like E2, is a way to measure instantaneous power. We also define start F k, k = 1, . . . , 6, as a binary indicator of whether any type k fault starts within one hour of the corresponding timestamp, and end F k, k = 1, . . . , 6, as a binary indicator of whether any type k fault ends within one hour of the corresponding timestamp. Occasionally, observations of covariates at some timestamps are missing. Forward imputation was applied to all covariates to impute these missing values, except for TTF F k, start F k and end F k, which were imputed with the values -999, 0 and 0, respectively.
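The preprocessing steps above (rounding timestamps to a regular 15-minute grid, deriving E3 as the first difference of E1, and forward imputation) can be sketched in pandas. The column names and function interface here are illustrative, not taken from the competition files:

```python
import pandas as pd

def preprocess(df):
    """Align readings to a regular 15-minute grid and derive E3.

    df is assumed to have a datetime column 'time', cumulative energy
    'E1', and arbitrary sensor columns that may contain gaps.
    """
    df = df.copy()
    # Round irregular timestamps (logging delays) to the nearest 15-minute mark.
    df["time"] = df["time"].dt.round("15min")
    df = df.drop_duplicates("time").set_index("time")
    # Reindex onto a complete 15-minute grid so gaps become explicit NaNs.
    grid = pd.date_range(df.index.min(), df.index.max(), freq="15min")
    df = df.reindex(grid)
    # E3: energy consumed in the most recent 15 minutes (first difference of E1).
    df["E3"] = df["E1"].diff()
    # Forward imputation for missing covariate values.
    return df.ffill()
```

In the actual pipeline, TTF F k, start F k and end F k would be filled with -999, 0 and 0 instead of being forward-filled.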
To illustrate these preprocessing steps, a small portion of plant 1's data is shown in Table 3. The imputation simplifies the analysis and, from the authors' observation, has little influence on the modeling results. There are segments of time where all covariates are missing and fault type 6 is occurring. We assumed the plant must be in some type of maintenance mode during these periods, and we excluded these observations from the following analysis.

Visualization
Visualization was key to our understanding of the data.
First, we observe that R2, R3 and R4 are highly positively correlated, and S2 and S4 are highly negatively correlated, across all components in all plants. To illustrate this finding, Figure 1 shows the correlation heatmap of plant 1 for the first two components, where each cell represents the Pearson correlation between two features. Second, by observing the correlation heatmap of mi R2, mi R3, or mi R4 across all components for a given plant, one can identify which components are in the same zone; components in the same zone are highly correlated. For example, Figure 2 shows the correlation heatmap of mi R4 across all 6 components in plant 1. Based on the heatmap, it seems components 1, 3 and 5 of plant 1 belong to one zone, components 2 and 4 belong to another zone, and component 6 belongs to a third zone by itself. Although one can identify which components are zoned together, the groups of components could not always be linked to a specific zone, so this information was ultimately not utilized in our modeling approach.
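The zone identification described above can be sketched as follows. The `zone_groups` helper and its 0.9 correlation threshold are hypothetical choices for illustration, not values from the paper:

```python
import pandas as pd

def zone_groups(r4, threshold=0.9):
    """Group components whose R4 series are highly correlated.

    r4: DataFrame with one column per component's R4 reading.
    Returns a list of sets of column names (a crude zone guess).
    """
    corr = r4.corr()  # Pearson correlation between components
    groups, assigned = [], set()
    for c in corr.columns:
        if c in assigned:
            continue
        # A component joins c's group if its correlation with c is high.
        group = {c} | {d for d in corr.columns if corr.loc[c, d] >= threshold}
        assigned |= group
        groups.append(group)
    return groups
```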
We also find that month and hour are important categorical variables to predict the faults.
Count plots of F 2 by month and hour are shown in Figures 3 and 4 to illustrate this point. F 2 starts most frequently between May and November and between 6:00 and 23:00, but its distribution varies across plants.
Lastly, we observe that, before a fault happens, sensor readings are often increasing or decreasing. These unique patterns can be utilized to predict the start time of the fault.
See Figure 5 for an example, where the mean value of m2 R2 and its corresponding 95% confidence bands are plotted against time to failure of F 1.

Methodology
In this section we introduce our approach and the related methodologies utilized for the PHM competition. The overall approach consists of two parts: preprocessing and modeling. Figure 6 provides an overview of the process implemented. Details of the data preprocessing have been discussed in Section 2.1. After preprocessing, we divide the training data into two parts: cross validation training data and cross validation test data. Mimicking the test dataset and the validation dataset, the cross validation training data has complete fault event data for the first half of the sample, and 50% randomly selected events in the second half. The cross validation test data contains the 50% deleted events in the second half. Our basic approach is to try various models using the cross validation training data and then evaluate their performance based on their ability to forecast faults in the cross validation test data. The winning model is then applied to the test data and the subsequent predictions submitted to PHM for assessment. Here we are not learning the exact model with the cross validation training and test data, as we fit a plant-specific model to each plant. However, we learn things such as which classifier to use, which threshold value to apply, etc. Please refer to Sections 3.3-3.5 for more details.
There are two steps to the modeling process: predict fault start times and then, given these start times, predict fault end times. A detailed flowchart of the modeling process is shown in Figure 7. The modeling procedure outlined is implemented for each fault type, plant by plant. Given a fault type and plant, F 1 in plant 5 for example, we translate the prediction problem into a classification problem (start F 1=1 vs start F 1=0). From the classification model we estimate the probability that F 1 starts during each time interval. We derive the set of predicted fault start times, Ω F 1, based on these estimated probabilities. For each start time in Ω F 1, we then solve another classification problem (end F 1=1 vs end F 1=0) and estimate the probability that F 1 will end in the next 1 to t max time intervals, where t max is an estimated upper bound of fault F 1's duration. These estimated probabilities are used to find the fault end time.
Various machine learning algorithms were tried to solve these classification problems: K-nearest neighbors (KNN), Naive Bayes, gradient boosting machine (GBM), random forest, penalized logistic regression (with ℓ 2 penalty), etc. In the final algorithm, we use gradient boosting machine, random forest and penalized logistic regression. All methods are implemented in SAS or Python Scikit-learn.
In Section 3.1, we give a quick review of the machine learning algorithms used. Section 3.2 discusses how we evaluate the effectiveness of these different approaches. Section 3.3 describes all the features we use in the model and finally, the specific algorithm details to predict a fault's start time and end time are given in Sections 3.4 and 3.5, respectively.

Machine Learning Algorithms
Data-driven or statistical approaches based solely on historical data are seen as the most cost-effective approach for fault detection in complex systems [Aldrich and Auret, 2013]. Machine learning is the key to any data-driven algorithm.
Machine learning algorithms can be categorized as either supervised or unsupervised. In supervised learning, the goal is to predict a response Y based on input features X. All methods in our algorithm belong to supervised learning.
K-nearest neighbor is an instance-based learning algorithm that has a very simple form but works extremely well on many problems. For a test point x 0, we first find the k training points that are closest in distance to x 0, and then classify using a majority vote. K-nearest neighbor can learn very flexible decision boundaries. However, when dealing with high-dimensional data, it is likely to suffer from over-fitting and perform poorly due to the curse of dimensionality.
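A minimal NumPy sketch of the K-nearest neighbor rule described above; the function name and interface are our own:

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=3):
    """Classify x0 by majority vote among its k nearest training points."""
    # Euclidean distance from x0 to every training point.
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    # Majority vote among the k neighbours.
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```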
Naive Bayes [Rish, 2001] is a classification technique based on applying Bayes' theorem.
It assumes conditional independence between features given the class Y = i. Given a class response Y and a p-dimensional feature vector x = (x 1, . . . , x p), Bayes' theorem and the conditional independence assumption give Pr(Y = i | x) ∝ Pr(Y = i) Π_{k=1}^{p} Pr(x k | Y = i), and Naive Bayes classifies x to the class that maximizes this quantity. Despite its oversimplified and sometimes unrealistic conditional independence assumption, Naive Bayes often outperforms other more sophisticated algorithms. It is widely used in text mining and natural language processing.
Logistic regression [Hosmer Jr and Lemeshow, 2004] is widely used in classification problems. However, when the number of input features is large, it performs poorly due to overfitting. Penalized logistic regression avoids the overfitting problems of logistic regression by imposing a penalty on large fluctuations in the estimated parameters. In this paper, we use a penalized logistic regression with ℓ 2 penalty. Besides avoiding overfitting and improving prediction accuracy, this ridge-type penalty is also very computationally efficient.
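In scikit-learn, an ℓ 2-penalized logistic regression of the kind described above can be set up as below; the toy data and the choice C = 1.0 are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# l2-penalized (ridge-type) logistic regression; C is the inverse penalty
# strength, so a smaller C means heavier shrinkage of the coefficients.
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(penalty="l2", C=1.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy linearly-separable rule
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # estimated Pr(Y = 1)
```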
Random forest [Breiman, 2001] is an ensemble learning method that averages over a large collection of de-correlated decision trees. Like bagging, random forest builds decision trees on bootstrapped samples. But unlike bagging, when building each decision tree, random forest considers only a random subset of the features at each split. This de-correlates the decision trees and makes their ensemble less variable. Random forest allows for interaction effects among features, just like any tree-based algorithm, but it corrects for the likely overfitting of individual decision trees. The performance of random forest is comparable to boosting, and it is easier to train and tune [Friedman et al., 2001].
Gradient boosting machine (GBM) [Friedman, 2001, Friedman, 2002] is an ensemble method that combines weak classifiers to form a strong classifier. We use decision trees as our weak classifiers. Unlike random forest, which fits a large number of decision trees in parallel, GBM works in a forward "stagewise" fashion. In each step, GBM first calculates the pseudo-residuals (the negative first-order derivative of the loss function) at the current model, then fits a decision tree to the pseudo-residuals, and finally adds the fitted decision tree to the previous model. There are many parameters to tune in GBM: for example, we can set the order of interaction to consider by specifying the depth of the decision trees, and we can avoid overfitting by specifying a small learning rate. GBM is also very flexible, as users can provide their own loss function. GBM has been part of many winning strategies in data mining competitions.
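A scikit-learn sketch of such a GBM classifier, using the 200 trees of depth 5 reported later for the end-time model; the learning rate and toy data are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# 200 trees of depth 5, matching the settings reported for the end-time
# model; learning_rate=0.1 is an assumed (library default) choice.
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                 learning_rate=0.1)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
# A toy nonlinear, interactive rule of the kind tree ensembles capture well.
y = ((X[:, 0] * X[:, 1] > 0) & (X[:, 2] > -0.5)).astype(int)
gbm.fit(X, y)
p_hat = gbm.predict_proba(X)[:, 1]
```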

Evaluation
Evaluation is the key step to obtain feedback and find the approach that works well for predicting faults with the data at hand. In this competition, each team was allowed to submit a set of predictions only once a week to score their model. However, this is not frequent enough given the large number of possible models and tuning parameters to try. To remedy this problem and allow us to try many approaches, we built our own evaluation system, based on the idea of cross validation. Our cross validation system was basically designed to mimic the competition evaluation/scoring system.
To build our own scoring system, we randomly remove 50% of the faults in the second half of each training dataset. We build the model using the remaining fault data and attempt to predict the deleted events. We then compare the predicted faults E P with the deleted true events E T, and score our model. If a fault event in E T has been correctly predicted in E P (i.e., there exists an event in E P with start time and end time within one hour of the actual start time and end time, and a matching fault type), it is a true positive and receives 10 points. If a fault event in E P has the correct start time and end time but an incorrect fault type, it is a misclassification and that prediction receives -0.01 points. If a fault event in E P has an incorrect start time or end time, it is considered a false positive and receives -0.1 points. If a fault event in E T has not been identified in E P, it is considered a false negative and receives -0.1 points.
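The scoring rules above can be captured in a short function. The event representation and the greedy matching order are our own assumptions, since the official scorer's matching procedure is not fully specified here:

```python
def score_predictions(pred_events, true_events, tol_hours=1.0):
    """Score predicted fault events against held-out true events.

    Each event is a (start, end, fault_type) tuple with times in hours.
    +10 for a true positive, -0.01 for a misclassified type, -0.1 for a
    false positive, and -0.1 for each missed true event.
    """
    matched_true = set()
    score = 0.0
    for ps, pe, pt in pred_events:
        hit = None
        for idx, (ts, te, tt) in enumerate(true_events):
            if idx in matched_true:
                continue
            if abs(ps - ts) <= tol_hours and abs(pe - te) <= tol_hours:
                hit = (idx, tt)
                break
        if hit is None:
            score -= 0.1           # false positive: times wrong
        elif hit[1] == pt:
            score += 10.0          # true positive
            matched_true.add(hit[0])
        else:
            score -= 0.01          # right times, wrong fault type
            matched_true.add(hit[0])
    # Unmatched true events are false negatives.
    score -= 0.1 * (len(true_events) - len(matched_true))
    return score
```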
We found that the above scoring system worked very well in the sense that the order of magnitude of improvement of one classification algorithm over another based on our scoring system was similar to the improvement seen on the leader board. In this way, we could use our scoring system and experiment with many different algorithms and tuning parameters.
The final model achieves a score of 79570 on the training data, and 21015 on the validation data. The average score per plant of the final model on the training data is 2411 with 90% confidence interval from 60 to 5006. The average score per plant on the validation data is 1401.

Feature engineering
We did feature engineering and added features such as month, hour, weekday and time (the number of minutes since 00:00 of the first day of the corresponding year, divided by 60 × 24) to our classification models to predict faults' start and end times. A complete list of the features used to predict faults' start and end times is given in Table 4. Here elapsed t represents the elapsed time since the fault first occurred, as defined in Section 3.5.
We also added lagged covariates of all sensor readings (R1-R4, S1-S4, E2 and E3) to the model. Specifically, for any sensor reading X(t), we included X(t − k) for all nonzero k with min lag ≤ k ≤ max lag. To define these new variables we introduce the following notation: Lk mi Sj(t) = mi Sj(t − k), where k > 0, represents the lagged covariate of mi Sj. In contrast, Rk* mi Sj(t) = mi Sj(t − k), where k < 0 and k* = −k, represents the lead (future) covariate of mi Sj. The time interval is 15 minutes. Based on the description of the competition, we know a fault is independent of data outside a three-hour window. So the smallest min lag and the largest max lag we considered are -12 and 12, respectively. All covariates were standardized to have mean 0 and variance 1 before being fed to the classifiers.

Predict Start Time
For every plant and fault type in the cross validation training, test or validation datasets, we built a separate classification model to predict the start time of deleted events. Start F k, the binary indicator of whether a type k fault starts within one hour, is the response variable Y . To train each model, we include all data from the first half of the sample where we know exactly when all faults do or do not occur. In addition, we also include data from the second half where start F k(t) = 1. That is, we only include the data for the faults that we know occur (i.e., the faults that have not been randomly deleted). We define X train and Y train to be the resulting covariate and response matrices used to train the model. We stack the data from the second half that are not used to train the model to define X test .
X test contains the data where the response start F k (Y test) is unknown. We then estimate p̂ test, where p test = Pr(Y test = 1), and predict the deleted fault start times based on the magnitude of p̂ test. We specify our model as follows to obtain optimal performance. The optimal tuning parameters and thresholds are found by cross validation.
• We tried different min lag and max lag combinations, and the best one we found is min lag = −8 and max lag = 4.
• For k consecutive estimates of p̂ test (i.e., the estimated probability that a fault starts for k consecutive time intervals), we found the largest probability and compared it to a threshold p. If it exceeded p, the corresponding timestamp was saved as a predicted start time. We tested different combinations of values for k and p. The best performing combination was k = 6 and p = 0.75. See Figure 8 for an illustration.
• We experimented with various classifiers including KNN, Naive Bayes, GBM, random forest, and penalized logistic regression. We found that random forest and penalized logistic regression performed the best. Our final algorithm was an ensemble of these two models: we kept all predicted start times from random forest, and then added all predicted start times that were found by penalized logistic regression but not by random forest.
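The window-and-threshold rule from the first bullet can be sketched as follows. Whether the windows of k consecutive estimates overlap is not stated in the text, so non-overlapping windows are assumed here:

```python
import numpy as np

def predict_start_times(p_hat, timestamps, k=6, p=0.75):
    """Turn per-interval start probabilities into predicted start times.

    Scans windows of k consecutive intervals; if the largest probability
    in a window exceeds p, the timestamp at which it occurs is kept as a
    predicted fault start. k = 6 and p = 0.75 are the values the paper
    found best; non-overlapping windows are an assumption.
    """
    starts = []
    for i in range(0, len(p_hat), k):
        window = p_hat[i:i + k]
        j = int(np.argmax(window))
        if window[j] > p:
            starts.append(timestamps[i + j])
    return starts
```

The final start-time set would be the union of the predictions from random forest and those found only by penalized logistic regression.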

Predict End Time
To predict the end time of a plant fault, we built another classification model. As with the start time prediction problem, we estimate a separate model for each fault type and plant.
To explain our modeling approach, suppose we want to find the end time of a predicted type 1 fault (F 1). We first estimate an upper bound for the duration of an F 1 fault (t max) based on all known F 1 events: t max = max(8, q 0.95), where q 0.95 is the 95th percentile of all historical F 1 durations.
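The t max rule translates directly into code; the function name is our own:

```python
import numpy as np

def estimate_t_max(durations, floor=8, q=95):
    """Upper bound on a fault type's duration, in 15-minute intervals:
    the larger of `floor` and the q-th percentile of historical durations."""
    return max(floor, float(np.percentile(durations, q)))
```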
Intuitively, the elapsed time since the fault started is informative for predicting the fault end time, and it enters the model as the covariate elapsed t. The classification model is trained using data from the t max intervals following the onset of each known F 1 event. These data, stacked together, form the matrix of covariates for the training model (X train). The binary indicator end F 1, of whether fault 1 ends within one hour of the corresponding timestamp, serves as the response variable Y train. Once the classification model is trained, it is used to estimate the probability that each of our predicted events will end in any one of the t max time periods following the predicted event start time.
These predictions are based on X test , formed by stacking the t max intervals following the onset of each predicted F 1 event.
Given the small penalty for false negative predictions relative to the reward for a true positive prediction, we allow our system to predict as many as two end times for each predicted event. The first estimated end time is the time period, within the t max periods following our predicted event start, with the largest estimated probability that end F 1=1. To look for a possible second prediction, we delete all observations with timestamps within one hour of this first estimated end time. We then find the remaining time period with the largest probability. If the estimated probability in this period is larger than our threshold p2, it becomes a second end time prediction; if it is less than p2, only one end time is predicted. See Figure 9 for an illustration, where the second end time prediction is elapsed t = 7 and is kept as p̂ > p2 (= 0.2).
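The two-end-time selection just described can be sketched as follows. Representing the one-hour exclusion as 4 fifteen-minute intervals and the masking details are our own assumptions:

```python
import numpy as np

def predict_end_times(p_end, p2=0.2, exclusion=4):
    """Pick up to two end times from per-interval end probabilities.

    p_end[i] is the estimated probability the fault ends i+1 intervals
    after its predicted start. The best interval is always kept; a
    second is kept only if, after masking intervals within one hour
    (`exclusion` = 4 fifteen-minute steps) of the first, the next best
    probability still exceeds p2.
    """
    p_end = np.asarray(p_end, dtype=float)
    first = int(np.argmax(p_end))
    ends = [first]
    # Mask everything within one hour of the first prediction.
    masked = p_end.copy()
    lo, hi = max(0, first - exclusion), min(len(p_end), first + exclusion + 1)
    masked[lo:hi] = -np.inf
    second = int(np.argmax(masked))
    if masked[second] > p2:
        ends.append(second)
    return ends
```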
The following list details the specifics of our final algorithm for the end time classification problem. The optimal tuning parameters and thresholds are found by cross validation.
• The optimal threshold probability for deciding on a second end time is p2=0.2.
• The optimal lag choice is min lag = −8 and max lag = 8.
• We model all covariates as continuous variables.
• We compared the performance of various classifiers including GBM, random forest, and penalized logistic regression. We find that GBM has the best performance. We choose tree number=200 and tree depth=5 for the GBM.
As with the start times, we find the most important covariates in predicting the fault end time by calculating a score based on mean decrease in impurity in GBM; these covariates are listed in Table 7.
Another untried approach is deep learning neural networks. Convolutional neural networks or recurrent neural networks, which have been shown to be powerful tools for modeling large and complex datasets, may yield good results. Convolutional neural networks can automatically consider lagged observations by modeling temporally contiguous observations jointly. Recurrent neural networks maintain an internal state, which allows them to exhibit dynamic temporal behavior. These properties make deep learning neural networks potentially very useful for fault detection on the PHM data.
Lastly, in our approach, adding lagged covariates to the model creates high-dimensional features, and the curse of dimensionality may degrade the classifiers' performance. Techniques such as principal component analysis and functional data analysis can be applied to extract key features from the time series covariates and reduce the feature dimension.