Multivariate Bernoulli Logit-Normal Model for Failure Prediction


 
 
Failures among connected devices that are geographically close may be correlated and may even propagate from one device to another. However, there is little research modeling this problem, owing to a lack of insight into the correlations among such devices. Most existing methods build one model per device independently, so they cannot capture the underlying correlations, which can be important information to leverage for failure prediction. To address this problem, we propose a multivariate Bernoulli Logit-Normal model (MBLN) to explicitly model the correlations among devices and to predict the failure probabilities of multiple devices simultaneously. The proposed method is applied to a water tank data set in which tanks are connected in a local area. The results indicate that our method outperforms baseline approaches in terms of prediction performance, such as the area under the ROC curve.
 
 



INTRODUCTION
Failure prediction is an important problem in industry and has been studied over decades in various areas. Generally, the majority of equipment and industrial components deteriorate after running for a period of time. Failures among multiple devices that are physically connected to each other may propagate: when a component of a system fails, other relevant components may break down too. For example, in a mill plant, when a motor fails, the bearings that are physically connected to it may fail as well. Although the properties of these devices are different, they may be correlated with each other. However, there is little research providing approaches to this critical issue. In this work, we study this problem by modeling the relationships between devices or components.
Huijuan Shao et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
In practice, multiple devices installed in the same location may fail at the same time. In many cases, the relationships between the devices' measurements can have a large impact on predicting their failures. To capture these relationships, model-based techniques use system equations to extract analytical redundancies between devices' measurements (Ragot & Maquin, 2006), (Ferrari, Parisini, & Polycarpou, 2011). Unfortunately, for complex systems, the system models are not easy to develop. Moreover, the environment keeps changing during the system life-cycle. Therefore, reliable models are not always available (Khorasgani, 2017). An alternative is a data-driven solution that uses information from similar devices for failure prediction. Data-driven methods use the system's measurements as features for failure prediction (Khorasgani, Farahat, Ristovski, Gupta, & Biswas, 2018), (Zheng, Ristovski, Farahat, & Gupta, 2017), (Q. Wang, Zheng, Farahat, Serita, & Gupta, 2019). To predict multivariate responses from multiple devices simultaneously with higher accuracy, we use correlated features from these devices. The common features are either explicit or implicitly hidden in the data captured by sensors or monitored events. Compared to using the data of each device or component alone, the combined information from multiple devices or components can borrow strength from each device.
How to use such common features is a challenge. To the best of our knowledge, there is a lack of adequate methods for leveraging correlation in model building. Most research builds an individual model for each type of device and predicts failures for each device separately. A common research direction is multi-task learning (Gong, Ye, & Zhang, 2012a), (Gong, Ye, & Zhang, 2012b), (Gong, Zhou, Fan, & Ye, 2014), (Jalali, Sanghavi, Ruan, & Ravikumar, 2010). These methods build a model for each type of device or component, then aggregate the models and extract similarities among them. Another research direction utilizes regularization to estimate the p×q regression coefficients and obtain the correlation among predictors (Liu, Wang, & Zhao, 2014), (Lozano, Jiang, & Deng, 2013), (Rothman, Levina, & Zhu, 2010), (W. Wang, Liang, & Xing, 2013). But the responses in all these papers are continuous multivariate variables rather than multivariate binary variables, so these methods are not applicable to failure prediction. Other researchers employ the existing network structure of physical models (Khorasgani, Farahat, Hasanzade, & Gupta, 2019) as one of the inputs for failure prediction. In reality, however, we may not be able to obtain the graph structure inside a system or among devices.
In this paper, we introduce a multivariate Bernoulli distribution with a logit transformation to learn the correlation between the predictors and the multivariate responses. We handle the dependency among the multivariate responses by analyzing the relationship between one device and another. Consequently, we build a single model for multiple devices at once rather than creating multiple models. The advantages of our model over others are as follows. First, it captures the correlation among different physical devices by computing a coefficient matrix and an inverse covariance matrix. The learned correlation helps interpret the underlying relationships among multiple devices, even ones with significantly different physical properties, and may thus provide more insight into the domain. Second, our method generates a unified model rather than a combination of multiple models. This makes the model development process much simpler: having a single model for several devices simplifies model management and reduces cloud computing costs in deployment. Moreover, our unified model can be applied to multiple types of devices. This paper is organized as follows. Section 2 briefly describes the multivariate Poisson log-normal model (MVPLN), on which this model is based. In Section 3, we formulate the concurrent multiple-device failure prediction problem as a multivariate Bernoulli logit-normal model. Section 4 focuses on estimating the parameters of this model using the Monte Carlo expectation maximization algorithm. In Section 5, we apply the MBLN model to a water tank dataset generated by a simulator. Section 6 discusses the pros and cons of this approach and proposes future work.

BACKGROUND
There are a few approaches that can capture the correlation of devices with different features. A comparable approach is MVPLN (Wu, Deng, & Ramakrishnan, 2018). The MVPLN model was proposed to solve the prediction problem where both predictors and responses are count data. In order to link the input variables {x_1, ..., x_p} to the output discrete variables {y_1, ..., y_q}, it builds a linear regression model and introduces a latent variable to reflect this relationship. Since the responses are count data, the MVPLN model assumes that the responses follow a multivariate Poisson distribution.
The objective function of MVPLN is the sum of the expected joint likelihood function and l_1 penalties on the two model parameters. To reach the optimal value of the objective function, it utilizes a Monte Carlo expectation maximization (EM) algorithm (Moon, 1996) to estimate the model parameters. In the E-step, an expected log-likelihood function is formulated using Monte Carlo techniques. In the M-step, the optimization problem is not convex; thus, it uses an iterative, alternating approach that fixes one model parameter and solves for the other. In each iteration, one model parameter is solved by Lasso (Tibshirani, 1996) and the other is tackled by Graphical Lasso (Friedman, Hastie, & Tibshirani, 2008). Our research is motivated by failure prediction for multiple devices, where the responses are multivariate binary variables. We assume the responses follow a multivariate Bernoulli distribution, analogous to the multivariate Poisson distribution in the MVPLN model. Different from the log link function in MVPLN, we use the logit function as the link in order to predict binary variables.
The MBLN model overcomes two drawbacks: over-dispersion and zero-inflation. Regarding over-dispersion, the responses in that paper spread over the whole integer space, so simulating the response with a Poisson distribution leads to a very large variance. However, the Poisson distribution has only one free parameter, so the variance cannot be adjusted independently of the mean, i.e., the Poisson over-dispersion problem. In the MVPLN model, zero-inflation appears in the sampling step of the Metropolis-Hastings algorithm, during which many negative values are sampled but discarded. MBLN avoids the zero-inflation problem for two reasons: (1) the response of this model falls between 0 and 1, so the variance of the response is small; (2) this model samples from the multivariate normal distribution directly rather than using the Metropolis-Hastings algorithm, which eliminates the need to discard useless samples.

PROBLEM DEFINITION AND FORMULATION
The following notation is used in this paper. Lower case letters such as x and y denote scalars, whereas bold lower case letters such as x and y represent vectors. The j-th component of the vector x is denoted x_j. Bold calligraphic upper case letters X and Y denote random column vectors. Bold upright upper case letters X and Y stand for matrices. The (j, k) entry of matrix Y is written y_{j,k}.

Multivariate Bernoulli Logit-Normal Model
The input consists of features from multiple devices; these features can be discrete or continuous variables. The output is a multivariate binary response for the different devices. This output can be represented as a multivariate random variable Y = [Y_1, Y_2, ..., Y_q]^T ∈ {0, 1}^q, where the superscript T denotes the transpose. We assume that the binary response Y follows a multivariate Bernoulli distribution. Each dimension Y_k of Y follows a univariate Bernoulli distribution with parameter θ_k. Thus, any dimension Y_k is conditionally independent of the other dimensions given θ_k.
Given the predictor vector x = [x_1, x_2, ..., x_p]^T ∈ R^p, we use a regression model to connect Y and x through the latent vector γ = logit(θ), as Equation (2):

γ = B^T x + ε,  ε ∼ N(0, Σ),    (2)

where B is a p × q coefficient matrix and Σ denotes the q × q covariance matrix that represents the covariance structure of the variable θ. With the conditional independence assumption, the probability mass function of the multivariate Bernoulli random variable y given θ = [θ_1, θ_2, ..., θ_q]^T is

p(Y = y | θ) = ∏_{k=1}^{q} θ_k^{y_k} (1 − θ_k)^{1 − y_k}.

Since γ = log(θ / (1 − θ)) (element-wise) follows the multivariate Gaussian distribution N(B^T x, Σ), its density function is Equation (3):

p(γ | x) = (2π)^{−q/2} |Σ|^{−1/2} exp( −(1/2)(γ − B^T x)^T Σ^{−1} (γ − B^T x) ).    (3)

Applying the change of variables θ_k = 1/(1 + e^{−γ_k}), we derive the density function of θ | x as Equation (4):

p(θ | x) = p(γ | x) ∏_{k=1}^{q} 1 / (θ_k (1 − θ_k)).    (4)
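As a concrete illustration, the generative process defined by Equations (2)-(4) can be sketched in a few lines. The dimensions, B, Σ, and x below are made-up toy values, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: p = 3 features, q = 2 devices (illustrative values only).
p, q = 3, 2
B = rng.normal(size=(p, q))            # p x q coefficient matrix
Sigma = np.array([[1.0, 0.6],          # q x q covariance coupling the devices
                  [0.6, 1.0]])
x = rng.normal(size=p)                 # one predictor vector

# gamma = logit(theta) follows N(B^T x, Sigma), as in Equations (2)-(3).
gamma = rng.multivariate_normal(B.T @ x, Sigma)
theta = 1.0 / (1.0 + np.exp(-gamma))   # element-wise inverse logit
y = (rng.uniform(size=q) < theta).astype(int)  # Bernoulli draw per device
```

Because θ is the logistic transform of a Gaussian draw, it always lies strictly inside (0, 1), which is the property the paper uses to avoid zero-inflation.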
With n predictor vectors X = [x_1, x_2, ..., x_n]^T and responses Y = [y_1, y_2, ..., y_n]^T, the log-likelihood of the multivariate Bernoulli logit-normal model is computed as Equation (5):

ℓ(B, Σ) = Σ_{j=1}^{n} log ∫ p(Y = y_j | θ) p(θ | x_j) dθ,    (5)

where p(Y = y_j | θ) and p(θ | x_j) follow the multivariate Bernoulli distribution and the multivariate logit-normal distribution, respectively. In order to derive the coefficient matrix B and the inverse covariance matrix Σ^{−1}, we introduce l_1 penalties on these two model parameters. The loss function therefore becomes Equation (7):

L(B, Σ^{−1}) = −ℓ(B, Σ) + λ_1 ||B||_1 + λ_2 ||Σ^{−1}||_1.    (7)
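The two factors inside the likelihood above can be evaluated directly. A minimal sketch, with hypothetical helper names of our choosing, computes the Bernoulli log-PMF and the Gaussian log-density of γ (the logit-normal density of Equation (4) adds only the Jacobian term):

```python
import numpy as np

def bernoulli_logpmf(y, theta):
    # log p(Y = y | theta) = sum_k [ y_k log theta_k + (1 - y_k) log(1 - theta_k) ]
    y, theta = np.asarray(y), np.asarray(theta)
    return float(np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta)))

def gaussian_logpdf(gamma, mean, Sigma):
    # log density of N(mean, Sigma) at gamma, as in Equation (3)
    gamma, mean, Sigma = map(np.asarray, (gamma, mean, Sigma))
    q = gamma.size
    diff = gamma - mean
    _, logdet = np.linalg.slogdet(Sigma)
    return float(-0.5 * (q * np.log(2 * np.pi) + logdet
                         + diff @ np.linalg.solve(Sigma, diff)))
```

For example, `bernoulli_logpmf([1, 0], [0.5, 0.5])` equals 2 log 0.5 ≈ −1.386, matching the PMF above.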

MONTE CARLO EM ALGORITHM FOR PARAMETER ESTIMATION
The paper (Wu et al., 2018) uses a Monte Carlo expectation maximization (MCEM) algorithm to estimate the model parameters B and Σ. The MBLN model utilizes a similar MCEM algorithm to approximate a numerical solution for the same two model parameters. We also adopt the same EBIC criterion to select the tuning parameters λ_1 and λ_2. In the E-step, MBLN uses the logit function as the link function rather than the log function used in that paper; therefore, the formulation and derivation differ when deriving the log-likelihood function.

Monte Carlo (MC) E-step
In iteration t + 1 of the MC E-step, in order to obtain the conditional probability distribution of θ_j = [θ_{j1}, ..., θ_{jk}, ..., θ_{jq}]^T, we use an m × q matrix Θ_j = [θ_j^(1), ..., θ_j^(m)]^T of samples from p(θ_j | Y = y_j, x_j; B^(t), Σ^(t)) to estimate the expected log-likelihood function,
where m is the number of Monte Carlo samples of θ_j. In this model, we sample γ_j from the multivariate normal distribution of Equation 2 rather than sampling θ_j from p(θ_j | Y = y_j, x_j; B^(t), Σ^(t)) directly, and then compute θ_j = 1/(1 + e^{−γ_j}) element-wise. By combining Equations 2 and 4, we obtain the joint distribution of (Y = y_j, θ_j^(τ)) given x_j, B^(t), Σ^(t) as the following equation.
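The sampling scheme just described — draw γ_j from the Gaussian of Equation 2 and transform to θ_j — can be sketched as follows. The dimensions, current parameter estimates, and the data point (x_j, y_j) are illustrative toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, m = 3, 2, 500                      # toy dimensions and MC sample size
B_t = rng.normal(size=(p, q))            # current estimate B^(t)
Sigma_t = np.eye(q)                      # current estimate Sigma^(t)
x_j = rng.normal(size=p)
y_j = np.array([1, 0])

# Draw m samples gamma_j ~ N(B^T x_j, Sigma) and map them through the
# logistic function; the m x q matrix Theta_j stacks the samples of theta_j.
Gamma_j = rng.multivariate_normal(B_t.T @ x_j, Sigma_t, size=m)
Theta_j = 1.0 / (1.0 + np.exp(-Gamma_j))

# Monte Carlo estimate of the expected Bernoulli log-likelihood for point j.
loglik = (y_j * np.log(Theta_j) + (1 - y_j) * np.log(1 - Theta_j)).sum(axis=1)
expected_ll = loglik.mean()
```

Sampling γ_j directly from the Gaussian, rather than θ_j via Metropolis-Hastings, means every draw is usable, which is the efficiency gain the M-step discussion below refers to.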

M-step: Maximize Approximate Penalized Expected Log-likelihood
The M-step in the MBLN model differs from the paper (Wu et al., 2018) in three aspects. First, the derivation is different because we use the logit link function. Second, this model uses a simpler implementation technique: we sample γ from a multivariate Gaussian distribution instead of employing the Metropolis-Hastings algorithm, which reduces the computational cost. Third, the input used to estimate B and Σ is different, which follows from the derivation under the distinct link function. Next, we explain these differences.
In iteration t + 1 of the MC M-step, we aim to maximize the joint probability of Equation 9. This is equivalent to minimizing the average negative log-likelihood Q in Equation 10.
Adding l_1 penalties on the two model parameters, the overall objective function in the M-step becomes Equation 11.
The two model parameters in Equation 11 can be solved for by searching for a minimal objective value in Equation 13.
The optimization problem in Equation 13 is not convex, but it has been solved by an iterative algorithm in the paper (Wu et al., 2018), which fixes either B^(t) or Σ^{−1} in each iteration and then solves for the other parameter alternately. We adopt a similar algorithm with different input.
With B fixed at B^(t), the optimization problem in Equation 13 turns into the convex optimization problem shown in Equation 14, and we can solve for (Σ^(t+1))^{−1} with the Graphical Lasso (Friedman et al., 2008).
The input to the Graphical Lasso is an empirical covariance matrix D.
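With scikit-learn — an implementation choice on our part, not prescribed by the paper — this Σ^{−1} update is a single call: feed the empirical covariance D of the sampled γ values to `graphical_lasso` with the penalty playing the role of λ_2. The toy data below stand in for the Monte Carlo samples:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(2)

# Stand-in for D: empirical covariance of centered gamma samples (toy data).
G = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=2000)
D = np.cov(G, rowvar=False)

# graphical_lasso returns (covariance, precision); precision is the sparse
# estimate of Sigma^{-1} under the l1 penalty alpha (the role of lambda_2).
cov_est, prec_est = graphical_lasso(D, alpha=0.1)
```

Larger `alpha` drives more off-diagonal entries of the precision matrix to zero, i.e., fewer inferred conditional dependencies between devices.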
When Σ^{−1} is fixed at (Σ^(t))^{−1}, we can estimate B^(t+1) from the convex optimization problem presented in Equation 15 by Lasso.
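For illustration, when Σ^{−1} is the identity the q columns of B decouple, and each is an ordinary l_1-penalized regression of the sampled γ values on a design matrix that stacks each x_j once per Monte Carlo sample. The toy data and `B_true` below are made up; the general case with a non-identity Σ^{−1} requires the transformed problem derived in the Appendix:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, m, p, q = 200, 10, 3, 2
B_true = np.array([[1.0,  0.0],
                   [0.0, -1.0],
                   [0.5,  0.5]])

# Stack each x_j m times (one copy per Monte Carlo sample): an (n*m) x p matrix.
X = rng.normal(size=(n, p))
A = np.repeat(X, m, axis=0)
Gamma = A @ B_true + rng.normal(scale=0.1, size=(n * m, q))  # sampled gammas

# With Sigma^{-1} = I, column k of B is a standard Lasso fit of Gamma[:, k] on A.
B_hat = np.column_stack(
    [Lasso(alpha=0.01, fit_intercept=False).fit(A, Gamma[:, k]).coef_
     for k in range(q)])
```

With a small penalty and ample samples, `B_hat` recovers `B_true` closely while the l_1 term keeps near-zero coefficients sparse.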
We write A as the block matrix A = [X_1^T, X_2^T, ..., X_n^T]^T, where X_j is an m × p matrix with each row equal to x_j, for j = 1, 2, ..., n. After a series of transformations shown in the Appendix, the objective function in Equation 15 becomes η(B) in Equation 16,
where B is the coefficient matrix estimated in iteration t and (Σ^(t+1))^{−1} is the latest value computed in iteration t + 1. Taking the first-order derivative of η(B) with respect to B and setting it to zero yields the update in Equation 17. All the detailed derivations are described in the Appendix. Algorithm 1 summarizes the MCEM algorithm. In the E-step, it computes the sum of the log-likelihood of the joint probability and the penalties on the two parameters. In the M-step, it alternately solves for B and Σ^{−1} with the other fixed at its value from the latest iteration. When the objective value converges, we take the coefficient matrix B and the inverse covariance matrix Σ^{−1} from the last iteration.

WATER TANK DATA STUDY
We use the simulated water tank system dataset (Khorasgani et al., 2019) to demonstrate and validate the performance of our method. This dataset describes a network of water tanks; each tank can be connected to several other tanks in the system. The measurements for each tank include 1) the tank's pressure, which represents the level of water in the tank, and 2) the tank refill mode, which equals 1 when an outside source is filling up the tank and 0 otherwise. A tank may start to leak at any point; when the operators fix a leakage, the tank returns to normal operation. The goal is to detect tank leakages. A leakage in any tank affects the pressure in that tank. However, the leakage is not the only factor affecting the tank's pressure: the flow rate between connected tanks and the refill flow from an outside source can also affect it, which makes the leak detection problem very challenging. There are 100,000 consecutive data points in total. Each tank has two distinct features, Tank Pressure and Tank Refill.
We run the MBLN model on a subset of five connected tanks, as shown in Figure 1. The physical structure of these five tanks is an undirected graph: T22 connects to T30 within a distance of 7.10, and T30 bridges to T36, T66, and T86 with distances of 5.25, 10.77, and 5.23, respectively. The smaller this distance is, the higher the influence of a tank on its neighbor. We aim to predict whether there are leaks in the two tanks T22 and T30. Figure 2 describes the data organization for the MBLN model on the water tank data. Instead of considering only a tank's own features, the MBLN model incorporates the features of each tank's one-hop neighbors. For instance, for tank T30, we add the features of T22, T36, T66, and T86 besides the features of T30 itself. To predict the leak status of the two tanks, the model uses all the features from the 5 tanks, i.e., 10 features in total. The responses are binary variables for T22 and T30 in parallel. The data points are ordered by time, and there are four combinations of leak and non-leak status for T22 and T30. For example, the first data point has the 10 features from T22, T30, T36, T66, and T86, and the responses are leaking for both T22 and T30. We split the dataset into two parts: the first 90% for training and the remaining 10% for testing. A data point is labeled as a leak point when either T22 or T30 is leaking at that time.
All leak data points for both T22 and T30 are included in the training and test datasets. In order to provide the models with balanced data, we downsample the non-leak data to the same size as the leak data. Therefore, there are 37,499 leak data points and an equal number of non-leak data points in the training dataset, and 3,830 of each in the test dataset. The MBLN model is then applied to predict the leak status of T22 and T30 simultaneously. Other models, such as gradient boosting, random forest, logistic regression, glmnet, and kNN, are used to predict the leak status of T22 and T30 as baselines.
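The class-balancing step described above is a plain downsample of the majority class. A minimal sketch on synthetic labels (label 1 standing in for "either T22 or T30 is leaking"):

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic stand-in labels: 1 = leak (either tank leaking), 0 = non-leak.
labels = (rng.uniform(size=1000) < 0.2).astype(int)

leak_idx = np.flatnonzero(labels == 1)
nonleak_idx = np.flatnonzero(labels == 0)

# Keep all leak points; downsample the non-leak points to the same count.
keep_nonleak = rng.choice(nonleak_idx, size=leak_idx.size, replace=False)
balanced_idx = np.sort(np.concatenate([leak_idx, keep_nonleak]))
balanced_labels = labels[balanced_idx]
```

The resulting index set has exactly as many non-leak as leak points, mirroring the 37,499/37,499 split in the training data.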
We compare the receiver operating characteristic (ROC) curves of these six models in Figure 3. The area under the ROC curve of the MBLN model is the largest, at 0.82. The other approaches, k-nearest neighbor, logistic regression, glmnet, random forest, and gradient boosting, have areas of 0.64, 0.70, 0.70, 0.72, and 0.79, respectively. MBLN performs best in terms of the ROC curve because its estimated parameters B and Σ reflect the correlation of the 5 tanks and contribute to the leak prediction for the two tanks T22 and T30.
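The AUC comparison itself is mechanical once predicted leak probabilities are in hand; a sketch with `roc_auc_score` on synthetic scores (not the paper's actual model outputs) contrasts an informative model with a random one:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=500)          # synthetic leak labels

# Hypothetical predicted leak probabilities: one informative model, one random.
scores_good = np.clip(0.15 + 0.7 * y_true + rng.normal(0, 0.2, size=500), 0, 1)
scores_rand = rng.uniform(size=500)

auc_good = roc_auc_score(y_true, scores_good)
auc_rand = roc_auc_score(y_true, scores_rand)
```

An uninformative scorer hovers near AUC 0.5, which is why the spread from 0.64 (kNN) to 0.82 (MBLN) in Figure 3 is a meaningful ranking.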

CONCLUSION
This paper proposes a multivariate Bernoulli logit-normal model for failure prediction over multiple devices. The insight is that, for devices that are connected and geographically close, there are correlations in the monitoring data collected from these devices, and these correlations can be used to predict failures. We conducted an experiment on a water tank dataset. The prediction results show that the MBLN model is superior to traditional approaches that model each device independently. The MBLN approach for failure prediction has several advantages. First, it models and predicts the failures of multiple devices in a single model, so it predicts failure probabilities for all devices simultaneously. Second, it can deal with count data, sensor data, or mixed data, as the input of MBLN is normalized before building the model. Last, it learns the correlation of features from different devices, which domain experts can use to gain insights and better understand the behavior of the devices.
The scope of this work consists of two main points. One is that MBLN is more effective when handling at least two devices.
If predicting for only one device, MBLN becomes similar to glmnet. The other is that the input data from the multiple devices should have some correlation; if there is little correlation in the input data, the advantages of the parameters B and Σ cannot be realized.
In future work, we will extend this work to failure prediction for multiple types of devices (i.e., devices with significantly different physical models). Additionally, the proposed MBLN model assumes that the input data are linearly correlated with each other; non-linear correlation will be studied in future work.

Figure 3. ROC Curve Comparison of Six Models.