Collaborative Prognostics for Machine Fleets Using a Novel Federated Baseline Learner

Difficulty in obtaining enough run-to-failure datasets is a major barrier to the widespread acceptance of Prognostics and Health Management (PHM) technology in many applications. Recent progress in federated learning shows great potential to overcome this difficulty because it allows PHM models to be trained on distributed databases without direct data sharing. This technology can therefore overcome local data scarcity by training the PHM model on multi-party databases. To demonstrate the ability of federated learning to enhance the robustness and reliability of PHM models, this paper proposes a novel federated Gaussian Mixture Model (GMM) algorithm to build universal baseline models from distributed databases. A systematic methodology to perform collaborative prognostics using the proposed federated GMM algorithm is further presented. Its usefulness and performance are validated on a simulated dataset and the NASA Turbofan Engine Dataset. The proposed federated approach with parameter sharing is shown to perform on par with the traditional approach with data sharing. The proposed model further demonstrates that predictions made collaboratively, while keeping the data private, are more robust than local predictions. Federated collaborative learning can serve as a catalyst for the adoption of business models based on the servitization of assets in the era of Industry 4.0. The methodology facilitates effective learning of asset health conditions for data-scarce organizations by collaborating with other organizations while preserving data privacy. This is most suitable for a servitization model for Original Equipment Manufacturers who sell to multiple organizations.


INTRODUCTION
The advent of technology paradigms such as Artificial Intelligence (AI), Internet of Things (IoT), Cyber-physical Systems (CPS), and initiatives like Industry 4.0 point to a future where intelligence is endowed to every value-adding entity across enterprises. Pervasive sensor networks with cheaper and more convenient computational capabilities (Lee, Davari, Singh & Pandhare, 2018) form the foundational infrastructure for these paradigms. Building on this foundation, the real fruits of sustainable and synergistic profit can only be reaped through collaboration between these entities. One of the most significant outcomes of collaborative operation is servitization (Palau, Dhada, Bakliwal & Parlikad, 2019; Baines, Lightfoot, Evans, Neely, Greenough, Peppard, Roy, Shehab, Braganza, Tiwari & Alcock, 2007): selling services instead of selling the assets that provide the service. In the servitization model, the client does not pay to own the asset but pays for the right to use it. In such a scenario, upkeep of the assets is even more important because downtime translates directly into lost revenue. However, developing a prognostics model can be challenging in practice for multiple reasons. Despite the availability of sensors, the frequency of "event" data per asset that represents an equipment failure can be low due to the high reliability of machines. Additionally, due to operational variation, one machine cannot exhibit all degradation patterns in a lifetime. Preparing comprehensive datasets within a single enterprise is highly impractical given the range of assets and operational conditions per enterprise. Developing models from experimentation is not only costly and time-consuming but can also lack consistency with field behavior. Thus, collaboration among industrial entities for developing reliable and robust prognosis models can unlock huge potential for improved maintenance decisions.
There are two major challenges associated with this. First, given the volume, velocity, and variety of the data generated, transferring data is not efficient. Advancements in cloud-based solutions only address the issue of centralized storage through developing architectures such as edge computing and fog computing (Yi, Li, & Li, 2015). Secondly, the transfer of data can violate data-ownership laws as well as raise privacy concerns for collaborating entities. This is especially true for Original Equipment Manufacturers (OEMs) who sell to multiple clients. OEMs may not be able to harvest the data while the equipment is in use at the client site, as the data is considered proprietary. McKinsey & Company reports that data ownership is a major barrier for manufacturing companies toward collaborating, even with third-party providers such as scientists (McKinsey, 2016). Beyond OEMs, governments across the world are strengthening laws to protect data privacy (Regulation (EU), 2016). This issue of data privacy is ironic, as it blocks the very path of collaboration on which the success of Industry 4.0 depends. Thus, new learning schemes need to be developed that can unlock the potential of distributed private data for collaborative prognostics.
Multiple methods have emerged in recent times to address specific problems that accompany the increasing ubiquity of data and learning from it. Batch learning was developed to solve the problem of inefficient model training due to memory exhaustion while training on large datasets (Bisong, 2019). Transfer learning was developed to avoid needing large datasets and retraining models from scratch to learn similar and related patterns (Pan & Yang, 2009). More recently, federated learning, proposed by Google, offers a radical solution to the bottleneck of learning from datasets that are distributed over multiple systems while keeping the local datasets private (Yang, Liu, Chen & Tong, 2019). The novelty lies in training a global machine learning model only via communication of parameters and parameter updates between the systems. While most of the research in federated learning has focused on user-centric applications (Yang, Liu, Chen & Tong, 2019), such as text prediction using Natural Language Processing or retail recommendations based on estimated user profiles, industrial applications remain largely untapped. Just as Industrial Artificial Intelligence (AI) poses its own set of challenges compared to general-purpose AI, federated learning brings its own challenges when adapted to industrial problems. Thus, a novel methodology for performing collaborative prognostics is proposed, which includes a new algorithm called Federated Expectation Maximization for privacy-preserving model aggregation.
The remainder of the article is organized as follows. Section 2 presents a literature review. Section 3 introduces the proposed federated Expectation Maximization algorithm for training multivariate Gaussian Mixture Model (GMM). Section 4 illustrates the application of the proposed algorithm using two case studies for parameter estimation and collaborative fault prognosis. Section 5 concludes the article.

LITERATURE REVIEW
Collaborative learning has recently gained significant attention, although it has long been considered under various self-organizational architectures such as Multi-agent Systems and Holonic Systems. In the area of Prognostics and Health Management (PHM), Palau et al. (2019) propose a multi-agent system for real-time distributed collaborative prognostics. They show that distributed collaborative prognostics is advantageous in scenarios with significant fleet heterogeneity, limited computing capability, and faulty measurements. Information is shared between assets assigned to groups formed using a Friendship Matrix, and predictions are made using Weibull-Time-To-Event Recurrent Neural Networks (WTTE-RNN) trained on run-to-failure trajectories. Additionally, Lin, Liu, Byon, Qian, Liu & Huang (2017) propose a collaborative learning framework for estimating many individualized regression models in a heterogeneous population of run-to-failure trajectories. However, depending on the degradation trajectory of a system, it may not always be optimal to capture the behavior in a single failure model (Lei, Li, Guo, Li, Yan & Lin, 2018). Thus, the prediction is often preceded by fault detection or health-stage separation. Fault detection involves assessing the health condition of equipment to determine whether an event can be considered an occurrence of a fault (Lapira, 2012). It is difficult and unnecessary to predict the remaining useful life in the healthy stage, as it contains no information about the degradation trend. Thus, fault detection becomes a critical first step in performing effective prognostics. Most works on collaborative fault detection approach the problem through similarity-based clustering or peer-to-peer comparison for a fleet of assets.
Zhao, Li, Lu, Lv, Gu & Shang (2020) implemented a fault detection model using collaborative filtering techniques for detecting incipient faults in large-scale solar farms by sharing current data among photovoltaic systems. Maroli, Özgüner & Redmill (2019) propose a collaborative fault detection framework for large-scale vehicle networks using an echo state network. Ng & Srinivasan (2010) developed a multi-agent-based collaborative fault detection and identification system for chemical process plants that combines various heterogeneous methods to maximize performance using information fusion. These approaches exploit data sharing to derive inferences, based on the assumption that the fleet belongs to a single plant or organization where data privacy is not a concern. For collaborative learning across organizations where data sharing is not possible, privacy-preserving learning frameworks such as federated learning can be implemented.
The development of approaches for federated learning has been largely neural network-based. With the availability of data increasing significantly in recent times, deep learning has been used extensively for wide-ranging applications, including PHM (Rosero, Silva & Ribeiro, 2020). Neural networks provide a convenient option for training in a federated setting. Weights and biases of the global network can be initialized in the parameter server and then communicated to the clients. At each client node, the gradient can be evaluated from the local dataset based on the present parameters of the network, and the gradient can be shared back to the server, where it can be aggregated using algorithms like federated averaging (Konečný, McMahan, Yu, Richtárik, Suresh & Bacon, 2016; Dhada, Parlikad & Salvador, 2020). Over multiple rounds of communication, the global model can be trained while keeping the data private. Numerous improvements have been made to this architecture to address issues such as gradient quantization (Jin, Huang, He, Dai & Wu, 2020), expensive communication (Li, Sahu, Talwalkar, & Smith, 2020), system heterogeneity, efficient gradient aggregation, and especially security concerns through homomorphic encryption, differential privacy, etc. However, almost all of the methods discussed in the federated learning literature are developed for supervised learning in data-abundant scenarios, i.e., they assume the availability of large datasets with corresponding labels, which is not always possible. While statistical machine learning methods are highly capable of capturing knowledge from limited data using an unsupervised approach, they remain largely unexplored for federated implementations. Thus, there is a lack of a systematic analytical methodology to build prognostic models from decentralized databases that addresses the limitations of data privacy as well as local label unavailability.
A systematic federated approach is proposed to facilitate collaborative prognostics.
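The server-client gradient exchange described above can be illustrated with a minimal sketch. It assumes a toy linear model trained with squared error and a single gradient step per round; the function names (`local_gradient`, `aggregate`, `train`) and the learning rate are illustrative choices, not part of any cited method.

```python
def local_gradient(params, data):
    """Client side: gradient of the mean squared error on local (x, y) pairs."""
    w, b = params
    n = len(data)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in data) / n
    return (gw, gb), n  # only the gradient and the sample count leave the client


def aggregate(updates):
    """Server side: average client gradients weighted by local sample counts."""
    total = sum(n for _, n in updates)
    gw = sum(g[0] * n for g, n in updates) / total
    gb = sum(g[1] * n for g, n in updates) / total
    return gw, gb


def train(clients, rounds=1000, lr=0.05):
    """Repeat: broadcast parameters, collect gradients, update the global model."""
    w, b = 0.0, 0.0
    for _ in range(rounds):
        updates = [local_gradient((w, b), data) for data in clients]
        gw, gb = aggregate(updates)
        w, b = w - lr * gw, b - lr * gb
    return w, b


# Two clients whose private data follow y = 2x + 1
clients = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]]
w, b = train(clients)
```

With one gradient step per round, the weighted gradient average matches what a centralized gradient step on the pooled data would compute, which is why the global model can be trained without the raw samples ever leaving the clients.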

PROPOSED METHOD
Mixture models, especially the Gaussian Mixture Model (GMM), are widely used statistical methods that serve as effective universal approximators. They find use in several applications (Pimentel, Clifton, Clifton & Tarassenko, 2014; Diaz-Rozo, Bielza & Larrañaga, 2020) such as density estimation, clustering, association rules, outlier detection, latent factors, ranking, and even data visualization. Given its wide use, effective training of GMMs is a continuously evolving area (Jin, Zhang, Balakrishnan, Wainwright & Jordan, 2016; Kurban, Jenne, & Dalkilic, 2017), with Expectation Maximization (EM) being one of the popular methods (Ververidis & Kotropoulos, 2008; Balakrishnan, Wainwright & Yu, 2017; Zhao, Li & Sun, 2020). Thus, federated expectation maximization for the training of GMMs is proposed to build a universal baseline model for collaborative prognostics. This forms the first novelty of the work. Additionally, a systematic methodology to perform collaborative prognostics based on the proposed algorithm is presented. This forms the second novelty of the work. The improvement in performance and robustness using the proposed approach is validated on the NASA Turbofan Engine Dataset (Saxena & Goebel, 2008).

Gaussian Mixture Model and EM Algorithm
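Given data $D = \{x_i\}_{i=1,\dots,N}$, the GMM has the form
$$p(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k), \qquad (1)$$
where $K$ denotes the number of components or clusters. $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1,\dots,K}$ are the unknown parameters to be estimated by the Expectation Maximization (EM) algorithm, which include the mixing weights $\pi_k \in [0,1]$ with $\sum_{k=1}^{K} \pi_k = 1$, the mean $\mu_k$, and the covariance $\Sigma_k$ of the $k$-th component. The membership of point $x_i$ in cluster $k$ is described by the posterior probability
$$r_{ik} = p(z_i = k \mid x_i, \theta) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}, \qquad (2)$$
where $z_i \in \{1, \dots, K\}$ is the latent state and $z = \{z_i\}_{i=1,\dots,N}$ forms the latent vector.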
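The parameters $\theta$ and the latent vector $z$ of the GMM are estimated by the Expectation Maximization (EM) algorithm, which maximizes the lower bound of the Complete Data Log Likelihood (CDLL) by alternating between the E-step and the M-step. The CDLL has the form
$$\ell_c(\theta) = \sum_{i=1}^{N} \log p(x_i, z_i \mid \theta). \qquad (3)$$
Eq. (3) is difficult to compute since $z$ is unknown. Therefore, EM optimizes the Expected Complete data Log Likelihood (ECLL) instead, which describes the lower bound of CDLL. The E-step and M-step of the EM algorithm can be written as
$$Q(\theta, \theta^{(t-1)}) = \mathbb{E}\left[\ell_c(\theta) \mid D, \theta^{(t-1)}\right], \qquad (4)$$
$$\theta^{(t)} = \arg\max_{\theta} Q(\theta, \theta^{(t-1)}), \qquad (5)$$
where $\theta^{(t-1)}$ is the estimate from the previous step. Based on this discussion, the algorithm for parameter estimation of the GMM is detailed in Fig. 1.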
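The alternation between the E-step and the M-step can be sketched in a few lines of NumPy. This is a minimal illustration of standard (centralized) EM for a multivariate GMM with full covariances, not a reproduction of Fig. 1; the initialization (equal weights, identity covariances, user-supplied initial means) and the fixed iteration count are assumptions made for brevity.

```python
import numpy as np


def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm


def e_step(X, weights, means, covs):
    """Posterior membership r[i, k] of sample i in component k."""
    dens = np.column_stack([w * gaussian_pdf(X, m, c)
                            for w, m, c in zip(weights, means, covs)])
    return dens / dens.sum(axis=1, keepdims=True)


def m_step(X, r):
    """Closed-form update of mixing weights, means and covariances."""
    nk = r.sum(axis=0)                      # effective sample count per component
    weights = nk / X.shape[0]
    means = (r.T @ X) / nk[:, None]
    covs = []
    for k in range(r.shape[1]):
        diff = X - means[k]
        covs.append((r[:, k, None] * diff).T @ diff / nk[k])
    return weights, means, np.array(covs)


def fit_gmm(X, means0, n_iter=50):
    """Run a fixed number of EM iterations from the given initial means."""
    K, d = means0.shape
    weights = np.full(K, 1.0 / K)
    means = means0.copy()
    covs = np.array([np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        r = e_step(X, weights, means, covs)
        weights, means, covs = m_step(X, r)
    return weights, means, covs
```

A production implementation would typically work with log-densities and monitor the log-likelihood for convergence rather than run a fixed number of iterations.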

Federated GMM for universal baseline modeling
The proposed method builds the GMM from decentralized databases as shown in Fig. 2. In the setting of machine PHM, the local client servers in Fig. 2 represent the databases at different organizations, manufacturing plants, corporations, etc. Normally, cross-organization data sharing is not favored due to security concerns, administrative reasons, and many others. To overcome the barrier of data sharing, this study proposes a novel algorithm called Federated Expectation Maximization to develop a Gaussian Mixture Model without sharing data across organizations. The execution steps of the proposed algorithm are outlined in Fig. 3. More broadly, the proposed modifications to the GMM can be easily extended to any other EM-based algorithm and to most MLE-based algorithms for parameter estimation. Based on the information flow outlined in Algorithm 2, secure data exchange between the global server and local clients can be promoted by homomorphic encryption, which allows computation on the encrypted text directly at the global server. Adding homomorphic encryption to the proposed algorithm will only affect the computational efficiency. The discussion of homomorphic encryption is left for future work. The goal of this study is only to demonstrate and validate the performance of the proposed algorithm.
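One way such a parameter-only exchange could work is sketched below. It assumes that each client runs the E-step locally and returns responsibility-weighted sums (sufficient statistics) to the server, which then performs the M-step on the aggregated sums; the exact steps of the proposed algorithm in Fig. 3 may differ, and all function names here are illustrative.

```python
import numpy as np


def responsibilities(X, weights, means, covs):
    """Local E-step: posterior membership of each local sample in each component."""
    n, d = X.shape
    dens = np.empty((n, len(weights)))
    for k, (w, m, c) in enumerate(zip(weights, means, covs)):
        diff = X - m
        maha = np.sum(diff * np.linalg.solve(c, diff.T).T, axis=1)
        norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(c))
        dens[:, k] = w * np.exp(-0.5 * maha) / norm
    return dens / dens.sum(axis=1, keepdims=True)


def client_statistics(X, weights, means, covs):
    """Runs on a client: only these aggregated sums leave the site, never X."""
    r = responsibilities(X, weights, means, covs)
    s0 = r.sum(axis=0)                              # (K,)      sum_i r_ik
    s1 = r.T @ X                                    # (K, d)    sum_i r_ik x_i
    s2 = np.einsum("ik,ia,ib->kab", r, X, X)        # (K, d, d) sum_i r_ik x_i x_i^T
    return X.shape[0], s0, s1, s2


def server_update(stats):
    """Runs on the server: global M-step from the summed client statistics."""
    n = sum(s[0] for s in stats)
    s0 = sum(s[1] for s in stats)
    s1 = sum(s[2] for s in stats)
    s2 = sum(s[3] for s in stats)
    weights = s0 / n
    means = s1 / s0[:, None]
    covs = s2 / s0[:, None, None] - np.einsum("ka,kb->kab", means, means)
    return weights, means, covs


def federated_em(client_data, means0, n_rounds=50):
    """Alternate local E-steps and a global M-step for a fixed number of rounds."""
    K, d = means0.shape
    weights = np.full(K, 1.0 / K)
    means = means0.copy()
    covs = np.array([np.eye(d) for _ in range(K)])
    for _ in range(n_rounds):
        stats = [client_statistics(X, weights, means, covs) for X in client_data]
        weights, means, covs = server_update(stats)
    return weights, means, covs
```

Because the M-step of a GMM depends on the data only through these sums, summing them across clients gives the same update as running EM on the pooled data, up to floating-point rounding. Under this assumption, that is the sense in which parameter sharing can match data sharing.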

Proposed methodology for failure prognosis
The proposed methodology is presented in Fig. 4. The training set consists of clients (or units) with run-to-failure data, and the testing set can be units with any amount of data available. The data for each unit can be distributed across organizations, plants, etc. in such a way that data sharing is not allowed. Federated training of the global baseline GMM in the global server is performed by exchanging model parameters with each client using the local baseline data, as described in the proposed algorithm in Fig. 3. The deviation metric representing the health of the unit under the trained Gaussian mixture model is the weighted squared Mahalanobis Distance (MD). The deviation metric is evaluated locally for each client. MD is defined by (6) in terms of the sensor measurements for a sample and the parameters (mixing weights, means, and covariances) of the trained global baseline model.
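A hedged sketch of how a weighted squared Mahalanobis distance could be evaluated locally is given below. The weighting scheme is an assumption: here each component's squared distance is weighted by the posterior membership of the sample in that component, which may not be the exact weighting used in (6). The function names are illustrative.

```python
import numpy as np


def squared_mahalanobis(x, mean, cov):
    """Squared Mahalanobis distance of one sample to one Gaussian component."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(diff @ np.linalg.solve(cov, diff))


def deviation_metric(x, weights, means, covs):
    """Weighted sum of squared Mahalanobis distances over all components.

    Assumption: the weights are the posterior memberships of x under the
    trained global model (computed in log space for numerical stability).
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    md2 = np.array([squared_mahalanobis(x, m, c) for m, c in zip(means, covs)])
    log_det = np.array([np.linalg.slogdet(c)[1] for c in covs])
    log_post = np.log(weights) - 0.5 * (md2 + log_det + d * np.log(2.0 * np.pi))
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()
    return float(post @ md2)
```

With a single component the metric reduces to the ordinary squared Mahalanobis distance; with several components, a sample close to any one baseline cluster receives a small score, and a sample far from all clusters receives a large one.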
The Mahalanobis distance obtained from one Gaussian component follows a distribution (Ververidis & Kotropoulos, 2008), as shown in (7). Here, the Mahalanobis distance is treated as a random variable, and the distribution depends on the number of samples used for training and on the dimensionality of the data, which must be smaller than the number of training samples. It is important to note that, in this case, because information is used from all clients, the number of training samples is counted across all clients.
Fig. 1 Algorithm 1: EM for GMM (initialize the iteration counter to 0 and assign the initial parameters; perform the E-step and compute the membership; perform the M-step and update the parameters from the previous iteration).
The cumulative distribution function is given by (8), which is expressed through the incomplete beta function.
Thus, the threshold for fault detection is derived from the limit described by (9). To adapt the threshold for a deviation score derived from multiple Gaussian components, a scaling parameter is incorporated. This parameter is given by (10), which uses the parameters of the local Gaussian distribution of the unit.
The failure threshold established in (10) is used to detect incipient failures. Once the deviation metric in (6) exceeds the failure threshold, an incipient failure is detected and the RUL is predicted. To simplify the discussion, the RUL is predicted as the mean time to failure (from the incipient failure detection point to the end of life) obtained from the training units. For each training unit, the true remaining useful life (RUL) is calculated at the fault detection time (FDT), which is the first instance at which RUL is predicted in a typical prognosis task. The RULs at FDT for each training unit are shared with the global server, where a Weibull distribution is fitted to them. The mean time to failure is then evaluated from the fitted distribution and used as the predicted time to failure at the first instance of fault detection.
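The Weibull fit and the resulting mean time to failure can be sketched with the Python standard library alone. The fitting procedure shown (maximum likelihood with bisection on the shape parameter) is an assumption; the text does not state which estimator is used, and the function names and search bounds are illustrative.

```python
import math


def fit_weibull(samples, lo=0.05, hi=50.0, tol=1e-9):
    """Maximum-likelihood Weibull shape and scale for positive samples.

    The shape solves a monotone score equation, found here by bisection.
    Samples are rescaled by their maximum to avoid overflow in t ** k.
    """
    t_max = max(samples)
    scaled = [t / t_max for t in samples]
    logs = [math.log(t) for t in scaled]
    mean_log = sum(logs) / len(logs)

    def score(k):
        powered = [t ** k for t in scaled]
        weighted = sum(p * l for p, l in zip(powered, logs))
        return weighted / sum(powered) - 1.0 / k - mean_log

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = t_max * (sum(t ** shape for t in scaled) / len(scaled)) ** (1.0 / shape)
    return shape, scale


def mean_time_to_failure(shape, scale):
    """Mean of a Weibull distribution: scale * Gamma(1 + 1 / shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)
```

In this reading, the server would call `fit_weibull` on the RULs at FDT reported by the training units and broadcast `mean_time_to_failure(shape, scale)` as the prediction to use at the moment a fault is first detected.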
The baseline samples from the testing units also participate in the federated training of the global baseline model. The deviation metric and the fault detection threshold are obtained locally for each test unit. Every new incoming sample is monitored until the deviation metric crosses the fault detection threshold, at which point the remaining useful life of the test unit is predicted.
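The monitoring logic for a test unit can be summarized in a short sketch. It assumes the deviation scores and the threshold have already been computed locally, and that the predicted RUL is simply the fleet-level mean time to failure; the function names are hypothetical.

```python
def first_detection_index(deviations, threshold):
    """Index of the first sample whose deviation exceeds the threshold, or None."""
    for i, value in enumerate(deviations):
        if value > threshold:
            return i
    return None


def predict_rul_at_detection(deviations, threshold, mean_time_to_failure):
    """Return (detection index, predicted RUL), or None if no fault is detected."""
    idx = first_detection_index(deviations, threshold)
    if idx is None:
        return None
    return idx, mean_time_to_failure


# Example: the third sample (index 2) is the first to cross the threshold
result = predict_rul_at_detection([0.4, 0.9, 3.2, 5.0], threshold=2.5,
                                  mean_time_to_failure=60.0)
```

A practical detector might require several consecutive exceedances before raising an alarm in order to suppress noise; the single-crossing rule here follows the description above.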

CASE STUDIES
Two case studies are presented to showcase the performance of the proposed method. The first case study compares the performance of fitting a Gaussian mixture model in a federated setting with that of the traditional setting on a simulated dataset. The second case study highlights the performance of the proposed method for remaining useful life prediction on the Turbofan Engine Dataset.

Case Study 1: Simulative Study
Two simulated datasets from previous studies (Ververidis & Kotropoulos, 2008) are used to illustrate the clustering capability of the proposed federated expectation maximization algorithm. Dataset A is composed of eight partially overlapping distributions/clusters, with an equal number of i.i.d. samples from each distribution belonging to each client respectively. Dataset B is a set of three well-separated distributions/clusters, with samples from each distribution belonging to each of the three clients. Table 1 describes Datasets A and B, where μ and Σ represent the true mean and true covariance of the distribution from which the datasets are generated. The table also lists the data samples for the respective cluster.
Two approaches are considered for parameter estimation on the simulated datasets: i) the proposed federated expectation maximization, where no data is shared among the clients; and ii) traditional expectation maximization, or the centralized approach, where data from the clients is aggregated in a central server for parameter estimation. Fig. 5(a) shows that the minimum Akaike Information Criterion (AIC) value is achieved for 8 components, which matches the true number of clusters for Dataset A. Fig. 5(b) shows the probability density of the components over the dataset. The proposed federated approach performs on par with the traditional approach, as can be observed by comparing the estimates in Table 2. Table 2 lists the mean and covariance estimated using federated expectation maximization alongside the same parameters estimated from the centralized approach for Dataset A.

Fig. 4 Proposed Methodology using Federated Global Baseline Modeling for Failure Prognosis
Similarly, Fig. 5(c) shows the performance of the proposed method for Dataset B. In Dataset B, each client has samples from each of the three distributions. This can be compared to a case where each distribution represents a particular asset operating under different conditions with multiple clients. The data from each cluster is randomly distributed across the clients, which represents the case of non-uniform availability of data across clients. Fig. 5(c) and Fig. 5(d) show that the optimal number of components or clusters for Dataset B is correctly estimated as three, with the presented probability density spread. Table 3 shows the estimated parameters of the clusters for Dataset B using both approaches and illustrates that the estimation by the federated EM approach matches the estimation by the traditional EM approach. With this global model trained in a federated way, every client organization has access to more knowledge of the data distribution irrespective of the local availability of data, while preserving data privacy. Data-poor clients can thus benefit from learned models without having to wait for enough data to be collected.
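The AIC-based choice of the number of components reported in Fig. 5 can be illustrated as follows. The sketch assumes full covariance matrices when counting free parameters and takes the maximized log-likelihood of each candidate model as given; the function names and the example log-likelihood values are hypothetical.

```python
def gmm_param_count(n_components, dim):
    """Free parameters of a GMM with full covariance matrices."""
    weights = n_components - 1                       # weights sum to one
    means = n_components * dim
    covariances = n_components * dim * (dim + 1) // 2
    return weights + means + covariances


def aic(log_likelihood, n_components, dim):
    """Akaike Information Criterion: 2 * (number of parameters) - 2 * log-likelihood."""
    return 2 * gmm_param_count(n_components, dim) - 2.0 * log_likelihood


def select_n_components(log_likelihoods, dim):
    """Pick the candidate K with the smallest AIC.

    log_likelihoods maps each candidate K to its maximized log-likelihood.
    """
    return min(log_likelihoods, key=lambda k: aic(log_likelihoods[k], k, dim))


# Hypothetical log-likelihoods for K = 2, 3, 4 on two-dimensional data
best_k = select_n_components({2: -1000.0, 3: -900.0, 4: -899.0}, dim=2)
```

In the example, moving from three to four components improves the log-likelihood by only one unit while adding six parameters, so the penalty term makes three components the preferred model.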

Data and Problem Description
An implementation of the proposed algorithm is shown for collaborative fault prognosis using the Turbofan Engine Degradation Dataset (Saxena & Goebel, 2008). The data is generated using the MATLAB-based software C-MAPSS, which is designed to simulate the behavior of commercial turbofan engines and is widely used in the literature. The turbofan engines simulated by C-MAPSS are formed by several inter-dependent sub-systems, resembling the mechanisms typically present in industrial machinery. One of the sub-systems included is a limiter, which prevents machines from exceeding pre-set tolerances. Simulation parameters include environmental, control, and failure parameters, including a set of health-parameter inputs that are designed to simulate deterioration and fault. Time-series variables that represent parameters such as fan speed, temperatures at various locations in the system, engine pressure ratio, etc. are recorded over the life of the engine. A comprehensive diagram and description of the turbine simulated by C-MAPSS, along with the list of variables recorded, can be found in (Saxena & Goebel, 2008). The dataset used in this work considers operation at sea level and degradation of the High-Pressure Compressor (HPC), and consists of 100 multivariate run-to-failure time series. Each time series represents a different engine unit. Each engine starts with a different degree of initial wear and manufacturing variation, which is unknown to the user. This wear and variation are considered normal, i.e., they are not considered a fault condition. The recorded data is contaminated with sensor noise. Measurements from seven sensors (2, 3, 4, 7, 11, 12, and 15) are used in the analysis based on past literature (Wang, Yu, Siegel & Lee, 2008). The engine is operating normally at the start of each time series and develops a fault at some unknown point during operation. The fault grows in magnitude until system failure, and the number of operational cycles until failure is recorded.
The problem addressed in this case study is to predict the remaining useful life of the engine units at the point of fault detection. The methodology proposed in Fig. 4 is followed for this case study. Two experimental settings are used for evaluation. In Experiment Setting I, the entire dataset is divided into training and testing units, where the failure time of the training units is known and the failure time of the testing units is unknown. The training units are used to estimate the mean time to failure from the point of first fault detection, and the accuracy of the prediction is evaluated on the testing units using root mean square error (RMSE). An iteration consists of using 75% of the total units, randomly selected, for training and the remaining 25% for testing, with four-fold cross-validation. Each iteration is repeated ten times. In Experiment Setting II, all 100 units are used as training units to estimate the bounds of the mean time to failure. The difference between the upper and lower bounds of the estimated mean time to failure is used as the performance metric.
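The evaluation protocol of Experiment Setting I can be sketched as below. The shuffling scheme, the seed, and the function names are assumptions for illustration; only the 75%/25% four-fold structure and the RMSE metric come from the description above.

```python
import math
import random


def four_fold_splits(unit_ids, seed=0):
    """Yield four (train, test) splits; each fold holds out 25% of the units."""
    ids = list(unit_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::4] for i in range(4)]
    for i in range(4):
        test = folds[i]
        train = [u for j, fold in enumerate(folds) if j != i for u in fold]
        yield train, test


def rmse(predicted, actual):
    """Root mean square error between predicted and true RUL values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# 100 engine units: each split has 75 training units and 25 testing units
splits = list(four_fold_splits(range(1, 101)))
```

Repeating the procedure with ten different seeds would correspond to the ten repetitions of each iteration mentioned in the text.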

Performance Benchmarking
The performance of the proposed federated learning approach is compared with the following traditional approaches. The fault prognosis architecture remains the same for each of the following benchmarking methods except for changes in the data sharing strategy, baseline modeling techniques, and fault detection metrics. All methods labeled Stand-Alone (SA) use only the local data to train a local baseline model, representing the scenario where no data sharing is allowed. In these methods, each unit has its own local baseline model instead of a global baseline model.

5. Proposed: This is the proposed method that uses federated training of a global Gaussian mixture model for fault detection as described in Fig. 4.
6. Ideal: This method trains a global Gaussian mixture model by allowing sharing of data. It represents the traditional method for model training and signifies the best possible scenario for fault detection using a Gaussian mixture model.

Results and Discussion
The root mean square error of all six methods for 100 units using Experiment Setting I is shown in Fig. 6 as a box plot. The number of baseline samples per unit is set to 10. The proposed and the ideal methods perform significantly better than any of the stand-alone approaches or the case where no baseline model is considered. This highlights the substantial benefit of performing collaborative prognosis. These results represent a scenario where each individual unit may belong to a different client organization or user, and data sharing across organizations may not be allowed. Using the proposed federated method for collaborative fault prognosis achieves performance as good as performing fault prognosis traditionally using data sharing. With the proposed method, both the mean and the spread of the RMSE are better than those of the SA approaches and on par with the ideal scenario.
Since the performance of stand-alone methods can be affected by the number of baseline samples available for model training, Fig. 7 shows the average RMSE of each method as the number of baseline samples per unit is varied. The model with no baseline is unaffected, as it does not use a baseline model for fault detection. The performance of the stand-alone methods improves as the number of baseline samples per unit increases, but only up to a certain level. Beyond about 70 baseline samples per unit, the performance either worsens or remains unaffected for every stand-alone method. However, the proposed approach remains consistently better and is unaffected by the local change in the number of baseline samples. This again highlights the benefit of collaborative learning, where the shortcomings of one organization can be overcome by the strengths of other organizations. This advantage is most significant for units with little baseline data. Even for units with a large amount of baseline data, the proposed approach remains better than the stand-alone approaches and on par with the ideal scenario.
For effective collaboration in fault prognosis, the number of collaborating organizations or collaborating users plays an important role. Moreover, there can be scenarios where data sharing is possible among units that belong to one organization. For organizations with a huge fleet and the possibility of data sharing, using federated learning may not always be the solution for improved performance. But for an organization with one or few units, as is the case in Fig. 6, collaboration can be valuable. Fig. 8 and Fig. 9 present the performance of the proposed method for a decreasing number of collaborating organizations and use Experiment Setting II. The 100 units are distributed equally over the organizations. For example, for 100 collaborating organizations, each organization has 1 unit, whereas for 2 collaborating organizations, each organization has 50 units. The case of only 1 organization represents the ideal scenario where data sharing across all units is allowed. For this analysis, the stand-alone notation is updated to local notation, as data sharing within the organization is allowed.
The number of baseline samples per unit for the results reported in Fig. 8 is 10 and in Fig. 9 is 20. From both figures, four major inferences can be made. First, the performance of the proposed method is consistent and almost on par with the ideal method irrespective of the number of collaborating organizations. This further validates the competency of the proposed method, which shows minimal deviation from the ideal scenario even with data barriers for privacy protection in place across units. Secondly, the benefit of federated collaborative learning depends on the number of collaborating organizations. As the number of collaborating organizations decreases and the number of units within an organization increases, it becomes less favorable to collaborate. Thirdly, by comparing Fig. 8 and Fig. 9, for more than 10 collaborating organizations, the RMSE value decreases significantly as the number of baseline samples increases. However, the performance of local models becomes better than the collaborative model only for more than 10 collaborating organizations. Thus, collaboration is most beneficial for units with fewer baseline samples. Finally, the performance of the Gaussian mixture model can be affected by an increase in the size of the data. On the other hand, the performance of SOM and OCSVM follows a consistent trend of reduction in RMSE as the number of collaborating organizations decreases. This happens because the tendency of the expectation maximization algorithm to converge to a local optimum increases with data size. Advanced versions of Gaussian mixture model training such as (Balakrishnan, Wainwright & Yu, 2017; Zhao, Li & Sun, 2020) could be further developed for federated settings to address this issue, which forms the future scope of this work.

CONCLUSION
A systematic analytical methodology to build collaborative prognostic models from decentralized databases, addressing the limitations of data privacy and local label scarcity, is proposed. The methodology facilitates effective learning of asset health conditions for data-scarce organizations by collaborating with other organizations while preserving data privacy. Collaboration is more beneficial when data for common assets is sparsely distributed across organizations. This is most suitable for a servitization model for Original Equipment Manufacturers who sell to multiple organizations.
While the presented work assumes the same type of fault (labels) in the distributed datasets for prognosis, future work can consider the development of methods addressing variations in fault types, machine operating conditions, etc. Such variations are more likely to occur in real-life scenarios, as model aggregation happens across different organizations. They make structured datasets with the same labels even more scarce and distributed, leading to the overall data distribution becoming non-i.i.d. For such situations, federated learning can prove even more valuable as advanced algorithms are developed to train models from non-i.i.d. data distributions under this framework.