Multi-Kernel-based Adaptive Support Vector Machine for Scalable Predictive Maintenance

Applying data-driven solutions across an industry is challenging, since the data are often stored locally, and increasing privacy and security concerns restrict access to them. Because it is highly unlikely that all potential data patterns are captured in a single data source, machine learning (ML) models developed from a single source cannot be sufficiently robust. An alternative is to train a local ML model at each source and then, at a central location, combine all the local models to generate a global model. In this work, we develop a proof of concept of a distributed machine learning model, federated transfer learning, using a multi-kernel-based adaptive support vector machine. For federated learning, the multi-kernel approach enables feature-specific model aggregation under data heterogeneity, whereas for transfer learning the adaptive model enables utilization of an aggregated model from a different task. The proposed approach is validated using nuclear power plant vertical motor-driven pump data to predict the health condition of the pumps as an anomaly-detection task. The efficiency of the proposed approach is also quantified and compared with that of a neural network.


Introduction
Artificial intelligence is driving the advancement of technologies in the fields of healthcare, industrial automation, transportation, etc. (National Artificial Intelligence Initiative, 2020). The enormous amount of data generated at different locations from various sources, such as sensors and Internet of Things devices, requires high bandwidth to transmit. Low-latency real-time decision making also requires a continuous network connection. Failure to meet bandwidth or network connection requirements would lead to unreliable decision making at a centralized location, making localized decision-making capability the more appropriate solution. For localized decision making (i.e., edge computing) using data-driven models, it is challenging to incorporate all possible data patterns into the models. Data collected from various sites provide a better estimation of the population than do data collected at a single site. However, privacy, security, legal, and commercial concerns restrict data sharing. For example, the U.S. Health Insurance Portability and Accountability Act (Edemekong, Annamaraju, & Haydel, 2018) requires that medical and individual data only be released with proper anonymization (Li & Qin, 2017). Similarly, in commercial terms, data related to processes and manufacturing are often a valuable business asset. However, the central accumulation of summaries or obfuscated models may be considered reasonable as long as the original data are not revealed. Thus, it is essential to enable privacy-preserving distributed mining of information and decision making.

Koushik A. Manjunatha et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
That said, a comprehensive model can be developed by combining all the local models into a single aggregated model. Realization of a comprehensive model is achieved via a central entity (server) with which all the distributed edge devices share their local model at customized intervals. This relies on performing distributed training on the edge devices and edge servers where the data is generated or collected, without necessarily sending data outside the enterprise firewall to the central server.
Often, edge devices have limited computational capabilities, meaning frequent transmitting and receiving will not be energy efficient. This challenge increases with complex models such as deep neural networks (Goodfellow, Bengio, & Courville, 2016). In addition, the more complex models are "black box" in nature, and there is always an added challenge in deriving the proper appreciation for data-driven models from an enterprise/business perspective. In this research, we focus on a specific classifier: the support vector machine (SVM) (Suykens & Vandewalle, 1999). We leverage this popular classification model in machine learning (ML) to build a collaborative framework for distributed learning, along with a unique approach to model aggregation. Over the last decades, several collaborative SVM-based data mining approaches have been proposed in order to enhance distributed learning. A privacy-preserving SVM classification was proposed in (Vaidya, Yu, & Jiang, 2008), constructing the global SVM classification model from data distributed at multiple parties without disclosing the parties' data to each other. Using a similar approach, a collaborative learning framework (Que, Jiang, & Ohno-Machado, 2012) was developed for SVM. Collaborative multi-kernel SVM (MK-SVM) with the alternating direction method of multipliers was used to globally optimize the distributed sub-models (Chen & Fan, 2012). In this approach, the training matrix is partitioned into blocks in two different ways (i.e., column partitioning and row partitioning), and multiple kernels are extracted for each model and then aggregated. Federated learning (FL) with SVM was implemented in (Carlsson, 2020), with the kernel values being combined and then shared back to the local models.

Figure 1. Federated transfer learning framework for condition assessments of vertical motor-driven pumps in the circulating water systems of two plant sites.
To address collaborative learning challenges through SVM, this work proposes multi-kernel-based adaptive SVM (MK-A-SVM)-based federated transfer learning (FTL), as shown in Figure 1. FTL is a combination of FL and transfer learning (TL). FL is a collaborative learning technique in which many clients collaboratively train a model under the orchestration of a central controller, without exchanging the parties' original data. FL enables focused data collection and data minimization by reducing the systematic privacy risks and costs resulting from traditional, centralized ML. The FL process is typically driven by model engineers who develop ML models for particular applications. TL builds an effective model for the target domain while leveraging knowledge from the other (source) domains. The main advantages of TL are that the training time is significantly reduced and only a very small amount of training data (or none at all) is required to leverage pretrained models. To enable FTL across heterogeneous data, a feature-group-based MK-SVM was developed. The features are grouped based on measurement type, with each group having its own kernel/parameter settings. The predicted category is the weighted sum of the contributions from each kernel. To enable TL, the MK-SVM is modified by integrating an adaptive SVM framework and redefining the optimization approach for SVM. The adaptive nature of the MK-A-SVM also enables the integration of other ML models (e.g., neural networks [NNs] and Bayesian methods) with SVMs in TL. Unlike traditional SVM, MK-A-SVM supports (1) the adoption of multiple and feature-specific kernels, and (2) a resilient approach to dealing with missing measurements.

Multi-Kernel Adaptive Support Vector Machine
SVM (Suykens & Vandewalle, 1999) is a discriminative classifier that finds, in a higher dimensional space, a hyperplane that distinctly classifies the data points. To separate two classes, there may be many possible hyperplanes; SVM finds the maximum-margin hyperplane, i.e., the one with the longest distance to the data points of both classes. Support vectors are the data points closest to the hyperplane that influence its position and orientation. For an input feature vector x ∈ X with labels y ∈ {−1, 1}, the general expression for the soft-margin classifier in dual form with regularization parameter C is given by (Suykens & Vandewalle, 1999):

\[
\max_{\alpha}\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\; \sum_{i=1}^{m} \alpha_i y_i = 0 \tag{1}
\]

where α_i are the Lagrange multipliers and K(·, ·) is a kernel function.
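As a minimal sketch with synthetic, hypothetical data, the scikit-learn `SVC` below exposes the quantities named above: the support vectors and the dual coefficients α_i·y_i obtained by solving this quadratic program. The data and hyperparameters are illustrative assumptions, not values from this work.

```python
# Soft-margin SVM on two synthetic clusters; inspect support vectors
# and the dual (Lagrange) coefficients produced by the QP solver.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)),   # class -1 cluster
               rng.normal(1, 0.5, (20, 2))])   # class +1 cluster
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)

print(clf.support_vectors_.shape)  # data points that define the margin
print(clf.dual_coef_.shape)        # alpha_i * y_i for the support vectors
print(clf.predict([[0.9, 1.1], [-1.0, -0.8]]))
```

Only the support vectors carry nonzero α_i, which is what makes the decision function sparse in the training data.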

MK-SVM
With the kernel function, the soft-margin decision function for SVM is defined by:

\[
f(x) = \operatorname{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\Big) \tag{2}
\]

Let [X]_{m×n} represent a data matrix with m samples and n features. The sample set can be vertically partitioned based on feature type. The MK-SVM across the vertically partitioned sample set can be determined by computing a net kernel matrix (i.e., a gram matrix) K = K(x_i, x_j) from the individual matrices determined from each vertical partition. In the case of two partitions, the [X]_{m×n} data matrix can be vertically partitioned into X_1 and X_2. Then X_1 and X_2 will have K_1 = K(X_1, X_1^T) and K_2 = K(X_2, X_2^T) as their gram matrices, respectively. The net gram matrix can next be determined as the linear combination of the individual gram matrices. Let the (i, j)-th element of K represent K(x_i, x_j), and let x_i^1 and x_i^2 be the vertically partitioned vectors of x_i from X_1 and X_2, respectively. Accordingly:

\[
K(x_i, x_j) = K_1(x_i^1, x_j^1) + K_2(x_i^2, x_j^2) \tag{3}
\]

The net gram matrix can also be obtained via a weighted summation of the individual kernels. Hence, for G vertical partitions, equation (3) can be generalized as:

\[
K = \sum_{i=1}^{G} \beta_i K_i, \qquad \sum_{i=1}^{G} \beta_i = 1 \tag{4}
\]

where β_i is the weight associated with each local gram matrix. Using the net kernel matrix in equation (1), the SVM model parameters can be obtained by solving a quadratic programming problem. Note that each gram matrix will be a square matrix. Thus, using the MK approach, FL across P parties can be performed to generate a global (master) gram matrix as the weighted sum of the net kernel matrices K_t (determined via equation (4)) for t = 1, 2, . . . , P. The formulation to generate a global gram matrix is given by:

\[
K_g = \sum_{t=1}^{P} p_t K_t, \qquad \sum_{t=1}^{P} p_t = 1 \tag{5}
\]

where p_t is the weight/importance associated with each net gram matrix.
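The feature-group kernel combination in equation (4) can be sketched with scikit-learn's precomputed-kernel interface; the data, the two-group partition, and the weights β_i below are illustrative assumptions.

```python
# Net gram matrix for two vertical feature partitions: one RBF kernel
# per group, combined by weights beta_i (equation (4)), then used to
# fit an SVM via a precomputed kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = np.where(X[:, 0] + X[:, 3] > 0, 1, -1)

groups = [X[:, :3], X[:, 3:]]   # vertical partitions X1 (3 feats), X2 (2 feats)
betas = [0.5, 0.5]              # weights with sum(beta_i) = 1

# K = sum_i beta_i * K_i, each K_i an m-by-m gram matrix on its partition
K = sum(b * rbf_kernel(g, g, gamma=1.0) for b, g in zip(betas, groups))

clf = SVC(C=1.0, kernel="precomputed").fit(K, y)
print(clf.score(K, y))  # training accuracy using the combined kernel
```

Each partition contributes its own gram matrix, so heterogeneous feature groups can use different kernel settings while the SVM still solves a single QP.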
The importance of each net gram matrix can be considered equal (1/P) or be based on each party's contribution to overall performance (e.g., prediction accuracy or F_β-score), denoted ACC_t:

\[
p_t = \frac{ACC_t}{\sum_{t=1}^{P} ACC_t} \tag{6}
\]

As per equation (6), kernel aggregation is conducted by giving the highest importance to the most accurate individual model, and the lowest importance to the least accurate model. Then the K_g matrix, which captures patterns from all the individual models, is used to retrain each individual model. After retraining with K_g, all groups will have a global model, and the above process of individually updating the kernel matrices and constructing a global matrix from all P parties continues. Substituting equation (4) into (5), we get:

\[
K_g = \sum_{t=1}^{P} p_t \sum_{i=1}^{G} \beta_{t,i} K_{t,i} \tag{7}
\]

The global gram matrix will be shared with each party. Then, using the global gram matrix in equation (1), each party can obtain its SVM model parameters by solving a quadratic programming problem. Creating a feature-group-based MK-SVM model enables the aggregation of kernel matrices across individual parties with similar or partially heterogeneous features.
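The accuracy-weighted aggregation of equations (5) and (6) amounts to a few lines of NumPy; the per-party gram matrices and accuracy values below are illustrative placeholders.

```python
# Accuracy-weighted federated aggregation of net gram matrices:
# p_t = ACC_t / sum(ACC) (equation (6)), K_g = sum_t p_t K_t (equation (5)).
import numpy as np

def aggregate_gram(gram_list, acc_list):
    """Return the global gram matrix K_g and the party weights p."""
    acc = np.asarray(acc_list, dtype=float)
    p = acc / acc.sum()                       # weights sum to 1
    K_g = sum(w * K for w, K in zip(p, gram_list))
    return K_g, p

rng = np.random.default_rng(2)
# Three parties, each with a 10x10 symmetric gram matrix (toy stand-ins)
grams = [np.corrcoef(rng.normal(size=(10, 4))) for _ in range(3)]
K_g, p = aggregate_gram(grams, acc_list=[0.95, 0.90, 0.80])
print(p)  # the most accurate party receives the largest weight
```

K_g is then shared back to every party, which re-solves its SVM QP against the aggregated kernel.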

Adaptive SVM
Adaptive SVM is useful for transferring a trained model from one system to another, or for performing domain adaptation. The adaptive SVM learns from the source model f_s(x) by regularizing the distance between the learned (target) model f(x) and f_s(x). These source models can be trained using any algorithm (e.g., SVM, decision tree, NN, or naive Bayes). Let [X]^T_{m'×n'} represent a data matrix with m' samples and n' features for the target model. The distribution of the target data is likely to differ from that of the source data. Let f_s^k(x), k = 1, . . . , M, represent the decision models available from the source data, our goal being to learn a classifier f(x) using the M source models as per:

\[
f(x) = \sum_{k=1}^{M} t_k f_s^k(x) + \Delta f(x), \qquad \Delta f(x) = w^T \phi(x) \tag{8}
\]

where t_k is the weight associated with each source model and Δf(x) is a perturbation function learned from the target data. The objective is to determine a new decision boundary that is close to Σ_{k=1}^{M} t_k f_s^k(x) and accurately predicts on [X]^T_{m'×n'}. Thus:

\[
\min_{w,\,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m'} \xi_i
\quad \text{s.t.}\quad y_i\Big(\sum_{k=1}^{M} t_k f_s^k(x_i) + w^T \phi(x_i)\Big) \ge 1 - \xi_i,\; \xi_i \ge 0 \tag{9}
\]

Equation (9) is an intuitive interpretation of regularization aimed at minimizing the distance between f(x) and the source models f_s(x): minimizing ‖w‖² keeps the perturbation Δf(x) small. To learn the parameters, the objective function in equation (1) can be rewritten as follows (Aytar & Zisserman, 2011; Yang, Yan, & Hauptmann, 2007):

\[
\max_{\alpha}\; \sum_{i=1}^{m'} \alpha_i \Big(1 - y_i \sum_{k=1}^{M} t_k f_s^k(x_i)\Big)
- \frac{1}{2}\sum_{i=1}^{m'}\sum_{j=1}^{m'} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C \tag{10}
\]

Given the α̃ determined from equation (10) and the results from the MK-SVM in equation (7), the MK-A-SVM decision function using equation (8) can be written as:

\[
f(x) = \sum_{k=1}^{M} t_k f_s^k(x) + \sum_{i=1}^{m'} \tilde{\alpha}_i y_i K_g(x_i, x) \tag{11}
\]
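A rough sketch of the adaptive decision rule in equation (8): the target score is the weighted source-model score plus a learned residual. For brevity, the residual here is approximated by a standard SVM trained on the target data rather than by solving the full dual in equation (10); all data, models, and the weight t_k are illustrative assumptions.

```python
# Adaptive-SVM-style decision: f(x) = t_k * f_s(x) + delta_f(x),
# with a source model trained on source-domain data and a residual
# model trained on (shifted) target-domain data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X_src = rng.normal(size=(100, 2))
y_src = np.where(X_src[:, 0] > 0, 1, -1)          # source-domain labels
X_tgt = rng.normal(loc=0.3, size=(40, 2))         # shifted target domain
y_tgt = np.where(X_tgt.sum(axis=1) > 0.3, 1, -1)  # target-domain labels

source = LogisticRegression().fit(X_src, y_src)   # f_s: any algorithm works
t_k = 1.0                                         # single source model weight

f_s_scores = source.decision_function(X_tgt)      # source scores on target data
delta = SVC(kernel="rbf", gamma=1.0).fit(X_tgt, y_tgt)  # stand-in for delta_f
scores = t_k * f_s_scores + delta.decision_function(X_tgt)
pred = np.where(scores >= 0, 1, -1)
print((pred == y_tgt).mean())
```

Because the source model enters only through its scores f_s(x), it can come from any learner, which is what lets MK-A-SVM absorb NN or Bayesian source models in TL.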

Numerical Results
To validate FTL using MK-A-SVM, the circulating water systems (CWSs) at two nuclear power plants (NPPs) were selected as the identified plant assets. The CWS is an important non-safety-related system. As the heat sinks for the main steam turbine and associated auxiliaries, the CWSs at Plant Sites 1 and 2 are designed to maximize steam power cycle efficiency (NRC, 2009). Plant Site 1 (a two-unit pressurized-water reactor) features six circulators at each unit. Schematic representations of the main condensers for Plant Site 1 Unit 2 are shown in Figure 2a. Each pair of waterboxes in the condenser is named according to the following convention: Unit #, Condenser #A, and Unit #, Condenser #B. Plant Site 2 (a single-unit boiling-water reactor) has four circulators. A schematic representation of the Plant Site 2 CWS is shown in Figure 2b, and several distinct differences are seen when comparing it to the Plant Site 1 CWS. These include: (1) the water supply to the Plant Site 2 CWS comes from a cooling tower water basin, not directly from the river; and (2) four circulators feed six waterboxes via a common header, unlike the Plant Site 1 CWS, in which each waterbox has its own circulator.

Fault Signatures of the NPP Asset
Fault signatures enable informed decision making to prevent potential failure of a plant asset. They can also be used for root cause analysis if a failure occurs. In theory, the different fault modes associated with a plant asset (e.g., the CWS) have unique, consistently identifiable fault signatures. In practice, fault signature identification and diagnosis are not straightforward and can benefit from analyses of historical data. Each detected fault signature for a particular degradation mode should have its feature verification and confidence enhanced by selecting additional process and condition monitoring data that provide complementary information.

Of the faults of interest examined, only waterbox fouling caused numerous instances of circulating water pump (CWP) shutdowns, even though this fault is not a pump/motor fault but rather a system fault whose symptoms may affect pump performance. Because waterbox fouling occurred so commonly, enough data were available to allow development and testing of condition-based monitoring algorithms. Fault types that caused only a single instance of CWP shutdown provided limited information for developing fault signatures and training ML algorithms. The potential fault signatures contained within those data are not readily resolvable at this time. A potential way of addressing this sparseness in some of the fault signatures is to leverage simulated data generated from a first-principles model of the CWS motor and pump (M&P) set. It is anticipated that, as ML technologies mature for operational plant applications, these subtle faults will be identified. This section discusses two examples of waterbox fouling: the first from Plant Site 1 and the second from Plant Site 2.

These two examples highlight the similarities and differences in the fault signatures for waterbox fouling, and are a perfect lead-in to why FTL is required for predictive modeling.
The primary issue noted with the Plant Site 1 CWS is waterbox fouling, which typically occurs due to the accumulation of grass/debris in the waterboxes and causes condenser tube blockage and reduced circulating water flow. This is a unique and frequent issue at Plant Site 1, since the Plant Site 1 CWP intake comes directly from the river, which produces a significant quantity of grass/debris. The grassing season typically occurs between February 1 and May 31 (NRC, 2009). Grassing often emerges from the river during high-wind conditions associated with storms. During these periods, the motor current can oscillate with river level changes. Operations staff monitor the waterbox motor current and inlet pressure, and schedule waterbox cleanings based on deviations in motor current and inlet pressure when compared against historical baseline data. Waterbox fouling is typically identified via a motor current increase (also, though far less frequently, a motor current decrease), an inlet pressure increase, a waterbox differential temperature (DT) increase, and condenser thermal performance loss. Figure 3a shows an instance of waterbox fouling diagnosed in Plant Site 1 Unit 2's CWP 22B. An upward drift in DT and motor current was identified on July 23, 2018. Consequently, the gross load began to dip. Note that, in Figure 3a, the CWP 22B motor current increased from 231 to 245 amps, and the DT increased from 14 to 16 °F, with the gross load not trending as expected. The motor current and DT decreased to 220 amps and 14 °F, respectively, following the waterbox cleaning on August 25, 2018, resulting in a 30-40 MWe improvement in gross load. The waterbox fault and approximate date of the shutdown were found by searching the work order database and narrative log information. For Plant Site 2, waterbox fouling is not a major fault, yet it remains of interest.
The cause of waterbox fouling at Plant Site 2 is once again debris (limited grassing) in the water circulating in and out of the cooling tower basin. Figure 3b shows an instance of waterbox fouling in Plant Site 2 waterbox A. Under normal operating conditions with no faults, the differential pressure (DP) across the Plant Site 2 CWPs averages 40-41 PSIG. Note that, in Figure 3b, around December 23, 2017, CWP A's DP began trending upward, exceeding 43 PSIG. Following the DP trend, the DT across the north and south ends of waterbox A also trended upward over the same time period. These slow, steady increases in the DP and DT trends are indicative of waterbox fouling. Following a waterbox cleaning around January 23, 2018, the DP dropped to near 41 PSIG, and the DT also stabilized.
These two examples show that different fault features can indicate the same fault. Developing a comprehensive fault signature for each fault mode is key to achieving scalable, accurate predictive models. For other CWP fault signatures, see (Agarwal et al., 2021).

Feature Extraction
To develop an FTL-based predictive model, features were extracted based on the identified fault signatures.

Plant Site 1
From the CWS-associated plant operational data, the following features were extracted for each M&P set:

• DT, calculated as the difference between the outlet water temperature associated with the M&P set and the inlet river temperature
• The measured motor in-board (MIB) temperature, motor out-board (MOB) temperature, and motor stator (MS) temperature
• Motor age (M_Age) and pump age (P_Age), calculated from historical CWS M&P replacement/refurbishment dates by considering the M&P run-hours from one replacement to the next
• Week of the year, calculated for every timestamp and used as a feature to capture seasonal effects in the data.
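An illustrative sketch of this feature extraction in pandas; the column names, sample timestamps, and replacement date are hypothetical placeholders, not values from the plant data.

```python
# Derive DT, week-of-year, and an age feature from (toy) operational data.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-07-23", "2018-08-25"]),
    "outlet_temp_F": [92.0, 88.0],        # M&P outlet water temperature
    "inlet_river_temp_F": [78.0, 74.0],   # river (inlet) temperature
})

# Differential temperature feature
df["DT"] = df["outlet_temp_F"] - df["inlet_river_temp_F"]
# Seasonal feature: ISO week of the year for each timestamp
df["week_of_year"] = df["timestamp"].dt.isocalendar().week
# Age feature from a (hypothetical) last-replacement date, in hours
replaced = pd.Timestamp("2016-06-01")
df["M_age_hours"] = (df["timestamp"] - replaced) / pd.Timedelta(hours=1)

print(df[["DT", "week_of_year", "M_age_hours"]])
```

In practice the run-hours would come from the replacement/refurbishment history rather than a single fixed date.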
Thus, a total of seven features were extracted from the CWS plant operational data for each M&P set. Detailed information on feature extraction from the plant operational data, as well as from vibration data, can be found in (Agarwal et al., 2021). For model development, plant operational data after 2016 were considered, because Plant Site 1 first adopted a new six-year CWP replacement preventive maintenance (PM) strategy at that time.
Since 2016, each Plant Site 1 unit has had its CWPs periodically replaced as per the updated PM strategy. The age of the M&P set is estimated based on the date of replacement of each CWP. If any faults in the M&P are identified post-replacement, the data corresponding to that fault and time period are labeled as unhealthy or healthy accordingly.

Plant Site 2
From the Plant Site 2 CWS-associated plant operational data, the following features were extracted for each M&P set:

• DT, calculated as the average of the DTs measured at the north and south condensers (the DT at each condenser being the difference between the respective condenser inlet and outlet temperatures)
• The measured MIB (thrust) temperature, MOB temperature, and motor winding (stator) (MS) temperature
• Week of the year, calculated for every timestamp and used as a feature to capture seasonal effects in the data.
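The Plant Site 2 DT feature, the average of the north and south condenser differential temperatures, can be sketched as follows; the column names and values are hypothetical placeholders.

```python
# Average the north/south condenser DTs into a single DT feature.
import pandas as pd

df = pd.DataFrame({
    "north_out_F": [90.0], "north_in_F": [80.0],   # north condenser temps
    "south_out_F": [92.0], "south_in_F": [80.0],   # south condenser temps
})

dt_north = df["north_out_F"] - df["north_in_F"]
dt_south = df["south_out_F"] - df["south_in_F"]
df["DT"] = (dt_north + dt_south) / 2               # averaged DT feature

print(df["DT"].iloc[0])
```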
Thus, five features were extracted from the Plant Site 2 data. As no historical CWS M&P replacement/refurbishment date information was available for Plant Site 2, motor age (M_Age) and pump age (P_Age) were not calculated. This is acceptable because most of the fault data captured in the unhealthy class are associated with waterbox fouling, which is best reflected by the DT information rather than by the other features discussed above.

FL-based CWP Motor Health Prediction
FL was demonstrated using an MK-SVM (Chen & Fan, 2012) that classifies whether a CWP is in a healthy or unhealthy state. FL was demonstrated on the Plant Site 1 data, with each local model being developed for a pair of CWPs connected to a common waterbox, as shown in Figure 2a. Since there are three waterboxes at each of the two Plant Site 1 units, this gives six local models to be combined into a master model via the FL approach (see Figure 1). The samples, grouped based on CWP combinations, are then split into training and test samples in accordance with an 80:20 ratio, as shown in Table 1. As per equation (11), the extracted features are grouped into three categories, an individual kernel matrix is built for each group, and the final decision f(x) is determined as a weighted combination of the predictions from each kernel associated with each feature group, as shown in Figure 6. The selection of the optimal hyperparameters C and γ for the MK-A-SVM with radial basis function kernels is performed using grid-search cross validation to predict CWP conditions. Since there are three feature groups, the MK-A-SVM takes three γ parameters (γ_1, γ_2, and γ_3), each of which must be optimally tuned. In this work, for the sake of simplicity, only one γ parameter is tuned, and it is set as γ = γ_1 = γ_2 = γ_3. The parameter γ was varied from 10^-3 to 10^3, and the regularization parameter C was varied from 10^-3 to 10^2. For predicting CWP conditions, γ = 100 and C = 0.001 achieved the highest prediction accuracy: 95.58%. Note that determining the optimal hyperparameter values separately for each feature group (and the weight β_i associated with each feature group) is beyond the scope of this work. The results of the MK-A-SVM-based individual learning and FL on Plant Site 1 are presented in Table 2. From Table 2, it is seen that the individual models from each group achieved a performance of close to 100% in most of the MK-SVM models.
This is a clear indication of overfitting in the individual models, with the models being unable to predict other datasets or unseen data with the same level of accuracy. In addition, for some models, the accuracy on the test samples is higher than that on the training samples, since the test data were sometimes easier for the model to predict than the training data. Although the data labels are highly imbalanced, the F1 scores for all the models were above 98% (during individual training and after FL aggregation), indicating the prediction is not biased toward healthy class labels. After applying FL-based model aggregation and retraining each individual model, the accuracy levels decreased for most of the models, though performance remained at acceptable levels. As a comparison, FL aggregation was performed based on NNs, and the results are closely comparable with those of the MK-A-SVM model. FL aggregation over several iterations can further mitigate overfitting while maintaining acceptable performance of the diagnostic model. This exemplifies how FL-based model aggregation enables aggregation of diagnostic models from the component level to the plant level. The fact that the models are trained with limited datasets also impacts FL performance, which is anticipated to improve with larger training datasets.
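The shared-γ grid search described above can be sketched as follows; the data, feature-group slices, train/test split, and grids are illustrative assumptions, and the kernel weights β_i are fixed equal rather than tuned.

```python
# Grid search over a single shared gamma and C for a three-group
# multi-kernel RBF SVM, selected on a held-out split.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 6))
y = np.where(X[:, 0] - X[:, 4] > 0, 1, -1)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

groups = [slice(0, 2), slice(2, 4), slice(4, 6)]   # three feature groups
beta = [1 / 3] * 3                                 # equal kernel weights

def net_kernel(A, B, gamma):
    """Weighted sum of per-group RBF kernels with one shared gamma."""
    return sum(b * rbf_kernel(A[:, g], B[:, g], gamma=gamma)
               for b, g in zip(beta, groups))

best = (-1.0, None)
for gamma in [1e-3, 1e-1, 1e1, 1e3]:
    for C in [1e-3, 1e-1, 1e1, 1e2]:
        clf = SVC(C=C, kernel="precomputed").fit(net_kernel(Xtr, Xtr, gamma), ytr)
        acc = clf.score(net_kernel(Xte, Xtr, gamma), yte)   # K(test, train)
        best = max(best, (acc, (gamma, C)))

print(best)  # (best held-out accuracy, (gamma, C))
```

Note that prediction with a precomputed kernel requires the rectangular matrix K(test, train), not a square gram matrix.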

MK-A-SVM Performance Analysis
Further work must be done to build more robust individual models that generalize more readily to previously unseen data. However, as noted above, sufficiently high accuracy is obtainable while retaining a uniform (and thus simpler) architecture for all individual models. The FL approach showed much stronger test set performance for the CWPs than was seen in the individual phase. The added information afforded by examining all the pump data en masse provided clear advantages to the federated model-building process. For MK-SVM-based TL, the overall performance on both the CWP A and CWP C data is around 80%. This approach involves using all the samples from CWP A and CWP C as test data in order to classify health using the master model from the FL framework. The performance indicates that the MK-SVM parameters must be further optimized to improve the prediction accuracy. Typically in TL, a small set of sample data is used to retrain the transferred model in order to fine-tune the model parameters for the new environment (i.e., Plant Site 2). For example, only 10-20% of the total number of samples would be used to retrain the model and optimize the parameters of the MK-SVM for the Plant Site 2 data. After retraining with 20% of the data, CWP A's performance did not improve, whereas CWP C's performance improved significantly (to higher than 95% accuracy). The performance of CWP A with TL indicates there were insufficient samples for building the ML model. For comparison with TL, individual models were also trained on the Plant Site 2 data, with an 80:20 split between the training and test data. Individual model performance, particularly for CWP A, clearly shows the same overfitting trend as seen in the FL case. More training samples are required in order to generalize the model and avoid overfitting. In comparison, the NN performance is higher than that of the MK-A-SVM for both CWP A and CWP C.
This is partly because further enhancements to the optimization are necessary, in terms of optimizing the weights associated with the multiple kernels.

Table 3. TL performance (in %) on Plant Site 2 data, using MK-A-SVM and NN models from FL
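The TL fine-tuning step described above, retraining the transferred model on a small (20%) slice of target-site data, can be sketched as follows. The transferred model here is a plain SVC stand-in for the FL master model, and the data are synthetic placeholders.

```python
# Fine-tune a transferred classifier on 20% of target-site samples and
# evaluate on the remaining 80%.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X_tgt = rng.normal(size=(200, 4))
y_tgt = np.where(X_tgt[:, 1] > 0.2, 1, -1)   # target-site labels (toy rule)

# 20% of target samples for fine-tuning; the rest held out for evaluation
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_tgt, y_tgt, train_size=0.2, random_state=0)

transferred = SVC(C=1.0, gamma=0.1)   # stand-in for the FL master model
transferred.fit(X_fit, y_fit)         # retrain on the small target slice
score = transferred.score(X_eval, y_eval)
print(score)
```

Whether such a small slice suffices depends on how well the source and target distributions overlap, which is exactly the CWP A vs. CWP C difference reported above.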

Practical Considerations
The practical applicability of the proposed FTL using MK-A-SVM can be considered for three main scenarios. First, for distributed asset/entity scenarios in which data sharing is challenging due to privacy, communication bandwidth, or commercial concerns. Second, for implementing data-driven decision making on a new asset/entity by transferring a previously trained model. Finally, for developing a model that captures all the patterns of an asset's or equipment's behavior.

Conclusion
An MK-A-SVM was developed to demonstrate an FTL framework for the application of predicting CWP health conditions in NPP CWSs. The Plant Site 1 CWS fault signatures were used to develop FL based on the MK-SVM. The federated models developed for Plant Site 1 were then used to estimate the state of health of the Plant Site 2 CWS (a process referred to as TL). The results obtained were comparable to those of predictive models individually trained on Plant Site 2 data. This demonstrates the significance of the FTL approach and the feasibility of developing it through the MK-A-SVM. The performance results were also compared with those of an NN algorithm. As a path forward, we continue to update and improve the MK-A-SVM framework to continuously optimize the weight parameters associated with the multiple kernels. In addition, future work will expand the approach to multi-class classification.

Acknowledgements
Vivek