Rethinking Reliability in Terms of Margins

Current reliability approaches were designed to assess and quantify the reliability associated with complex systems such as nuclear power plants (NPPs). These approaches are generally based on classical Boolean logic structures such as event trees (ETs) and fault trees (FTs) [Rausand, 2020]. The outcome obtained by combining FTs and ETs is the set of minimal cut sets (MCSs), with each MCS representing a unique combination of basic events (BEs) that leads to an undesired outcome (e.g., core damage). Probabilistic evaluation of an MCS is performed by evaluating the product of the probability values associated with each BE. A relevant factor here is that the probability values associated with BEs used in the plant models are updated at least every 4 years based on past operational experience and through the use of a Bayesian statistical process [Siu, 1998]. Hence, the probability value of a BE associated with a physical asset (e.g., a centrifugal pump or motor-operated valve) does not reflect that asset's actual condition and performance.
This fact plays a major role in the application of plant reliability models to support risk-informed decisions. With the particular goal of reducing operation and maintenance costs, existing NPPs are moving from corrective and periodic maintenance toward new types of predictive maintenance strategies [Agarwal, 2021]. This transition is designed such that maintenance is conducted only when the asset requires it (i.e., prior to undergoing imminent failure). Although these benefits cannot be achieved through current reliability modeling methods and currently employed reliability data, they can be achieved by employing asset-monitoring sensors, automated data acquisition systems, data analysis methods, and improved decision-making processes. Combined, these resources can provide precise information on the health of an asset, track its degradation trends, and estimate its expected failure time. Based on such information, maintenance operations can be scheduled and performed for each asset on an as-needed basis. This dynamic context of predictive maintenance operations requires new methods of data analysis, the propagation of asset health information from the asset level to the system level, and the optimization of plant resources.
This paper provides an alternative reliability approach designed for a predictive maintenance context in which a direct link is created between monitoring data and decision-making. Rather than thinking of reliability in terms of system/asset probability of failure, we propose a reliability mindset based on the concept of margin [Mandelli, 2023]. An asset's health is quantified by determining its margin, based on the asset's current and historical monitoring data. The margin values of the monitored assets are then propagated through system reliability models (e.g., FTs or reliability block diagrams) to identify the assets that are most critical to guaranteeing system operation. We show how a margin-based approach can be used to assess asset health, based solely on current and historic monitoring data (e.g., condition-based, anomaly detection, diagnostic, and prognostic data) [Xingang, 2021]. A margin-based approach directly addresses the limitations of classical reliability modeling approaches and provides a snapshot of system health, given the availability of monitoring data. These two different approaches are designed to address different types of decisions: classical reliability models support static decisions (e.g., a set frequency of periodic maintenance or surveillance operations) based on past operational experience, whereas a margin-based approach directly supports dynamic decisions involving maintenance operations that should only be performed when necessary, based on monitoring data (i.e., a predictive maintenance context).


INTRODUCTION
Current reliability approaches assess and quantify the reliability associated with complex systems, such as nuclear power plants (NPPs). These approaches are generally based on classical Boolean logic structures, such as event trees (ETs) and fault trees (FTs) (Rausand, 2020). The outcome obtained by combining FTs and ETs is the set of minimal cut sets (MCSs), with each MCS representing a unique combination of basic events (BEs) that leads to an undesired outcome (e.g., core damage). The probabilistic evaluation of an MCS is performed by evaluating the product of the probability values associated with each BE. A relevant factor here is that the probability values associated with the BEs used in the plant models are updated at least every 4 years based on past operational experience through a Bayesian statistical process (Siu, 1998). Hence, the probability value of a BE associated with a physical asset (e.g., a centrifugal pump or motor-operated valve) does not reflect that asset's actual condition and performance.
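As a minimal illustration of the MCS evaluation described above, the following sketch multiplies the BE probabilities of one cut set. The function name and the numerical values are hypothetical, and the product form assumes that the BEs are independent.

```python
from math import prod


def mcs_probability(be_probabilities):
    """Probability of one minimal cut set as the product of its BE probabilities.

    Assumes the basic events are independent.
    """
    for p in be_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"invalid probability: {p}")
    return prod(be_probabilities)


# Hypothetical BE probabilities (e.g., pump fails to run, valve fails to open).
p_mcs = mcs_probability([1e-3, 2e-2])
```

Note that the inputs are fixed, experience-based values: nothing in this calculation changes when the monitored condition of the pump or valve changes.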
This fact plays a major role in the application of plant reliability models to support risk-informed decisions. To reduce operation and maintenance costs, existing NPPs are moving from corrective and periodic maintenance to new types of predictive maintenance strategies (Agarwal, 2021). This transition is designed such that maintenance is conducted only when the asset requires it (i.e., prior to undergoing imminent failure). Although these benefits cannot be achieved through current reliability modeling methods and currently employed reliability data, they can be achieved by employing asset-monitoring sensors, automated data acquisition systems, data analysis methods, and improved decision-making processes. Combined, these resources can provide precise information on asset health, track its degradation trends, and estimate its expected failure time. Based on such information, maintenance operations can be scheduled and performed for each asset as needed. This dynamic context of predictive maintenance operations requires new methods of data analysis, propagating asset health information from the asset level to the system level, and optimizing plant resources.
This paper provides an alternative reliability approach for a predictive maintenance context in which a direct link is created between equipment reliability (ER) data and decision-making. Rather than thinking of reliability in terms of system and asset probability of failure, we propose a reliability mindset based on the concept of margin (Mandelli, 2023). An asset's health is quantified by determining its margin, based on the asset's current and historical monitoring data. The margin values of the monitored assets are then propagated through system reliability models (e.g., FTs or reliability block diagrams) to identify the assets that are most critical to guaranteeing system operation. We show how a margin-based approach can assess asset health, based solely on current and historic monitoring data (e.g., condition-based, anomaly detection, diagnostic, and prognostic data) (Xingang, 2021). The novelty of this margin-based approach is that it directly addresses the limitations of classical reliability modeling approaches by propagating and integrating health data from the asset to the system level.
Here we focus on directly employing ER data to effectively optimize maintenance operations (Pinciroli, 2023). As part of this decision-making process, the assessment of current and/or future asset conditions is required; this knowledge can be produced by employing condition monitoring, anomaly detection, or prognostic systems (Zio, 2022). In particular, we answer the following question: how can health data/knowledge be propagated from the asset to the system level? It is common practice to measure asset health using asset health indices (AHIs) (Hjartarson, 2006); the definition of such indices is typically situation and asset dependent (e.g., an AHI might be defined using a color-coded strategy or can be numerically quantified using arbitrary scales). One of the objectives of this paper is to provide a margin-based definition of AHI that is consistent across several operational contexts. The concept of margin is here borrowed from structural reliability analysis theory, where margin is defined as the "distance" between the load and resistance probability distribution functions (Melchers, 2018). Lewis (2022) presents a bridge between prognostics and health management (PHM) and probabilistic risk assessment (PRA). Our work conceptually differs from that of Lewis (2022) in two respects. The first is that once an asset is supported by a PHM system, the reliability of that asset loses most of its stochasticity, since the monitoring of the asset is designed to inform on the conditions that might lead to its failure. The second concerns the kind of decisions that PHM and PRA support. Mandelli (2023) argues that PHM systems support dynamic decisions where maintenance activities are scheduled only when required, while PRA models support static decisions, such as setting periodic surveillance and maintenance activities.

MARGIN MODELING
Reference (Mandelli, 2023) expands the meaning of the word "reliability" to better reflect the needs of system health and asset management decision-making processes. Rather than focusing on the likelihood of a given event (in probabilistic terms), we think in terms of how far this event is from occurring. This new interpretation of reliability shifts the focus away from probability of occurrence and toward an assessment of how close an asset is to reaching an unacceptable level of performance or failing (see Fig. 1). Note that two data elements are required for this assessment: the estimated actual health condition of the asset, which can be acquired by the asset-monitoring system or through diagnostic methods, and the limiting conditions that must be avoided, which can be acquired from past operational experience (e.g., monitoring data generated by similar assets under failure conditions).
An asset's margin value $M$ is defined over the [0,1] interval, where $M = 1$ corresponds to a perfectly healthy asset (requiring minimal to no maintenance attention) and $M = 0$ corresponds to a faulty asset (requiring maintenance attention). Figure 1 provides a glimpse (in graphical form) of how a link between monitoring data and decision-making can be established through a margin-based reliability mindset.
Note that margin quantification is impacted by the availability of monitoring data and can be defined over heterogeneous variables, such as pressure, vibration spectra, and time. For example, when dealing with condition-based monitoring data (both current and archived), margin $M$ is defined here as the distance between actual conditions and past conditions (e.g., oil temperature and vibration spectrum) that led to failure (see Fig. 2). Hence, margin-based reliability modeling provides a unified approach to dealing with heterogeneous monitoring data elements. Note that the margin value of an asset is not static but changes with time, depending on asset conditions. For example, if degradation due to usage is observed from the monitoring data, the corresponding asset margin value decreases. Conversely, if a maintenance operation is performed on that same asset (e.g., restoration of centrifugal pump bearings), the asset margin value increases.
This mindset shift regarding the concept of reliability (i.e., margin based instead of probability based) offers the advantage of directly linking the asset health evaluation process with standard plant processes for managing plant performance (e.g., plant maintenance operations and budgeting processes). The transformation also supports decision-making in a form that is more familiar and readily understandable to plant system engineers and decision makers.
So far, margin has been defined for one single asset; the next step is to quantify the system's margin value after obtaining the margin values of its assets. The propagation of margin values from the asset level to the system level is performed through classical reliability models, such as FTs or reliability block diagrams (Lee, 2011), which are solved using different rule sets (Mandelli, 2023) instead of set theory-based operations.
In this respect, margin-based operators for assets in both series (OR operator) and parallel (AND operator) configurations must be defined. As an example, consider two assets ($A$ and $B$). The margins of both assets can be visualized in a 2D space, as shown in Fig. 2. Starting with brand-new assets (i.e., $M_A = M_B = 1$), the aging and degradation that affects both is represented by the blue line, which parametrically traces the combination of both margins $M_A(t)$ and $M_B(t)$ at a specific point in time $t$. Note that if no maintenance (preventive or corrective) were ever performed on either asset, this path would move from coordinates (1,1) to coordinates (0,0), where both assets would be considered failed. Hence, the coordinates (0,0) in Fig. 2 represent the event "A AND B." Similarly, when the blue line reaches the x or y axis of Fig. 2 (i.e., where either $M_A = 0$ or $M_B = 0$), either asset A or B has failed. Hence, the points in Fig. 2 characterized by either $M_A = 0$ or $M_B = 0$ represent the event "A OR B."
Now we can calculate the margin $M$ for the AND and OR events described above. This is accomplished by following the definition of margin: by measuring the distance between the actual condition of assets $A$ and $B$ and the conditions identified by the event under consideration (e.g., the occurrence of both or either event). The margin for $A$ AND $B$ can be calculated as the distance from the current point of coordinates $(M_A, M_B)$ to the point (0,0):
$M(A \text{ AND } B) = \mathrm{dist}[(M_A, M_B), (0, 0)]$ (1)
The margin for $A$ OR $B$ is the minimum distance from the current point of coordinates $(M_A, M_B)$ to the x or y axis of Fig. 2:
$M(A \text{ OR } B) = \min(\mathrm{dist}[(M_A, M_B), (M_A, 0)], \mathrm{dist}[(M_A, M_B), (0, M_B)]) = \min(M_A, M_B)$ (2)
where the function $\mathrm{dist}[., .]$ indicates the metric designed for calculating the distance between two points in a Euclidean space (e.g., if Euclidean distance is employed, $M(A \text{ AND } B) = \sqrt{M_A^2 + M_B^2}$). Mandelli (2023) provides a set of considerations regarding the choice of appropriate distance metric $\mathrm{dist}[., .]$ to be employed. In summary, Euclidean and Manhattan distance metrics represent the lower and upper bounds for $M(A \text{ AND } B)$ (i.e., $\sqrt{M_A^2 + M_B^2} \le M(A \text{ AND } B) \le M_A + M_B$). If the temporal evolution of $M_A$ and $M_B$ is available, a more precise estimate of $M(A \text{ AND } B)$ can be obtained.
Figure 2. Graphical representation of event occurrences, based on a margin framework.
Eqs. (1) and (2) allow us to propagate margin values through classical reliability models (e.g., FTs or reliability block diagrams) to quantify the system margin $M_{sys}$. The next step is to determine each asset's importance (in a margin-based reliability context). In a classical reliability setting, this is done by relying on risk-importance measures (Lee, 2011), such as the Birnbaum or Fussell-Vesely measures. Given the different nature of the margin concept, we require a reliability importance measure, here indicated as $I_a$, that captures the impact of asset margin $M_a$ on system margin $M_{sys}$. Here, we rely on a classical sensitivity measure (derivative based) for an asset $a$, defined as:
$I_a = \partial M_{sys} / \partial M_a$ (3)
Simply stated, $I_a$ indicates how a small variation of $M_a$ (e.g., improving the health of asset $a$) directly affects system margin $M_{sys}$.
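The propagation and importance steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-pump system (P1 in series with the redundant pair P2 and P3), the margin values, and all function names are hypothetical, and the derivative-based importance is approximated here by a central finite difference rather than computed analytically.

```python
import math


def margin_and(m_a, m_b, metric="euclidean"):
    """Margin of 'A AND B': distance from (m_a, m_b) to the point (0, 0)."""
    if metric == "euclidean":
        return math.hypot(m_a, m_b)
    if metric == "manhattan":
        return m_a + m_b
    raise ValueError(f"unknown metric: {metric}")


def margin_or(m_a, m_b):
    """Margin of 'A OR B': distance from (m_a, m_b) to the nearest axis."""
    return min(m_a, m_b)


def system_margin(m):
    """Hypothetical system: fails if P1 fails OR (P2 AND P3 both fail)."""
    return margin_or(m["P1"], margin_and(m["P2"], m["P3"]))


def importance(system, margins, asset, eps=1e-6):
    """Central finite-difference estimate of d(M_sys)/d(M_asset)."""
    up = dict(margins)
    up[asset] += eps
    down = dict(margins)
    down[asset] -= eps
    return (system(up) - system(down)) / (2.0 * eps)


# Hypothetical asset margins.
margins = {"P1": 0.9, "P2": 0.3, "P3": 0.4}
m_sys = system_margin(margins)
ranking = {a: importance(system_margin, margins, a) for a in margins}
```

In this example the system margin is governed by the degraded redundant pair, so P1 has zero importance and P3 ranks slightly above P2; restoring either P2 or P3 is what moves the system margin.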

INTEGRATION OF ER DATA INTO MARGIN MODELS
The definition of margin presented in Section 2 is abstract; its application within a more practical setting depends on the phenomena of interest and, especially, on the monitoring data available. This section provides more quantitative details on how margin can be quantified depending on the available ER data.

Technical Specifications Data
As indicated in Section 2, a margin value can be calculated as the distance between the actual and limiting conditions. In practical settings, limiting conditions can be represented by the technical specifications of the considered asset, which are normally provided by the manufacturer. As an example, to ensure the proper function of induction motors, oil viscosity must be below a specified limiting condition. Oil viscosity can significantly change as a function of motor rotation speed. In this context, asset margin can be calculated as the difference between the limiting condition specified in the technical specifications and the currently measured oil viscosity.
In general terms, given an upper limiting condition $x_{lim}$ for a monitored variable $x$, a margin $M$ can be defined as:
$M = (x_{lim} - x) / (x_{lim} - \min(x))$ (4)
where $\min(x)$ indicates the minimum allowable value for $x$.
As an example, induction motors are designed to operate within specified differential temperature limits. These limits indicate the maximum permissible difference between the motor temperature and environmental temperatures that various classes of insulation materials can withstand (this temperature limit can range from 80°C to 120°C, depending on the insulation material). In this scenario, $x_{lim}$ is represented by the specified temperature limit, while $x$ is the difference between the actual motor temperature and environmental temperature.
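A minimal sketch of a specification-based margin for the differential-temperature example follows. It assumes the normalized form M = (x_lim - x) / (x_lim - min(x)); the clipping to [0, 1], the function name, and the numerical values are assumptions made for illustration.

```python
def spec_margin(x, x_lim, x_min):
    """Normalized distance of x from its upper limit.

    Returns 1 when x is at its minimum allowable value and 0 when x reaches
    the limit. Clipping to [0, 1] is an added assumption.
    """
    if x_lim <= x_min:
        raise ValueError("x_lim must exceed x_min")
    m = (x_lim - x) / (x_lim - x_min)
    return min(1.0, max(0.0, m))


# Hypothetical values: 105 degC differential-temperature limit,
# 62 degC measured difference between motor and ambient temperature.
m_motor = spec_margin(x=62.0, x_lim=105.0, x_min=0.0)
```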

Observed Reliability Parameters
Currently available industrywide datasets often report the mean time to failure (MTTF) values for assets, given the past operational experience of similar assets operating under similar environmental conditions (e.g., temperature and humidity). In this context, no monitoring data are available, and only past operational experience can be used. Similar to the reasoning behind Eq. (4), based on the asset's current age $t$ (since installation or refurbishment) and estimated $MTTF$, its margin can be defined as a linear function of $t$:
$M = 1 - t / MTTF$ (5)
When the considered asset is brand new (i.e., $t = 0$), margin $M = 1$. When the same asset is approaching its estimated $MTTF$, margin becomes $M = 0$.
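The age-based margin can be sketched as follows, assuming the linear form M = 1 - t/MTTF implied by the two end points given above. Holding the margin at 0 once the age exceeds the MTTF is an added assumption, as are the function name and values.

```python
def age_margin(age, mttf):
    """Linear margin: 1 for a brand-new asset, 0 at (or beyond) the MTTF."""
    if mttf <= 0:
        raise ValueError("mttf must be positive")
    if age < 0:
        raise ValueError("age must be non-negative")
    return max(0.0, 1.0 - age / mttf)


# Hypothetical asset: 2.5 years since refurbishment, MTTF of 10 years.
m_age = age_margin(age=2.5, mttf=10.0)
```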

Condition-based Data: Healthy Data
Here, we consider a case in which the available monitoring data for the asset being considered were collected exclusively when the asset was healthy (indicated here as $\Xi_{healthy}$), meaning that data pertaining to asset degradation or failure are unavailable. $\Xi_{healthy}$ represents a collection of past observation data elements $\mathbf{x}$. The following notation is used throughout this paper: a single observation data element $\mathbf{x}$ can be composed of $N$ observed variables $x_n$ ($n = 1, \ldots, N$) (i.e., $\mathbf{x} = [x_1, \ldots, x_N]$), and the observed variables $x_n$ can be heterogeneous in nature (e.g., temperature, pressure).
In this kind of situation, an asset's health status can be established by measuring how actual monitoring data differ (distance-wise) from healthy data. In this respect, anomaly detection tools (Nassif, 2021) designed to quantify the residual between the actual observed data $\mathbf{x}_{obs}$ and the predicted data $\mathbf{x}_{pred}$ (which are computed from $\mathbf{x}_{obs}$ and $\Xi_{healthy}$) can be employed. Such tools can be based on kernel-based estimation (e.g., the auto-associative kernel regression method (Baraldi, 2015)) or on deep-learning-based methods (e.g., see (Zhang, 2019)). Under normal conditions, $\mathbf{x}_{obs}$ is very similar to $\mathbf{x}_{pred}$ (i.e., $\mathbf{x}_{obs} \cong \mathbf{x}_{pred}$), whereas $\mathbf{x}_{obs} \neq \mathbf{x}_{pred}$ indicates anomalous behavior (e.g., asset degradation).
In this context, a margin value can then be defined by measuring the difference between $\mathbf{x}_{obs}$ and $\mathbf{x}_{pred}$ through Eq. (6), where $\|\mathbf{x}_{obs} - \mathbf{x}_{pred}\|$ indicates the residual between the observed and predicted data and $h$ represents the comparison parameter between $\mathbf{x}_{obs}$ and $\mathbf{x}_{pred}$ (expressed in terms of standard deviation). When the asset is experiencing normal conditions, $\mathbf{x}_{obs} \cong \mathbf{x}_{pred}$ and $M = 1$. If the asset is experiencing abnormal conditions, the norm of the difference between $\mathbf{x}_{obs}$ and $\mathbf{x}_{pred}$ increases; consequently, $M$ drops toward 0.
Note that $\Xi_{healthy}$ is here assumed to cover all possible healthy asset conditions. If this is not the case, when $\mathbf{x}_{obs}$ enters an unforeseen healthy condition, the obtained margin value will show the asset to be unhealthy. However, once newly observed healthy conditions are recorded, they can be added to the original dataset $\Xi_{healthy}$.
An example is shown in Fig. 3, which reflects a set $\Xi_{healthy}$ of observed data elements $\mathbf{x} = [x_1, x_2]$ being collected (the green dots in the top image of Fig. 3). Actual observed data $\mathbf{x}_{obs}$ are constantly recorded, while $\mathbf{x}_{pred}$ are determined based on $\mathbf{x}_{obs}$ and $\Xi_{healthy}$ (see the black and red lines in the top image of Fig. 3), using the auto-associative kernel regression method (Baraldi, 2015). Applying Eq. (6) to this test case makes it possible to generate a temporal profile for the corresponding margin (bottom plot of Fig. 3).
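The healthy-data-only workflow can be sketched as follows. This is a simplified illustration, not the formulation used to produce Fig. 3: the prediction step is a basic auto-associative kernel regression with a Gaussian kernel, and the residual-to-margin mapping is an assumed Gaussian form (equal to 1 for a zero residual and decaying toward 0 as the residual grows relative to h). The dataset, bandwidth, and function names are hypothetical.

```python
import math


def aakr_predict(x_obs, healthy, bandwidth):
    """Predict x_obs as a Gaussian-kernel-weighted average of healthy records."""
    weights = []
    for x in healthy:
        d2 = sum((a - b) ** 2 for a, b in zip(x_obs, x))
        weights.append(math.exp(-d2 / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        # x_obs is far from every healthy record: use the nearest record.
        return list(min(healthy, key=lambda x: math.dist(x, x_obs)))
    n_vars = len(x_obs)
    return [
        sum(w * x[k] for w, x in zip(weights, healthy)) / total
        for k in range(n_vars)
    ]


def residual_margin(x_obs, x_pred, h):
    """Assumed mapping from residual to margin: exp(-0.5 * (residual / h)^2)."""
    residual = math.dist(x_obs, x_pred)
    return math.exp(-0.5 * (residual / h) ** 2)


# Hypothetical healthy records over two normalized variables.
healthy = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05]]

normal = [1.0, 1.0]
anomalous = [2.0, 2.0]
m_normal = residual_margin(normal, aakr_predict(normal, healthy, 0.2), h=0.2)
m_anomalous = residual_margin(anomalous, aakr_predict(anomalous, healthy, 0.2), h=0.2)
```

Because the prediction is always pulled back toward the healthy records, an observation that drifts away from them produces a large residual and therefore a low margin.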

Condition-based Data: Healthy and Faulty Data
This case extends the one described in Section 3.3 (in which only data generated under healthy conditions $\Xi_{healthy}$ are available) by incorporating data generated under faulty conditions, indicated here as $\Xi_{faulty}$. It is assumed that, in the presence of an asset fault, the actual observed data $\mathbf{x}_{obs}$ can be seen transitioning from $\Xi_{healthy}$ to $\Xi_{faulty}$.
In this scenario, by following the definition of margin given in Section 2 and by being provided with actual observed data (containing both historic healthy $\Xi_{healthy}$ and faulty data $\Xi_{faulty}$), a margin value can be determined by comparing the mutual distance of $\mathbf{x}_{obs}$ from the two populations: $\Xi_{healthy}$ and $\Xi_{faulty}$ (see Fig. 4). In mathematical form, a margin can be written as in Eq. (7), where the operator $d(.; .)$ represents the distance between one single data element (i.e., $\mathbf{x}_{obs}$) and a population of data elements (either $\Xi_{healthy}$ or $\Xi_{faulty}$). The choice of operator $d(.; .)$ may depend on several factors, as dictated by the distribution of the $\Xi_{healthy}$ and $\Xi_{faulty}$ populations in the data space. Note that a distance-based approach for $d(\mathbf{x}_{obs}; \Xi)$ is effective when the healthy and faulty data are well separated from each other in the $[x_1, \ldots, x_N]$ space. In practical scenarios, however, these two populations of data elements may overlap. In such cases, margin can be quantified by using density-based methods (Hastie, Tibshirani, and Friedman, 2001), which are designed to translate (e.g., via kernel density estimation methods) the $\Xi_{healthy}$ and $\Xi_{faulty}$ datasets into probability distribution functions (PDFs): $pdf_{healthy}$ and $pdf_{faulty}$. Then, given a current observed measurement $\mathbf{x}_{obs}$, margin can be quantified by evaluating these two PDFs at the coordinate $\mathbf{x}_{obs}$:
$M(\mathbf{x}_{obs}) = pdf_{healthy}(\mathbf{x}_{obs}) / (pdf_{healthy}(\mathbf{x}_{obs}) + pdf_{faulty}(\mathbf{x}_{obs}))$ (8)
This equation weighs the PDF values at coordinate $\mathbf{x}_{obs}$ for both the healthy and faulty conditions. When $\mathbf{x}_{obs}$ is located in a region of the $[x_1, \ldots, x_N]$ space dominated by healthy data, $pdf_{healthy}(\mathbf{x}_{obs}) \gg pdf_{faulty}(\mathbf{x}_{obs})$, and $M(\mathbf{x}_{obs}) \cong 1.0$. Conversely, when $\mathbf{x}_{obs}$ is located in a region of the $[x_1, \ldots, x_N]$ space dominated by faulty data, $pdf_{healthy}(\mathbf{x}_{obs}) \ll pdf_{faulty}(\mathbf{x}_{obs})$ and $M(\mathbf{x}_{obs}) \cong 0.0$. Figure 5 illustrates an example that extends the one shown in Section 3.3. In Fig. 5, $\Xi_{healthy}$ and $\Xi_{faulty}$ are shown in the top plot, $\mathbf{x}_{obs}$ is represented as the black line moving from left to right, and the corresponding margin is shown in the bottom plot. Here, $pdf_{healthy}(\mathbf{x}_{obs})$ and $pdf_{faulty}(\mathbf{x}_{obs})$ were generated using kernel density estimation methods (Hastie, Tibshirani, and Friedman, 2001).
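Both variants described above can be sketched with the standard library alone. The density-based function follows Eq. (8), using a simple Gaussian kernel density estimate with a single isotropic bandwidth. The distance-based function uses an assumed form, d_faulty / (d_healthy + d_faulty) with nearest-neighbor distances, chosen only because it equals 1 next to the healthy population and 0 next to the faulty one; it should not be read as the exact expression of Eq. (7). The datasets, the bandwidth, and the 0.5 fallback for a zero denominator are likewise assumptions.

```python
import math


def kde(x, population, bandwidth):
    """Gaussian kernel density estimate at x (isotropic bandwidth)."""
    dim = len(x)
    norm = len(population) * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim
    total = sum(
        math.exp(-0.5 * (math.dist(x, p) / bandwidth) ** 2) for p in population
    )
    return total / norm


def density_margin(x, healthy, faulty, bandwidth):
    """Eq. (8): healthy density divided by the sum of both densities."""
    p_h = kde(x, healthy, bandwidth)
    p_f = kde(x, faulty, bandwidth)
    if p_h + p_f == 0.0:
        return 0.5  # assumption: no evidence for either population
    return p_h / (p_h + p_f)


def distance_margin(x, healthy, faulty):
    """Assumed distance-based form using nearest-neighbor distances."""
    d_h = min(math.dist(x, p) for p in healthy)
    d_f = min(math.dist(x, p) for p in faulty)
    if d_h + d_f == 0.0:
        return 0.5  # assumption: x coincides with both populations
    return d_f / (d_h + d_f)


# Hypothetical, well-separated populations over two normalized variables.
healthy = [[0.0, 0.0], [0.1, 0.1], [-0.1, 0.05]]
faulty = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.05]]

m_healthy = density_margin([0.0, 0.0], healthy, faulty, bandwidth=0.2)
m_mid = density_margin([0.5, 0.5], healthy, faulty, bandwidth=0.2)
m_faulty = density_margin([1.0, 1.0], healthy, faulty, bandwidth=0.2)
```

When the two populations overlap, the density ratio no longer reaches 1 in the healthy region, which is the behavior the density-based formulation is meant to expose.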
An alternative formulation to Eq. (8) can be derived when machine learning (ML) methods (Mohri, 2012) are employed. In this setting, a supervised ML model (i.e., a classifier) is trained using both the faulty and healthy datasets ($\Xi_{faulty}$, $\Xi_{healthy}$) and is employed to predict, given $\mathbf{x}_{obs}$, the class $c$ (either faulty or healthy) to which $\mathbf{x}_{obs}$ belongs. Such a prediction can be augmented by also determining the probability estimate $p_c$ associated with the prediction $c$. If the [0,1] margin interval is divided into two equally long segments, we can assign the "healthy" class to the [0.5,1] interval and the "faulty" class to the [0,0.5] interval. Hence, the predicted class $c$ generated by the ML model determines the margin variability interval (either [0,0.5] or [0.5,1]). The variable $p_c$ (see Fig. 6) is essentially a measure of the confidence in the prediction. More precisely, a high value of $p_c$ implies a high degree of confidence in the prediction; conversely, a very low value implies low confidence. In this context, $p_c$ is used to determine the precise margin location in the [0,0.5] or [0.5,1] intervals. A high value of $p_c$ would drive the margin toward the extremes of the intervals (either 0 or 1), whereas a low value of $p_c$ would drive the margin toward the common point of the intervals (i.e., 0.5).
Consequently, provided $\mathbf{x}_{obs}$ and an ML model that can generate both $c$ and $p_c$, a margin value can be defined as in Eqs. (9) and (10). In this context, margin quantification directly employs the two generated probability values (i.e., $p_{healthy}$ and $p_{faulty}$) as:
$M(\mathbf{x}_{obs}) = p_{healthy} = 1 - p_{faulty}$ (11)
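The class-and-confidence mapping described above can be sketched as follows. The linear mapping used here (0.5 + 0.5 p_c for the healthy class, 0.5 - 0.5 p_c for the faulty class) is an assumption chosen to reproduce the stated behavior, not necessarily the exact form of Eqs. (9) and (10); only the binary case follows Eq. (11) directly. Function names and values are hypothetical.

```python
def class_margin(predicted_class, p_class):
    """Assumed linear mapping of (class, confidence) onto the margin interval.

    healthy -> [0.5, 1], faulty -> [0, 0.5]; higher confidence pushes the
    margin toward the extreme of its interval, lower confidence toward 0.5.
    """
    if not 0.0 <= p_class <= 1.0:
        raise ValueError("p_class must lie in [0, 1]")
    if predicted_class == "healthy":
        return 0.5 + 0.5 * p_class
    if predicted_class == "faulty":
        return 0.5 - 0.5 * p_class
    raise ValueError(f"unknown class: {predicted_class}")


def binary_margin(p_faulty):
    """Eq. (11): margin equals the healthy-class probability, 1 - p_faulty."""
    if not 0.0 <= p_faulty <= 1.0:
        raise ValueError("p_faulty must lie in [0, 1]")
    return 1.0 - p_faulty


# Hypothetical classifier outputs.
m_confident_healthy = class_margin("healthy", 0.9)
m_uncertain_faulty = class_margin("faulty", 0.2)
m_binary = binary_margin(0.92)
```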

Prognostic Data
Estimating an asset's remaining useful life (RUL) provides valuable information regarding the temporal occurrence of the loss of function for the considered asset. Given the stochastic nature of the failure phenomena, RUL is typically expressed in terms of a probabilistic distribution along the temporal axis. Many methods have been developed in the literature to predict RUL for specific assets, and Ferreira and Gonçalves (2022) summarize the most widely used methods. To integrate the RUL PDF (indicated here as $pdf_{RUL}$) into a margin-based reliability model, we apply reasoning similar to that presented in Section 2. Here, a margin is the distance between the actual time and predicted RUL. The main differences are that the RUL is estimated once a degradation mechanism has been identified (e.g., through an anomaly detection method) and is an actual distribution function rather than a point value.
Once the RUL PDF is estimated, the corresponding margin value can be estimated via two approaches. The first defines the margin through Eq. (12), where $CDF_{RUL}$ indicates the cumulative distribution function corresponding to $pdf_{RUL}$. The second approach, given in Eq. (13), estimates margin as the distance between the actual asset life and a point estimate of the RUL distribution (e.g., $RUL_{5\%}$, the 5th percentile of the RUL distribution $pdf_{RUL}$).
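The two RUL-based approaches can be illustrated with a sample-based sketch. Both functional forms are assumptions: the first takes the margin as 1 - CDF_RUL(t), and the second as a linear distance to the 5th percentile, clipped to [0, 1]. Here t denotes the time elapsed since the RUL prediction was made, the RUL distribution is represented by hypothetical samples, and the nearest-rank percentile rule is one of several possible conventions.

```python
import math


def empirical_cdf(t, samples):
    """Fraction of RUL samples that are less than or equal to t."""
    return sum(1 for s in samples if s <= t) / len(samples)


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def margin_from_cdf(t, samples):
    """Assumed form: 1 - CDF_RUL(t)."""
    return 1.0 - empirical_cdf(t, samples)


def margin_from_percentile(t, samples, q=5.0):
    """Assumed form: linear distance to the q-th percentile, clipped to [0, 1]."""
    t_q = percentile(samples, q)
    return min(1.0, max(0.0, 1.0 - t / t_q))


# Hypothetical RUL samples in days (20 values: 100, 110, ..., 290).
rul_samples = list(range(100, 300, 10))

m_cdf = margin_from_cdf(145, rul_samples)
m_pct = margin_from_percentile(50, rul_samples)
```

The percentile-based variant is the more conservative of the two, since it reaches 0 as soon as the elapsed time reaches the early tail of the RUL distribution.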

TEST CASE: CIRCULATING WATER SYSTEM (CWS)
To develop initial methods and models, a CWS at a Public Service Enterprise Group Nuclear, LLC-owned plant site was selected as the target plant asset. The CWS is an important non-safety-related system. As the heat sink for the main steam turbine and associated auxiliaries, the CWS is designed to maximize steam power cycle efficiency (Agarwal, 2021). A CWS consists of the following major equipment (Agarwal, 2021):
• Vertical, motor-driven circulating water pumps (CWPs), each with an associated fixed trash rack and traveling screen at the pump intake to filter out debris and marine life
• Main condenser
• Condenser waterbox air removal system
• Circulating water sampling system
• Screen wash system
• Necessary piping, valves, instrumentation, and controls to support system operation.
The selected plant site (a two-unit pressurized-water reactor) features six circulators at each unit. Schematic representations of the main condensers for Plant Site Unit 2 are shown in Fig. 7.
In this research, the project team focused on optimizing the maintenance strategy for the CWS. To differentiate between motor and pump maintenance activities for each circulator, those assets are hereafter referred to as the CWP motor and the CWP, respectively.
The Unit 1 and Unit 2 CWS process data are collected once per minute and stored in the Plant Site 1 monitoring system. Due to file size restrictions, the project team received CWS process data hourly for both units, ranging from 2009 to 2019. The process data include:
• Ambient air temperature (°F)
• CWP inlet and outlet river temperature (°F)
• CWP motor status (ON or OFF)
• CWP motor stator winding temperature (°F)
• CWP motor inboard-bearing (MIB) and outboard-bearing (MOB) temperature (°F)
• CWP motor current (amps).

Data Processing
As indicated by Agarwal (2021a), the raw data collected from the NPP are distributed over several data sources and were processed by completing the following steps:
• Text data are converted into numeric form (e.g., the ON/OFF data element is converted into 0/1)
• New features are generated (e.g., pump differential temperature [DT], pump age since refurbishment)
• Pump vibration data are processed through a fast Fourier transform algorithm, and the magnitude of the vibration signal for specific frequencies is captured
• Based on the system operational history (e.g., maintenance records), data elements are labeled (as pertaining to either a healthy or faulty state)
• Missing data entries are resolved
• Data conflicts between the operational history and recorded numerical values are resolved
• All data sources are merged into a single time series
• Variables are normalized using their mean value and standard deviation (the operators mean($x$) and std_dev($x$) correspond to the mean value and standard deviation of the considered variable $x$, respectively).
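Two of the steps listed above, text-to-numeric conversion and normalization by mean and standard deviation, can be sketched as follows. The mapping ON to 1 and OFF to 0, the use of the population standard deviation, the function names, and the sample values are all assumptions made for illustration; they are not taken from the plant data pipeline.

```python
from statistics import mean, pstdev


def encode_status(values):
    """Convert ON/OFF text entries to 1/0 (assumed mapping: ON -> 1)."""
    mapping = {"ON": 1, "OFF": 0}
    return [mapping[v] for v in values]


def standardize(values):
    """Return (x - mean(x)) / std_dev(x) for each entry of a variable."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0.0:
        return [0.0 for _ in values]  # constant signal: no spread to scale by
    return [(v - mu) / sigma for v in values]


# Hypothetical hourly records for one circulator.
motor_status = encode_status(["ON", "ON", "OFF", "ON"])
mib_temperature_z = standardize([131.0, 133.0, 135.0])
```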
A series of preprocessed time series plots is shown in Fig. 8. For this specific case, by looking at the operational history of the CWS, we were able to label portions of the data as pertaining to healthy or faulty conditions (i.e., we were in the scenario described in Section 3.4; had this not been possible, we would have relied only on the known healthy conditions, see Section 3.3). In this respect, Fig. 9 shows box plots of four of the considered features for the healthy and faulty states. These variables were chosen based on their coverage of all healthy and faulty states. Note that the structure of the box plots shifts between healthy and faulty states. This is essential for correctly capturing system health from the available monitoring data. Note that the box plots in

Margin Model for Air Intake and Misalignment
Given the provided context, both $\Xi_{faulty}$ and $\Xi_{healthy}$ data are available; hence, we employed the density-based method described in Section 3.4 to estimate $pdf_{healthy}(\mathbf{x}_{obs})$ and $pdf_{faulty}(\mathbf{x}_{obs})$. For this specific test case, we considered a subset of the original data points contained in $\Xi_{faulty}$ and $\Xi_{healthy}$ over four monitored variables (i.e., DT, motor stator temperature, MIB temperature, and MOB temperature). We considered a CWS snapshot in which an air intake instance was observed (between May 15 and July 8, 2008). Directly applying Eq. (7) to each $\mathbf{x}_{obs}$ made it possible to determine the corresponding margin value (see Fig. 10). This plot shows the initial situation, in which the system is in a healthy state ($M = 1$), before the margin rapidly plummets once the faulty condition is initiated. The two peaks that follow were generated during the repair time window.
Figure 10. Graphical representation of the margin for air intake during an air intake occurrence, using the density-based method.
An important element to highlight here is that the populations $\Xi_{faulty}$ and $\Xi_{healthy}$ for air intake are fairly well separated, as shown in Fig. 9. This allows us to assign a margin value of $M = 1$ when $\mathbf{x}_{obs}$ is located near the $\Xi_{healthy}$ population, and $M = 0$ when $\mathbf{x}_{obs}$ is located near the $\Xi_{faulty}$ population.
A similar analysis can be performed for the misalignment failure mode. In this case, however, the populations $\Xi_{faulty}$ and $\Xi_{healthy}$ are not completely separated (see Fig. 9). This situation is not uncommon and may be caused by the labeling process applied to the original data (healthy vs. misalignment). We considered a CWS snapshot in which a misalignment instance was observed (between April 2013 and January 2015). Directly applying Eq. (7) to each $\mathbf{x}_{obs}$ made it possible to determine the corresponding margin value (see Fig. 11). This plot shows that the initial situation, in which the system is in a healthy state, is actually characterized by $M = 0.8$ (instead of $M = 1$). This is caused by the fact that the distributions associated with the two populations ($\Xi_{healthy}$ and $\Xi_{faulty}$) for the misalignment failure mode share some degree of overlap (see also Fig. 9). If the distributions of these two populations did not overlap, the margin would be $M = 1$ when the system is in a healthy state.

Margin Model from ML Models
As indicated in (Agarwal, 2021a;2021b), the following two ML models were generated to perform health and fault classification: • Binary classifier: This module is a XGBoost binary classifier.With CWP data, it predicts whether the CWP is experiencing normal operation or undergoing any degradation at the pump, motor, or system levels.The outputs of these two ML models have been merged to assess the margin for each failure mode by using Eqs.( 9)-( 11), as indicated in Section 3.4.Figure 13 presents the margin associated with air intake when using ML models; the same temporal profile can be compared against the one shown in Fig. 10, in which a density-based approach was applied to the same dataset.This margin calculation was applied to a subset of observation data   showing a transition from a healthy state to an air intake faulty state.This transition is captured in a margin sense by observing how the CWS margin for air intake drops from about 0.9 (system healthy) to 0.08 (system in an air intake faulty state).
Figure 13. Graphical representation of the margin for air intake during an air intake occurrence, using ML models.
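One way to picture how the outputs of the two classifiers might be merged into a per-failure-mode margin is sketched below. Eqs. (9)-(11) are not reproduced in this section, so the combination rule used here (one minus the product of the binary classifier's degradation probability and the diagnostic model's class probability) is purely an assumption. The function name `ml_margin`, the failure-mode labels, and every probability value are likewise invented for illustration.

```python
def ml_margin(p_degraded, p_modes):
    """Return a hypothetical margin per failure mode.

    p_degraded: probability of degradation from the binary classifier.
    p_modes: dict mapping failure mode -> conditional probability from the
             diagnostic (multiclass) model; values are expected to sum to 1.
    Assumed rule (not Eqs. (9)-(11)): margin = 1 - p_degraded * p_mode.
    """
    if not 0.0 <= p_degraded <= 1.0:
        raise ValueError("p_degraded must lie in [0, 1]")
    if abs(sum(p_modes.values()) - 1.0) > 1e-9:
        raise ValueError("diagnostic probabilities must sum to 1")
    return {mode: 1.0 - p_degraded * p for mode, p in p_modes.items()}


# Invented classifier outputs for two snapshots of the same asset.
healthy_snapshot = ml_margin(0.05, {"air_intake": 0.5, "misalignment": 0.5})
faulty_snapshot = ml_margin(0.95, {"air_intake": 0.9, "misalignment": 0.1})

print(healthy_snapshot)
print(faulty_snapshot)
```

With this assumed rule, a low degradation probability keeps every margin close to 1, whereas a confident degradation call lowers only the margin of the failure mode that the diagnostic model singles out.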

CONCLUSION
This paper has described a reliability approach designed to directly employ available condition-based, diagnostic, and prognostic data. It proposed a margin-based approach for assessing asset health, which is based solely on current and historic monitoring data (e.g., condition-based, anomaly detection, diagnostic, and prognostic data). We provided details on how heterogeneous ER data elements are employed to assess the status of an asset through a margin value that serves as an analytical measure of its health. We then showed how, depending on the operational context of the asset (e.g., type of failure modes) and the available data pertaining to it, a margin value can be quantified using well-known statistical and ML algorithms.
System health is assessed by propagating, through classical reliability models (e.g., FTs or reliability block diagrams), the margin values of those assets that support system function(s). Such propagation is not performed through set theory-based rules but rather through distance-based operations. This information can then be used to assess the reliability importance of each asset in order to identify the most critical assets. A margin-based approach directly addresses the limitations of classical reliability modeling approaches and provides a snapshot of system health, given the availability of monitoring data. These two approaches are designed to address different types of decisions: classical reliability models support static decisions (e.g., a set frequency of periodic maintenance or surveillance operations) based on past operational experience, whereas a margin-based approach directly supports dynamic decisions involving maintenance operations that should only be performed when necessary, based on monitoring data (i.e., a predictive maintenance context). Note that the application of these two decision types (static and dynamic) is dictated by the degradation process being considered. When asset failure occurs suddenly, or the monitoring system cannot capture asset degradation, classical reliability approaches can be used to set preventive maintenance and periodic surveillance frequencies. On the other hand, when assets progressively degrade and the installed monitoring system is able to capture the degradation trend, a predictive maintenance context that relies on a margin-based approach can be set.
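The idea of propagating margins through a fault tree with distance-based operations can be sketched as follows. This section does not state the exact gate operations, so the ones below are assumptions chosen for illustration: the minimum input margin for an OR gate, and the normalized Euclidean distance from the all-failed point for an AND gate. The function names, the small tree, and the margin values are all hypothetical.

```python
import math


def or_gate_margin(margins):
    """OR gate: any single input failure fails the gate, so the smallest margin governs."""
    return min(margins)


def and_gate_margin(margins):
    """AND gate: all inputs must fail, so measure distance from the all-failed point.

    The all-failed point is (0, ..., 0). Dividing by n inside the square root
    keeps the result in [0, 1] when every input margin is in [0, 1].
    """
    return math.sqrt(sum(m * m for m in margins) / len(margins))


# Invented margins: two redundant pumps (AND gate) in series with one valve (OR gate on top).
pump_a, pump_b, valve = 0.9, 0.2, 0.6
pumps = and_gate_margin([pump_a, pump_b])
system = or_gate_margin([pumps, valve])

print(f"redundant pump pair margin: {pumps:.3f}")
print(f"system margin:              {system:.3f}")
```

In this sketch the healthy pump partially compensates for the degraded one inside the redundant pair, so the single valve ends up governing the system margin. Different gate operations would change the numbers but not the general pattern of asset-to-system propagation.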
An analysis of the CWS of an existing power plant generated insights into the structure and operational context of real data. The statistical and ML methods developed in this work were employed to assess system health via margin-based operations.
A margin-based interpretation of reliability shifts the focus of the concept away from the probability of occurrence and toward assessing how far an asset is from (or how close it is to) reaching an unacceptable level of performance or undergoing failure. This shift in focus provides a direct link between the asset and system health evaluation process and standard plant processes for managing performance (e.g., plant maintenance and budgeting processes). It also supports decision-making in a predictive maintenance context in a form that is more familiar and readily understandable to plant system engineers and decision makers (Xingang, 2021).

Figure 1. Graphical representation of margin, based on actual asset-monitoring data.

Figure 4. Margin calculation, given the current status of the monitored asset when both healthy (Ξ_healthy) and faulty (Ξ_faulty) data are available in the multidimensional data space of the monitored variables.

Figure 6. Graphical representation of margin based on the outputs provided by an ML model.

Figure 7. Plant Site Unit 2 CWP combination of 21A and 21B, with sensors and instrumentation.

Figure 8. Plot of five features of the preprocessed time series. Note that online motor current data are available from 2017, whereas process variables are available from 2009.

Fig. 9 enables comparison between healthy and faulty states by looking at the distribution of one individual feature at a time.

Figure 9. Box plots of four of the considered features (DT, motor stator temperature, MIB temperature, and MOB temperature) for healthy and failure states.

Figure 11. Graphical representation of the margin for misalignment during a misalignment occurrence.

• Binary classifier: The model is developed by considering time domain features extracted from vibration data, along with the features extracted from monitoring data. Features such as motor current and vibration data are unavailable prior to September 2017 and October 2019, respectively. The missing features are mapped to NaN values. While training and making predictions, the XGBoost model discards all features with NaN values.
• Diagnostic model: This module is a multiclass classifier. For CWP data, it predicts the type of fault a CWP is currently undergoing. The model is developed by considering frequency domain features extracted from vibration data, along with the features extracted from the raw data. Features such as motor current and vibration data are unavailable prior to September 2017 and October 2019, respectively. The missing features are mapped to NaN values. While training and making predictions, the XGBoost model discards all features with NaN values. Figure 12 shows an example prediction generated by the diagnostic model.
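The NaN mapping described for both models can be sketched in a few lines. The availability months come from the text, but the day of the month is not stated, so the first day is assumed. The feature names, the helper `mask_unavailable`, and the sample record are hypothetical. The sketch covers only the mapping step and says nothing about how the classifier treats the NaN entries afterwards.

```python
import math
from datetime import date

# Availability start dates taken from the text; the day of the month is not
# stated, so the first day is assumed here.
FEATURE_AVAILABLE_FROM = {
    "motor_current": date(2017, 9, 1),
    "vibration": date(2019, 10, 1),
}


def mask_unavailable(timestamp, features):
    """Return a copy of `features` with not-yet-available features set to NaN."""
    masked = dict(features)
    for name, start in FEATURE_AVAILABLE_FROM.items():
        if timestamp < start:
            masked[name] = math.nan
    return masked


# Invented record from 2015: process data exist, motor current and vibration do not.
early_row = mask_unavailable(date(2015, 6, 1), {"discharge_temp": 31.2})

# Invented record from 2018: motor current is available, vibration is not yet.
later_row = mask_unavailable(date(2018, 1, 1), {"motor_current": 12.0})

print(early_row)
print(later_row)
```

Keeping the availability dates in a single lookup table means that every record, whatever its timestamp, ends up with the same set of feature keys, which is convenient when the records are later assembled into a training matrix.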

Figure 12. Example prediction by the diagnostic model.