Empirical Evaluation of Diagnostic Algorithm Performance Using a Generic Framework

A variety of rule-based, model-based and data-driven techniques have been proposed for detection and isolation of faults in physical systems. However, there have been few efforts to comparatively analyze the performance of these approaches on the same system under identical conditions. One reason for this is the lack of a standard framework to perform such a comparison. In this paper we introduce a framework, called DXF, that provides a common language to represent the system description, sensor data and fault diagnosis results; a run-time architecture to execute the diagnosis algorithms under identical conditions and collect the diagnosis results; and an evaluation component that can compute performance metrics from the diagnosis results to compare the algorithms. We have used DXF to perform an empirical evaluation of 13 diagnostic algorithms on a hardware testbed (ADAPT) at NASA Ames Research Center and on a set of synthetic circuits typically used as benchmarks in the model-based diagnosis community. Based on these empirical data we analyze the performance of each algorithm and suggest directions for future development.


INTRODUCTION
Fault diagnosis in physical systems involves the detection of anomalous system behavior and the identification of its cause. Some key steps in diagnostic inference are fault detection (is the output of the system incorrect?), fault isolation (what is broken in the system?), fault identification (what is the magnitude of the failure?), and fault recovery (how can the system continue to operate in the presence of the faults?). Developing diagnostic inference algorithms requires expert knowledge and prior know-how about the system, models describing the behavior of the system, and operational sensor data. This problem is challenging for a variety of reasons, including:
• incorrect and/or insufficient knowledge about system behavior
• limited observability
• presence of many types of faults (such as system, supervisor, actuator, or sensor faults; additive and multiplicative faults; abrupt and incipient faults; persistent and intermittent faults; etc.)
• non-local and delayed effects of faults due to the dynamic nature of the system
• presence of other phenomena that influence/mask the symptoms of faults (unknown inputs acting on the system, noise that affects the output of sensors, etc.)

This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Submitted 2/2010; published 7/2010.
Several communities have attempted to solve the diagnostic inference problem using various methods. Some approaches have been:
• Expert Systems - These approaches encode knowledge about system behavior into a form that can be used for inference. Some examples are rule-based systems (Russell & Norvig, 2003) and fault trees (Kavčič & Juričić, 1997).
• Model-Based Methods - These approaches use an explicit model of the system configuration and behavior to guide the diagnostic inference. Some examples are Fault Detection and Isolation (FDI) methods (Gertler, 1998), statistical methods (Basseville & Nikiforov, 1993), and "Artificial Intelligence (AI)" methods (Reiter, 1987).
• Data-Driven Methods - These approaches use the data from representative runs to learn parameters that can then be used for anomaly detection or diagnostic inference in future runs. Some examples are the Inductive Monitoring System (IMS) (Iverson, 2004) and Neural Networks (Sorsa & Koivo, 1998).
• Stochastic Methods - These approaches treat diagnosis as a belief state estimation problem. Some examples are Bayesian Networks (Lerner, Parr, Koleer, & Biswas, 2000) and Particle Filters (de Freitas, 2002).

Despite the development of such a variety of notations, techniques, and algorithms, efforts to evaluate and compare diagnostic algorithms (DAs) have been minimal. One of the major deterrents is the lack of a common framework for evaluating and comparing diagnostic algorithms. The establishment of such a framework would accomplish the following objectives:
• Accelerate research in theories, principles, modeling and computational techniques for diagnosis of physical systems.
• Encourage the development of software platforms that promise more rapid, accessible, and effective maturation of diagnostic technologies.
• Provide a forum for algorithm developers to test and validate their technologies.
• Systematically evaluate diagnostic technologies by producing comparable performance assessments.

Such a framework requires the following:
• A standard representation format for the system description, sensor data, and diagnosis result.
• A software run-time architecture that can run specific scenarios from an actual system, a simulation, or other data sources such as files (individually or as a batch), execute DAs, send scenario data to the DA at appropriate time steps, and archive the diagnostic results from the DA.
• A set of metrics to be computed based on the comparison of the actual scenario and the diagnosis results from the DA.

In this paper, we present a framework that attempts to address each of the above issues. The framework architecture employed for evaluating the performance of DAs is shown in Fig. 1 and is called DXF. Its major elements are systems under diagnosis, DAs, scenario-based experiments, and metrics. System catalogs specify topology, components, and high-level mode behavior descriptions, including failure modes. DXF provides a program for quantitatively evaluating the DA output against known fault injections using predefined metrics.

International Journal of Prognostics and Health Management, ISSN 2153-2648, 2010 002
The current version of DXF and this paper address a class of abrupt failures such as the ones often observed in electrical power systems. Other types of failures, for example intermittent or continuous ones, are left for future work.

Contributions
The contributions of this paper are as follows:
• It introduces a benchmarking framework to be used for systematic empirical evaluation of diagnostic algorithm performance. Moreover, it defines and describes the main elements of the framework so that the benchmarking approach can be applied to any arbitrary physical or synthetic system by using the architecture described in the paper.
• It provides a comprehensive set of empirical evaluation results in order to validate the proposed framework and to facilitate the understanding and comparative analysis of different diagnostic technologies.

Organization of the Paper
The rest of this paper is organized as follows. Section 2 contains related work. Section 3 presents DXF in detail, including the representation languages used, the run-time architecture developed for experimentation, and the diagnostic performance metrics defined. Section 4 describes how the benchmarking was performed, including a description of the two systems used, the faults injected, the DAs tested, and the results. Section 5 presents major assumptions made and issues observed. Finally, Section 6 presents the conclusions.

RELATED WORK
The development of monitoring and diagnostic technologies is of great interest to many applications. As these algorithms become more readily available, assessing the performance of alternative diagnostic tools becomes important. As a result, there is an increasing need for a framework to evaluate competing diagnostic technologies.
To address this need, several researchers have attempted to demonstrate benchmarking capability (Orsagh, Roemer, Savage, & Lebold, 2002; Roemer, Dzakowic, Orsagh, Byington, & Vachtsevanos, 2005; Bartyś, Patton, Syfert, de las Heras, & Quevedo, 2006). Among these, Bartyś et al. (2006) presented a benchmarking study for actuator fault detection and identification (FDI). This study, developed by the DAMADICS Research Training Network, introduced a set of 18 performance indices used for benchmarking FDI algorithms on an industrial valve-actuator system. The indices measure the temporal performance of detection and isolation decisions, as well as true and false detection and isolation rates, sensitivity, and diagnostic accuracy. This benchmark study uses real process data, and demonstrates how the performance indices can be calculated for 19 actuator faults using a single fault assumption. Izadi-Zamanabadi and Blanke (1999) presented a ship propulsion system as a benchmark for autonomous fault control. This benchmark has two main elements. One is the development of an FDI algorithm, and the other is the analysis and implementation of autonomous fault accommodation.
Relevant to the aerospace industry, Simon, Bird, Davison, Volponi, and Iverson (2008) introduced a benchmarking technique for gas path diagnosis methods to assess the performance of engine health management technologies.
Finally, Orsagh et al. (2002) provided a method to measure the performance and effectiveness of prognostics and health management algorithms for US Navy applications (Roemer et al., 2005). In this work, the performance metrics are defined separately for detection, isolation, and prognosis. In addition, this work combined individual metrics into a composite score by implementing a weighted average sum. Moreover, it defined effectiveness metrics as a separate category that can be used to incorporate non-technical aspects such as operation, maintenance and implementation costs, computer resource requirements, and algorithm complexity into the analysis. Using these metrics, one can assess the overall effectiveness and benefit of diagnostic health management systems.
Other researchers have also proposed similar cost-benefit formulations for diagnostic systems (Williams, 2006; Kurien & Moreno, 2008; Hoyle, Mehr, Tumer, & Chen, 2007). These approaches, however, are primarily concerned with higher-level trade-offs in integrating diagnostic solutions to provide health management functionality, and focus on performance indices such as operational cost and maintainability.
The DXF framework presented in this paper adopts some of its metrics from Kurtoglu, Narasimhan, Poll, Garcia, Kuhn, de Kleer, van Gemund, and Feldman (2009) and extends prior work in this area by defining a number of novel diagnostic performance metrics; by providing a generic, application independent architecture that can be used for evaluating different monitoring and diagnostic algorithms; and by facilitating the use of real process data on a large-scale, complex engineering system.

FRAMEWORK
We have developed a framework called DXF that allows systematic comparison and evaluation of diagnostic algorithms under identical experimental conditions.The key components of this framework include representation languages for the physical system description, sensor data and diagnosis results, a runtime architecture for executing diagnostic algorithms and diagnostic scenarios, and an evaluation component that computes performance metrics based on the results from diagnostic algorithm execution.
The process to set up the framework in order to perform comparison/evaluation of a selected set of diagnostic algorithms on a specific physical system is as follows:
1. The system is specified in an XML file called the System Catalog. The catalog includes the system's components, connections, the components' operating modes, and a textual description of component behavior in each mode.
2. The set of sensor points is chosen and sample data for nominal and fault scenarios are generated.
3. DA developers use the system catalog and sample data to create their algorithms using a predefined Application Programming Interface (API) in order to receive sensor data and send the diagnosis results. The DXF API is described later in this section.
4. A set of test scenarios (nominal and faulty) is selected to evaluate the DAs.
5. The run-time architecture is used to execute the DAs on the selected test scenarios in a controlled experiment setting, and the diagnosis results are archived.
6. Selected metrics are computed by comparing actual scenarios and diagnosis results from DAs. These metrics are then used to compute secondary metrics.

In the following subsections we describe the constituent pieces of our framework in more detail. The next subsection describes the various representation languages defined for the framework. We then describe the run-time architecture, including the sequence of events and the messages exchanged among the various components, and finally we describe a set of representative metrics that measure diagnostic performance.
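The execution loop in steps 4-6 can be sketched as follows. All class and method names here are illustrative assumptions, not the actual DXF API; the stand-in DA simply thresholds sensor values.

```python
# Sketch of the DXF experiment loop described in steps 1-6 above.
# MockDA and run_scenario are hypothetical names, not part of DXF.

class MockDA:
    """A stand-in diagnostic algorithm that flags any sensor above a threshold."""
    def __init__(self, threshold=10.0):
        self.threshold = threshold

    def step(self, sensor_data):
        faulty = [s for s, v in sensor_data.items() if v > self.threshold]
        return {"detection": bool(faulty),
                "candidates": [set(faulty)] if faulty else []}

def run_scenario(da, scenario):
    """Feed time-stamped sensor frames to the DA and archive its outputs."""
    results = []
    for timestamp, frame in scenario:
        diagnosis = da.step(frame)
        results.append((timestamp, diagnosis))
    return results

scenario = [
    (0, {"voltage": 5.0, "current": 1.0}),     # nominal frame
    (500, {"voltage": 12.5, "current": 1.0}),  # injected overvoltage
]
archive = run_scenario(MockDA(), scenario)
```

The archived `(timestamp, diagnosis)` pairs are what the evaluation component would later compare against the known fault injections.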

DXF Data Structures
In what follows we describe the syntax and semantics of the relevant DXF data structures as well as some design rationale.

System Description
We realize that it is impossible to avoid bias towards certain diagnostic algorithms and methodologies when providing system descriptions. Despite attempts to create a general modeling language, there is no widely agreed way to represent models and systems. On the other hand, designing a diagnostic framework which is fully agnostic towards the system description is impossible, as there would be no way to communicate components or system parts and to compute diagnostic metrics. As a compromise, we have chosen a minimalistic approach, providing formal descriptions of the system topology and component modes only.
The formal part of the DXF system description does not provide all information for building a model. The user may be provided with non-formalized external information, e.g., nominal and faulty functionality of components. This information may be provided in textual, programmatic or any other well-understood format. In the future we may try to extend our XML schema in yet another attempt at providing a complete modeling language beyond interconnection topology.
The XML system description is primarily intended to provide a common set of identifiers for components and their modes of operation within a given system. This is necessary to communicate sensor data and diagnoses. Additionally, basic structural information is provided in the form of component connections. Behavioral information is limited to a brief textual description of each component and its modes, leaving DA developers to deduce behavior from the system's sample data. This is done to avoid bias towards any diagnostic approach.

System Topology: DXF uses a graph-like representation to specify the physical connectivity of the system, where nodes represent components of a system and arcs capture the connectivity between components. Each component mode is specified by:
• a name (identifier)
• an optional (textual) description
• a flag specifying if the mode is nominal or faulty

The details of the system description formats are provided in Appendix B.
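To make the minimalistic catalog concrete, the sketch below parses a hypothetical catalog fragment with components, modes, a nominal/faulty flag, and a connection. The element and attribute names are assumptions in the spirit of the description above; the actual DXF schema (Appendix B) may differ.

```python
# Parsing a hypothetical, minimal system-catalog fragment.
import xml.etree.ElementTree as ET

catalog_xml = """
<system name="toy-circuit">
  <component id="relay1" description="DC relay">
    <mode name="closed" nominal="true"/>
    <mode name="open" nominal="true"/>
    <mode name="stuck-open" nominal="false"/>
  </component>
  <connection from="relay1" to="load1"/>
</system>
"""

root = ET.fromstring(catalog_xml)

# Collect (component, mode) pairs flagged as faulty.
fault_modes = [
    (comp.get("id"), mode.get("name"))
    for comp in root.findall("component")
    for mode in comp.findall("mode")
    if mode.get("nominal") == "false"
]
```

Such identifiers are exactly what a DA would echo back in its diagnosis messages.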

API Data Types
In DXF, the run-time communication is performed using a messaging framework.Messages are exchanged as ASCII text over TCP/IP.API calls for parsing, sending, and receiving messages are provided with the framework, but developers may choose to send and receive messages directly through the underlying TCP/IP interface.This allows developers to use their programming language of choice, rather than being forced into the languages of the provided APIs.
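A minimal sketch of this style of exchange is shown below, using a local socket pair to stand in for the TCP connection. The newline-delimited message layout is purely illustrative, not the actual DXF wire format (see Appendix C).

```python
# Exchanging one ASCII message over a TCP-like byte stream, in the spirit of
# the DXF messaging layer. The message format shown is an assumption.
import socket

da_side, framework_side = socket.socketpair()  # stands in for a TCP connection

# The framework sends a newline-terminated ASCII sensor-data message.
framework_side.sendall(b"sensorData 1500 voltage=12.5\n")

# The DA reads bytes until the newline delimiter, then parses the fields.
buf = b""
while not buf.endswith(b"\n"):
    buf += da_side.recv(1)
msg_type, timestamp, payload = buf.decode("ascii").strip().split(" ", 2)
```

Because the protocol is plain ASCII over TCP/IP, a DA written in any language with a socket library can participate without using the provided APIs.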
Every message contains a millisecond timestamp indicating the time at which the message was sent. Though there are additional message types, the most important messages for the purpose of performance evaluation are the sensor data message, the command message, and the diagnosis message (the details of the messaging formats are provided in Appendix C). The timeline of a diagnostic scenario is shown in Fig. 3, where the fault injections, detection, and isolation are all treated as signals. These signals define a number of time points and intervals, as is seen below.
At the beginning of each scenario, a DA is given some startup time to initialize, read data, etc. Even though sensor observations may be available during startup, fault injections are not allowed during this interval. Fault injection and diagnosis take place during the diagnosis interval. Finally, a DA is given some shutdown time to terminate before being killed.
Table 1 summarizes the data collected by the SR for each scenario. These data are used for computing the various metrics discussed in Sec. 3.3. The time of first detection t_d is derived from the detection signal. The set Ω = {ω_1, ω_2, ..., ω_n} contains all diagnoses computed by the DA at time t_i. If a DA never asserts the isolation signal I (i.e., t_i = ∞), it is assumed that Ω = ∅. Each candidate in Ω is accompanied by a weight W. We denote the set of weights of all diagnoses in Ω as W(Ω); the weights are normalized by dividing each weight W(ω) by the sum of all weights. If a DA fails to provide W, it is assumed that all diagnoses are of the same weight.
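The weight-normalization rule above, including the equal-weight fallback when a DA provides no weights, can be sketched as follows (an illustration, not the DXF evaluator code):

```python
# Normalize candidate weights: each W(ω) is divided by the sum of all
# weights; if the DA supplied no weights, all candidates are equally likely.

def normalize_weights(candidates, weights=None):
    if not candidates:                       # Ω = ∅: nothing to normalize
        return {}
    if weights is None:                      # DA failed to provide W
        weights = [1.0] * len(candidates)
    total = sum(weights)
    return {omega: w / total for omega, w in zip(candidates, weights)}

# Two candidate diagnoses (sets of suspected components) with raw weights.
omegas = [frozenset({"relay1"}), frozenset({"relay1", "inverter2"})]
normalized = normalize_weights(omegas, [3.0, 1.0])
uniform = normalize_weights(omegas)          # no weights given
```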
In addition to the time-points defined in Table 1, the isolation signal in Fig. 3 shows the time t_ffi at which the DA has isolated a fault for the first time, and the time t_fir at which the DA has retracted its isolation assumption (for example, because more faults are expected). Note that t_ffi and t_fir are not currently used by the evaluator for computing the metrics.

Diagnostic Performance Metrics
The metrics for evaluating diagnostic algorithm performance depend on the particular use of the diagnostic system, the users involved, and their objectives.
In Orsagh et al. (2002), the performance metrics are defined separately for detection and isolation. For detection, the metrics include thresholds, accuracy, reliability, sensitivity to load, speed, or noise, and stability. The isolation metrics include the detection metrics, but also include measures for discrimination and repeatability.
In this paper, our goal has been to define a number of metrics and to give guidelines for their use. For DXF, we make a distinction between detection, isolation, and computational performance, and highlight metrics for each category. In general, several other classes of metrics are possible, including cost/utility metrics, effort metrics (for building systems, for example), and other categories such as fault identification and fault recovery metrics. The expectation is that as DXF evolves, a comprehensive list of desired metric classes and categories will be developed to aid framework users in choosing the performance criteria they want to measure. For the first implementation of DXF, we defined 10 metrics, which are summarized in Table 2. These metrics are based on an extensive survey of the literature and discussions with experts from various fields (Kurtoglu, Mengshoel, & Poll, 2008). These metrics are defined next.

Detection Metrics
The distinction between detection and isolation has practical importance. A DA may announce a fault detection before it knows the root cause of the failure (for example, a detection announcement can be based solely on surpassing sensor threshold values). A detection signal cannot be retracted by a DA, while it is legal to retract an isolation announcement when more faults are expected. The detection metrics include:

Fault Detection Time: The fault detection time (the reaction time for a diagnostic engine to detect an anomaly) is directly measured as:

M_fd = t_d − t⋆

where t⋆ is the fault injection time and t_d is the time of first detection. The fault detection time is reported in milliseconds and is computed only for non-nominal scenarios for which a DA asserts the detection signal at least once.
False Negative Scenario: The false negative scenario metric measures whether a fault is missed by a diagnostic algorithm and is defined as:

M_fn = 1 if a fault is injected (t⋆ < ∞) but never detected (t_d = ∞), and M_fn = 0 otherwise.

False Positive Scenario: The false positive scenario metric penalizes DAs which announce spurious faults and is defined as:

M_fp = 1 if t_d < t⋆, and M_fp = 0 otherwise,

where t⋆ = ∞ for nominal scenarios (i.e., scenarios during which no fault is injected). Note that the above two metrics (M_fn and M_fp) are computed for each scenario and their computation is based on the times of injecting and announcing the fault. We also have false negative and false positive components in the context of individual diagnostic candidates (recall that a DA sends a set of diagnostic candidates at isolation time), which we will discuss later in this paper.

Scenario Detection Accuracy
The scenario detection accuracy metric is computed from M_fn and M_fp:

M_da = (1 − M_fn)(1 − M_fp)

M_da is 1 if the scenario is true positive or true negative and 0 otherwise (equivalently, M_da = 0 if M_fn = 1 or M_fp = 1, and M_da = 1 otherwise).
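The per-scenario detection metrics can be computed together, as in the sketch below. It follows the definitions above, with t⋆ = ∞ for nominal scenarios and t_d = ∞ when detection is never asserted; this is an illustration consistent with the text, not the DXF evaluator itself.

```python
# Per-scenario detection metrics M_fn, M_fp, M_da, and M_fd.
INF = float("inf")

def detection_metrics(t_star, t_d):
    """t_star: fault injection time (inf if nominal); t_d: first detection time."""
    m_fn = 1 if (t_star < INF and t_d == INF) else 0   # fault injected, never detected
    m_fp = 1 if t_d < t_star else 0                    # detection before (or without) a fault
    m_da = 0 if (m_fn or m_fp) else 1                  # scenario detection accuracy
    # Detection time only defined for a true-positive detection.
    m_fd = (t_d - t_star) if (t_star < INF and t_star <= t_d < INF) else None
    return m_fn, m_fp, m_da, m_fd

# Fault injected at 2000 ms, detected at 2400 ms: true positive, 400 ms delay.
tp = detection_metrics(2000, 2400)
```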
M_da splits all scenarios into "true" and "false". Incorrect scenarios are further classified into false positive (M_fp) and false negative (M_fn). Correct scenarios are true positive if there are injected faults and true negative otherwise (the latter separation into true positives and true negatives is rarely of practical importance).

Isolation Metrics
Computation of isolation metrics is more involved due to the fact that an isolation can be retracted. Furthermore, an isolation event contains a set of diagnostic candidates, and we need metrics that compare this set of candidates to the injected fault. Accordingly, we have defined several metrics which are computed from the set of diagnostic candidates Ω and the injected fault ω⋆ (classification errors and utility metrics). Consider a single diagnostic candidate ω ∈ Ω. Both the candidate ω and the injected fault ω⋆ are sets of components.
The intersection of those two sets is the set of properly diagnosed components. The false positives are the components that have been considered faulty but are not actually faulty. The false negatives are the components that have been considered healthy but are actually faulty. Figure 4 shows how ω and ω⋆ partition all components into four sets.
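This partition is just set arithmetic, as the sketch below shows for a toy component set (illustrative names, not from the paper):

```python
# The four component sets induced by a candidate ω and the injected fault ω⋆
# (cf. Fig. 4): true positives, false positives, false negatives, true negatives.

def partition(components, omega, omega_star):
    tp = omega & omega_star               # correctly diagnosed as faulty
    fp = omega - omega_star               # called faulty, actually healthy
    fn = omega_star - omega               # called healthy, actually faulty
    tn = components - omega - omega_star  # correctly left as healthy
    return tp, fp, fn, tn

components = {"battery", "relay", "inverter", "load"}
tp, fp, fn, tn = partition(components,
                           omega={"relay", "inverter"},      # DA's candidate
                           omega_star={"relay", "load"})     # injected fault
```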
False positives and false negatives in this context relate to individual candidates, i.e., misclassified components in a single diagnostic candidate.There are also scenario-based false negative and false positive metrics (defined earlier in this section), which summarize whole scenarios and are not to be confused with the false positives and false negatives in the context of isolation metrics.
For brevity we use the notation in Table 3 for the Fig. 4 sets.

[Table 3: Notation for the sizes of some frequently used sets, e.g., the set of faulty components from the viewpoint of the DA.]

Based on the representation given in Figure 4, the meaning of false positives and false negatives can be interpreted differently depending on what the diagnosis results are supporting (abort decisions, ground support, fault-adaptive control, etc.). Researchers have proposed different methods to assess the meaning of isolation accuracy and its practical and economic implications.
DePold et al. (2004) introduced metrics based on receiver operating characteristic (ROC) analysis (Metz, 1978), which illustrates the trade-off space between the probability of false alarm and the probability of detection for different signal-to-noise ratio (SNR) levels. The method is used to test the relative accuracy of diagnostic systems based on different threshold settings. Later, they also proposed a combined metric (DePold et al., 2006) that accounts for consequential event costs including missed detections, false alarms, and misdiagnoses. Another widely used metric for isolation accuracy is the Kappa Coefficient (Committee E-32, 2008). It is based on the construction of a confusion matrix that summarizes diagnostic results produced by a reasoner over a number of test/use cases. In essence, the Kappa Coefficient measures the ability of an algorithm to discriminate among many fault candidates. In this paper, we take a simplistic approach and assume that false positives and false negatives have an equal cost for the diagnostic task and operations. The isolation metrics are described below (for a detailed discussion and derivation of the isolation metrics, see Appendix A).

Fault Isolation Time: From the isolation signal, we construct for each component a sequence of isolation times, containing the timestamps of the rising edges of the isolation signal. The fault isolation time is then computed from this sequence. If there is no isolation for a specific fault (i.e., the fault is missed), no fault isolation time is reported. The fault isolation time is reported in milliseconds and is computed only for non-nominal scenarios for which a DA asserts the isolation signal at least once.

Classification Error
The classification error metric is defined in terms of the number of misclassified components. In Eq. (7), ω ⊖ ω⋆ denotes the symmetric difference of the ω and ω⋆ sets, i.e., the set of misclassified components. Note that |ω ⊖ ω⋆| is the sum of the false positive and false negative component counts (cf. Table 3).
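A per-candidate classification error based on |ω ⊖ ω⋆|, averaged over candidates by their normalized weights, can be sketched as follows. The exact normalization in Eq. (7) may differ; this sketch uses the raw misclassification count.

```python
# Classification error via the symmetric difference ω ⊖ ω⋆.

def m_err_candidate(omega, omega_star):
    """Number of misclassified components: false positives plus false negatives."""
    return len(omega ^ omega_star)   # ^ is symmetric difference on Python sets

def m_err_scenario(weighted_candidates, omega_star):
    """Weighted average of per-candidate errors over the candidate set Ω."""
    return sum(w * m_err_candidate(c, omega_star)
               for c, w in weighted_candidates.items())

# One exact candidate and one with a spurious component, equally weighted.
err = m_err_scenario({frozenset({"relay"}): 0.5,
                      frozenset({"relay", "load"}): 0.5},
                     omega_star=frozenset({"relay"}))
```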

Utility
The utility metric measures the work required to correctly identify all false negatives and false positives in a diagnostic candidate. Alternatively, the utility metric measures the expected number of calls to a testing oracle that always correctly determines the health state of a component. Note that this metric assumes an equal cost for fixing a false negative and a false positive. The derivation of the utility metric is given in Appendix A. Computing a weighted average of the per-candidate utility m_utl gives us the "per scenario" utility metric M_utl. The utility metric is, in fact, a combination of two "half-utilities": the system repair utility and the diagnosis repair utility. The latter are defined as secondary metrics in Sec. 3.3.4 and discussed in detail in Appendix A.
Note that for Ω = ∅, the framework automatically assumes a single "all-healthy" diagnostic candidate with weight 1 at the time of isolation. This affects the M_err and M_utl metrics in, for example, a non-nominal false-negative scenario.

Consistency: The next metric comes from MBD (de Kleer, Mackworth, & Reiter, 1992). It only applies to systems for which (1) there is a formally defined system description (model), (2) one can derive a formally defined observation from the sensor data, and (3) the notion of consistency is formally defined. We compute the consistency metric for the synthetic models and scenarios.
Consider a model SD and an observation α (α is derived from the sensor data at time t⋆). If SD and α can be expressed as sentences in propositional logic (as is the case with the synthetic models and scenarios), then the set of consistent diagnoses, denoted Ω⊤, is the subset of diagnoses in Ω that are consistent with SD ∧ α. The set Ω⊤ can be computed from SD, α, and Ω by using a DPLL solver (Davis, Logemann, & Loveland, 1962). The consistency metric M_sat can then be computed from Ω⊤, W, and the injected fault ω⋆: M_sat is a measure of how much probability mass a DA associates with diagnoses consistent with the observations.
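For a toy model, the consistency check can be done by brute-force enumeration of truth assignments instead of a DPLL solver. The single-inverter model below is illustrative only; DXF uses the synthetic circuit models.

```python
# Brute-force consistency metric on a tiny propositional model.
from itertools import product

def consistent(model, diagnosis, obs, variables):
    """True iff SD ∧ α is satisfiable with the candidate's health modes fixed."""
    for assignment in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, assignment))
        env.update(diagnosis)            # fix health variables per the candidate
        if model(env) and obs(env):
            return True
    return False

# Toy model SD: an inverter; healthy ("ok") implies out == not in.
model = lambda e: (not e["ok"]) or (e["out"] == (not e["in"]))
# Observation α: both input and output are high (contradicts healthy behavior).
obs = lambda e: e["in"] and e["out"]

# M_sat: probability mass on diagnoses consistent with SD ∧ α.
m_sat = sum(weight
            for diagnosis, weight in [({"ok": True}, 0.3), ({"ok": False}, 0.7)]
            if consistent(model, diagnosis, obs, ["in", "out"]))
```

Here the healthy diagnosis contradicts the observation, so only the faulty candidate's weight counts toward M_sat.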

Computational Metrics

CPU Load
The CPU load during an experiment is computed as the CPU time spent by the DA during startup (C_s) plus the CPU time spent at each subsequent time step (the vector C). The CPU load is reported in milliseconds.

Memory Load
The memory load is computed from M, a vector with the maximum memory size allocated at each step of the diagnostic session. The memory load is reported in KB.
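A plausible aggregation of these quantities is sketched below: total CPU time (startup plus per-step) and peak memory over the session. The exact aggregation used by the DXF equations may differ; this is an assumption consistent with the definitions above.

```python
# Computational metrics from per-step measurement vectors.

def cpu_load(c_startup, c_steps):
    """Startup CPU time C_s plus the per-step CPU times in C (milliseconds)."""
    return c_startup + sum(c_steps)

def memory_load(m_steps):
    """Peak of the per-step maximum allocated memory sizes M (kilobytes)."""
    return max(m_steps)

m_cpu = cpu_load(120, [5, 7, 6, 9])        # 120 ms startup + 27 ms of steps
m_mem = memory_load([1024, 2048, 1536])    # peak allocation over the session
```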

Secondary Metrics
The intuition behind classification errors can be realized with multiple metrics. For example, a diagnostician may compute an isolation accuracy metric M_ia from the correctly and incorrectly classified components. In general, a diagnostician has to perform extra work to "verify" all misdiagnosed components in ω. Suppose that the diagnostician has access to a test oracle that states whether a component c is healthy or faulty. The system repair utility m_sru is then defined as the normalized average number of oracle calls needed to identify all false negative components; the weighted average of m_sru over the candidates in Ω gives the "per scenario" system repair utility M_sru. Similarly, a diagnostician has to eliminate all false positive components in a candidate. This is reflected in the diagnosis repair utility m_dru; its weighted average over a set of diagnostic candidates gives the "per scenario" diagnosis repair utility M_dru. M_utl, M_sru, and M_dru are discussed in detail in Appendix A.
The choice of which utility metric is best for a particular use depends on the relative costs of the available repair actions. For example, if components are nearly free but the act of replacing them is expensive, then it makes no sense to identify which erroneously replaced components were actually correct (thus m_sru is preferred).

System Metrics
The metrics M_fn, M_fp, M_da, M_fd, M_fi, M_err, M_utl, M_sat, M_cpu, M_mem, M_ia, M_sru, and M_dru are based on a single scenario. To obtain "per system" results we combine the metrics of each scenario using an unweighted average. For example, if a system SD is tested with scenarios S = {S_1, S_2, ..., S_n}, the "per system" utility of SD is computed as the average of M_utl(SD, S_i) over i = 1, ..., n, where M_utl(SD, S) is the "per scenario" utility of system SD and scenario S.
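The "per system" aggregation is a plain unweighted mean, as in this sketch (the scenario values are made up for illustration):

```python
# "Per system" aggregation: unweighted average of a per-scenario metric
# over all scenarios run on a system.

def per_system(metric_values):
    return sum(metric_values) / len(metric_values)

# e.g., the per-scenario utility M_utl of four scenarios on one system
m_utl_system = per_system([1.0, 0.5, 0.75, 1.0])
```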
Note that M_fn, M_fp, and M_da are called false negative scenario, false positive scenario, and scenario detection accuracy, respectively. The analogous "per system" metrics M̃_fn, M̃_fp, and M̃_da are called false negative rate, false positive rate, and detection accuracy. M̃_da, for example, represents the ratio of the number of correctly classified cases to the total number of cases. The latter "per system" metrics (M̃_fn, M̃_fp, and M̃_da) are equivalent to the ones in Kurtoglu et al. (2009). In this paper we first define each metric "per scenario" and then "per system".

EMPIRICAL EVALUATION
In order to evaluate the framework presented in the previous section we selected two case studies. The first case study was performed on an Electrical Power System (EPS) testbed located in the ADAPT Lab of NASA Ames Research Center (Poll, Patterson-Hine, Camisa, Garcia, Hall, Lee, Mengshoel, Neukom, Nishikawa, Ossenfort, Sweet, Yentus, Roychoudhury, Daigle, Biswas, & Koutsoukos, 2007). This system mimics components and configurations in a power system that might be found on an aerospace vehicle. The second case study was performed on a set of 14 synthetic systems, the 74XXX/ISCAS85 circuits (Brglez & Fujiwara, 1985), which are purely combinational, i.e., they contain no flip-flops or other memory elements, and represent well-known benchmark models of the model-based diagnosis community.
The empirical evaluation as part of the above two case studies employed 13 diagnostic algorithms (DAs) (Kurtoglu et al., 2009). The results from the DAs were used to compute metrics that were in turn used to evaluate the DAs' performance on the aforementioned systems. We first present the DAs used in the evaluation and then present the two case studies.

Diagnostic Algorithms
We have experimented with a total of 13 DAs; several of them are described below.

One DA (Mengshoel, 2007) processes all incoming environment data (observations from a system being diagnosed) and acts as a gateway to a probabilistic inference engine. The inference engine uses an Arithmetic Circuit evaluator which is compiled from Bayesian network models. The primary advantage of using arithmetic circuits is speed, which is key in resource-bounded environments.

RacerX: RacerX is a detection-only algorithm which detects a percentage change in individual filtered sensor values to raise a fault detection flag.

RODON: RODON (Karin, Lunde, & Münker, 2006) is based on the principles of the General Diagnostic Engine (GDE) as described by de Kleer and Williams (1987) and the G+DE (Heller & Struss, 2001). RODON uses contradictions (conflicts) between the simulated and the observed behavior to generate hypotheses about possible causes for the observed behavior. If the model contains failure modes in addition to the nominal behavior, these can be used to verify the hypotheses, which speeds up the diagnostic process and improves the results.

RulesRule: RulesRule is a rule-based, isolation-only algorithm. The rule base was developed by analyzing the sample data and determining characteristic features of faults. There is no explicit fault detection, though isolation implicitly means that a fault has been detected.

StanfordDA: StanfordDA is an optimization-based approach to estimating fault states in DC power systems. The model includes faults changing the system topology along with sensor faults. The approach can be considered as a relaxation of the mixed estimation problem. The authors have developed a linear model of the circuit and pose a convex problem for estimating the faults and other hidden states.
A sparse fault vector solution is computed by using L1 regularization (Zymnis, Boyd, & Gorinevsky, 2009).Wizards of Oz: Wizards of Oz (Grastien & Kan-John, 2009) is a consistency-based algorithm.The model of the system completely defines the stable (static) output of the system in case of normal and faulty behavior.
Given a new command or new observations, the algorithm waits for a stable state and computes the minimum diagnoses consistent with the observations and the previous diagnoses.
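The L1-regularized estimation idea behind StanfordDA can be illustrated in a few lines. The sketch below is not the authors' implementation: the sensor model matrix, the solver (iterative soft-thresholding), and all parameter values are illustrative assumptions showing how L1 regularization yields a sparse fault vector.

```python
import numpy as np

def ista_lasso(A, y, lam=0.1, step=None, iters=500):
    """Estimate a sparse fault vector x minimizing ||A x - y||^2 + lam*||x||_1
    via iterative soft-thresholding (a standard L1 / lasso solver)."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L = Lipschitz const. of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - y)                   # gradient of the quadratic term
        x = x - step * g
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))        # hypothetical linear sensor model
x_true = np.zeros(10); x_true[3] = 2.0   # single injected fault on component 3
y = A @ x_true + 0.01 * rng.standard_normal(30)
x_hat = ista_lasso(A, y, lam=0.5)
print(int(np.argmax(np.abs(x_hat))))     # index of the dominant estimated fault
```

With a sufficiently large regularization weight, all but the truly faulty entries of the estimate are driven exactly to zero, which is what makes the relaxation attractive for fault isolation.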

Case Study I: ADAPT EPS
We next describe the ADAPT EPS system, the diagnostic scenarios and the experimental results.

System Description
The ADAPT EPS testbed provides a means for evaluating DAs through the controlled insertion of faults in repeatable failure scenarios. The EPS testbed incorporates low-cost commercial off-the-shelf (COTS) components connected in a system topology that provides the functions typical of aerospace vehicle electrical power systems: energy conversion/generation (battery chargers), energy storage (three sets of lead-acid batteries), power distribution (two inverters, several relays, circuit breakers, and loads), and power management (command, control, and data acquisition).

The EPS delivers Alternating Current (AC) and Direct Current (DC) power to loads, which in an aerospace vehicle could include subsystems such as the avionics, propulsion, life support, environmental controls, and science payloads. A data acquisition and control system commands the testbed into different configurations and records data from sensors that measure system variables such as voltages, currents, temperatures, and switch positions. Data are presently acquired at a 2 Hz rate.

The scope of the ADAPT EPS testbed used in this case study is shown in Fig. 5. Power storage and distribution elements from the batteries to the loads are within scope; there are no power generation elements defined in the system catalog. We have created two systems from the same physical testbed, ADAPT-Lite and ADAPT:

ADAPT-Lite: ADAPT-Lite includes a single battery and a single load, as indicated by the dashed lines in the schematic (Fig. 5). The initial configuration for ADAPT-Lite data has all relays and circuit breakers closed, and no nominal mode changes are commanded during the scenarios. Hence, any noticeable changes in sensor values may be correctly attributed to faults injected into the scenarios. Furthermore, ADAPT-Lite is restricted to single faults.

ADAPT: ADAPT includes all batteries and loads in the EPS. The initial configuration for ADAPT has all relays open, and nominal mode changes are commanded during the scenarios. The commanded configuration changes result in adjustments to sensor values as well as transients that are nominal and not indicative of injected faults, in contrast to ADAPT-Lite. Finally, multiple faults may be injected in ADAPT. The differences between ADAPT-Lite and ADAPT are summarized in Table 5.

Diagnostic Challenges
The ADAPT EPS testbed offers a number of challenges to DAs. It is a hybrid system with multiple modes of operation due to switching elements such as relays and circuit breakers. There are continuous dynamics within the operating modes, and components from multiple physical domains, including electrical, mechanical, and hydraulic. It is possible to inject multiple faults into the system. Furthermore, timing considerations and transient behavior must be taken into account when designing DAs. For example, when power is input to the inverter, there is a delay of a few seconds before power is available at the output. For some loads, there is a large current transient when the device is turned on. System voltages and currents depend on the loads attached, and noise in sensor data increases as more loads are activated. Measurement noise occasionally exhibits spikes and is non-Gaussian. The 2 Hz sample rate limits the types of features that may be extracted from measurements. Finally, there may be insufficient information and data to estimate the parameters of dynamic models in certain modeling paradigms.

Fault Injection and Scenarios
ADAPT supports the repeatable injection of faults into the system in three ways:

Hardware-Induced Faults: These faults are physically injected at the testbed hardware. A simple example is tripping a circuit breaker using the manual throw bars. Another is using the power toggle switch to turn off an inverter. Faults may also be introduced in the loads attached to the EPS, for example by partially closing a valve.

Software-Induced Faults: These faults are injected through the testbed software by (1) sending commands to the testbed; (2) blocking commands sent to the testbed; and (3) altering the testbed sensor data.

Real Faults: In addition to the aforementioned two methods, real faults may be injected into the system by using actual faulty components. A simple example is a burned-out light bulb. This method of fault injection was not used in this study.

For the results presented in this case study, only abrupt discrete faults (a change in the operating mode of a component) and parametric faults (a step change in a parameter value) are considered. Nominal and failure scenarios are created using the hardware- and software-induced fault injection methods. The diagnostic algorithms are tested against a number of scenarios, each approximately four minutes in length.

The ADAPT-Lite experiments include 36 nominal and 56 single-fault scenarios. The ADAPT experiments have 48 nominal and 111 fault scenarios, which include single-, double-, and triple-fault scenarios. Figure 6 shows the fault-cardinality distribution of the ADAPT scenarios. Table 7 summarizes the types of faults used for ADAPT. The majority of faults involve sensors (102) and loads (30).

Experimental Results
We next compute the metrics described in Sec. 3.3 for the ADAPT-Lite and ADAPT scenarios.

ADAPT-Lite
The DA benchmarking results for ADAPT-Lite are shown in Fig. 7. The bottom-right plot of Fig. 7 shows the false positive and false negative rates. The corresponding detection accuracy can be seen in Table 8. As is evident from the definition of the metrics in Sec. 3.3, a DA that has low false positive and negative rates has high detection accuracy. False positives are counted in the following two situations: (1) for nominal scenarios where the DA declares a fault; and (2) for faulty scenarios where the DA declares a fault before any fault is injected. Noise in the data and incorrect models are the main causes of false positives.

The classification error metric for each DA is shown in the top-left plot of Fig. 7, where the error contributions of scenarios labeled false negative, false positive, and true positive are noted. Many DAs have difficulties distinguishing between sensor-stuck and sensor-offset faults. The distinction in the fault behavior is that a stuck sensor has zero noise while an offset sensor retains the noise of the original signal; the rightmost plot in Fig. 8 shows the fan speed sensor ST516 with sensor-offset and sensor-stuck faults. In many scenarios, the sensor-stuck faults are set to the minimum or maximum value of the sensor or held at its last reading. The latter case presents the most difficulties to DAs.

M fd and M fi are shown in the bottom-left plot of Fig. 7. RacerX is a detection-only DA and does not perform isolation (its detection time is very low). Note that M fd ≤ M fi; hence the bottom-left plot of Fig. 7 shows the isolation time stacked on the detection time (assume that part of the time first goes into detection and then into isolation).
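The false positive and false negative bookkeeping described above can be sketched as follows. This is an illustrative approximation: the exact Sec. 3.3 definitions may normalize the rates differently, and the scenario record format here is invented.

```python
def detection_metrics(scenarios):
    """scenarios: list of (t_inject, t_detect) pairs; None means no fault
    injected / no fault declared.  Returns (fp_rate, fn_rate, accuracy).
    A detection counts as a false positive if it occurs in a nominal
    scenario or before the fault was injected, as described above."""
    fp = fn = correct = 0
    for t_inject, t_detect in scenarios:
        if t_inject is None:                    # nominal scenario
            if t_detect is None: correct += 1
            else: fp += 1
        else:                                   # fault scenario
            if t_detect is None: fn += 1
            elif t_detect < t_inject: fp += 1   # declared before injection
            else: correct += 1
    n = len(scenarios)
    return fp / n, fn / n, correct / n

runs = [(None, None), (None, 12.0),   # nominal: one clean run, one false alarm
        (60.0, 75.5), (60.0, None),   # faulty: one detected, one missed
        (60.0, 10.0)]                 # detected before injection -> false positive
print(detection_metrics(runs))        # (0.4, 0.2, 0.4)
```

The example makes explicit why both situations (1) and (2) in the text are charged against the false positive rate even though one occurs in a faulty scenario.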
The top-right plot of Fig. 7 shows the system repair utility, M sru, and the diagnosis repair utility, M dru. The diagnosis repair utility is very close to 1 for all DAs, which reflects the small fault cardinality and diagnosis ambiguity groups for the system. The number of components that a DA considers faulty in any given scenario is typically close to the number of faults injected in the scenario. Since this number is much smaller than the total number of components, f, it is evident from equation (17) that M dru approaches 1. Furthermore, the number of components a DA considers healthy is much larger than the number it considers faulty, while the corresponding numbers of misclassified components are comparable; hence the system repair utility is smaller than the diagnosis repair utility.
Note that HyDE has been used by two different modelers of ADAPT-Lite. HyDE was modeled primarily with the larger and more complex ADAPT in mind and had a policy of waiting for transients to settle before requesting a diagnosis. The same policy was applied to ADAPT-Lite as well, even though transients in ADAPT-Lite corresponded strictly to fault events; this prevented false positives in ADAPT but negatively impacted the timing metrics in ADAPT-Lite. On the other hand, HyDE-S was modeled only for ADAPT-Lite and did not include a lengthy time-out period for transients to settle. HyDE-S had dramatically smaller mean detection and isolation times (see the bottom-left plot of Fig. 7) with roughly the same M err (see Table 8) as HyDE. This illustrates the impact that modeling and implementation decisions have on DA performance. While this gives some insight into the trade-offs present in building models, in this work we did not define metrics that directly address the ease or difficulty of building models of sufficient fidelity for the diagnosis task at hand.
As is visible from Table 8, there exist significant differences in M cpu and M mem. Part of these differences can be attributed to the operating system (Linux or Windows™). RODON was the only Java DA that was run on Windows™, which adversely affected its memory usage metric.
ADAPT

The empirical DA benchmarking results for ADAPT are shown in Table 9. The comments in the ADAPT-Lite discussion about noise and sensor-stuck faults apply here as well. Additionally, false positives also result from nominal commanded mode changes in which the relay feedback did not change status as of the next data sample after the command. Here is an extract from one of the input scenario files that illustrates this situation:

    command @120950 EY275_CL = false;
    sensors @121001 {..., ESH275 = true, ... };
    sensors @121501 {..., ESH275 = false, ... };

A command is given at 120.95 seconds to open relay EY275. The associated relay position sensor does not indicate open as of the next sensor data update 51 milliseconds later. This is nominal behavior for the system. A DA that does not account for this delay will indicate a false positive in this case.
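A DA can guard against this one-sample feedback delay with a small grace window. The sketch below is a hypothetical illustration; the function and event-list format are invented for exposition and are not part of DXF.

```python
def check_relay(events, grace=1):
    """events: time-ordered list of ('cmd', t, value) and ('obs', t, value)
    entries for one relay.  Flags a fault only if the observed position
    still disagrees with the last command after `grace` sensor samples,
    tolerating the one-sample feedback delay described above."""
    expected, pending = None, 0
    alarms = []
    for kind, t, value in events:
        if kind == 'cmd':
            expected, pending = value, grace
        elif expected is not None:
            if value == expected:
                pending = 0                 # feedback caught up with the command
            elif pending > 0:
                pending -= 1                # mismatch within the grace window
            else:
                alarms.append(t)            # persistent disagreement: raise alarm
    return alarms

# EY275 trace from the scenario extract: commanded open (False) at 120.95 s,
# feedback still closed at 121.001 s, open by 121.501 s -> no alarm.
trace = [('cmd', 120.950, False),
         ('obs', 121.001, True),    # one-sample delay, tolerated
         ('obs', 121.501, False)]
print(check_relay(trace))           # []
```

A relay that never reaches the commanded position would exhaust the grace window and be flagged, so the window trades a one-sample increase in detection time for the elimination of this class of false positives.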
The detection and isolation times are generally within the same order of magnitude for the different DAs (see the bottom-left plot of Fig. 9). Some DAs have isolation times that are similar to their detection times, while others show isolation times that are much greater than their detection times. This could reflect differences in reasoning strategies or differences in policies for when to declare an isolation based on accumulated evidence.
The CPU and memory usage are shown in Table 9. The comment made previously about RODON's memory usage applies here as well. The convex optimization approach applied in StanfordDA and the compiled arithmetic circuit in ProADAPT lead to very low CPU usage.

Fault Type and Cardinality Analysis
The plots on the left-hand side of Fig. 10 show detection accuracy for all DAs by fault type for ADAPT-Lite and ADAPT. In general, M da is not very sensitive to the component type, except in the case of load and sensor faults, where it is lower. The data on battery detection accuracy are not representative due to the limited number of fault scenarios containing battery faults (see Table 6 and Table 7).
The plots on the right-hand side of Fig. 10 show classification errors for all DAs by fault type for ADAPT-Lite and ADAPT. While the overall performance (averaged over all DAs) indicates that most fault categories result in roughly the same number of errors per scenario, it can be seen that a given DA may do better on some faults than on others; furthermore, different DAs have the fewest classification errors for different fault types. We should also note that in this benchmarking study, no partial credit was given for correctly naming the failed component but incorrectly isolating the failure mode. We realize, however, that isolating to a failed component or line-replaceable unit (LRU) is sometimes all that is required in maintenance operations. We plan to revisit this metric in future work.
Figure 11 shows the breakdown of classification errors by the number of faults in the scenario. In general, the number of errors increased approximately linearly with the number of faults in the scenario. The errors in the multiple-fault scenarios were evenly divided among the faults; for example, if there were four classification errors in a scenario where two faults were injected, each fault was assigned two errors. We also performed a more thorough assessment in which each diagnosis candidate was examined and classification errors were assigned to fault categories based on an understanding of which sensors are affected by the faults. The results are similar to evenly dividing the errors among the faults and are not shown here.

Metric Correlations
The correlation matrix shown in Table 10 contains the Pearson linear correlation coefficients between each pair of metrics for the industrial systems ADAPT and ADAPT-Lite.

Ideally, metrics should measure different aspects of DAs, i.e., the correlation matrix should contain only small values. Alternatively, users may use the correlation matrix from Table 10 to select metrics and adjust metric weights when tuning the parameters of their DAs. Unexpectedly high correlations (or anti-correlations) between metrics indicate (1) bias due to the system or the sensor data, or (2) hidden metric dependencies.
All correlation coefficients in Table 10, except those shown in bold red, are significant: the p-values according to the Student's t distribution are smaller than 0.03.
Figure 12 is a color map of the correlation matrix from Table 10. Correlations or anti-correlations close to 1 are colored red, while values closer to 0 are shown in blue.
The anti-correlation between M ia and M err is trivial (see Eq. (22)), and the only reason for including it is to show the correctness of our implementation.
The utility metric shows high correlation with the isolation accuracy/classification errors (ρ = 0.75). This is expected, as both metrics measure similar properties of the DAs' results. Less trivial is the high correlation between M−utl and M+utl (ρ = 0.84). The fault isolation time M fi correlates highly with the three utility metrics, for which we have no explanation. The M fn metric correlates highly with M da, which comes from the metric design and indicates that, in general, DAs are tuned to avoid false positives at the price of more false negatives.
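The significance test behind Table 10 can be reproduced as follows. The data here are synthetic, and the |t| > 2.2 threshold is only an approximation of the p < 0.03 cutoff for roughly 100 samples; with fewer scenarios the critical value would differ.

```python
import numpy as np

def corr_and_tstats(M):
    """M: (n_scenarios x n_metrics) array of metric values.  Returns the
    Pearson correlation matrix R and the matrix T of t statistics
    t = r * sqrt((n - 2) / (1 - r^2)); two-sided p-values follow from the
    Student's t distribution with n - 2 degrees of freedom."""
    n, k = M.shape
    Z = (M - M.mean(axis=0)) / M.std(axis=0)   # standardize each metric column
    R = (Z.T @ Z) / n                          # Pearson correlation matrix
    off = ~np.eye(k, dtype=bool)
    T = np.zeros_like(R)
    T[off] = R[off] * np.sqrt((n - 2) / (1.0 - R[off] ** 2))
    return R, T

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
M = np.column_stack([x, x + 0.1 * rng.standard_normal(100),  # near-duplicate metric
                     rng.standard_normal(100)])              # unrelated metric
R, T = corr_and_tstats(M)
# |t| > ~2.2 corresponds roughly to p < 0.03 at 98 degrees of freedom
print(R[0, 1] > 0.9, abs(T[0, 1]) > 2.2)
```

The near-duplicate pair illustrates the "hidden metric dependency" case: a large |ρ| with a large |t| flags two metrics that effectively measure the same thing.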

Case Study II: Synthetic Systems
We continue our discussion with an overview of the synthetic systems. The major differences between this case study and the previous one are the sizes of the systems and the cardinalities of the injected faults. Furthermore, all system variables in this case study are of Boolean type. This case study aims to compare the robustness, CPU performance, and memory consumption of various DAs under stress conditions (large systems, faults of multiple cardinality, etc.).

Description of Systems
The original 74XXX/ISCAS85 netlists can be mechanically translated into propositional Wffs. We have translated the propositional Wffs into logically equivalent Conjunctive Normal Form (CNF) formulae (Forbus & de Kleer, 1993). These CNF formulae are described in Table 11.

For each 74XXX/ISCAS85 CNF formula, Table 11 gives the number of inputs |IN|, the number of outputs |OUT|, the size of the component set |COMPS|, the number of variables |V|, and the number of clauses |C|. The size of the 74XXX/ISCAS85 circuits can be reduced by using cones for computing single-component ambiguity groups (Siddiqi & Huang, 2007) or by using fault collapsing.
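The netlist-to-CNF translation can be illustrated on a single gate. The sketch below uses a common weak-fault-model encoding, in which a health literal implies nominal gate behavior; the exact encoding used for the benchmark circuits may differ in detail.

```python
from itertools import product

def and_gate_cnf(h, a, b, o):
    """Weak-fault-model clauses for a 2-input AND gate: the health literal
    h implies o <-> (a AND b).  Variables are positive integers; a negative
    literal denotes negation (DIMACS-style convention)."""
    return [[-h, -a, -b, o],   # h & a & b  -> o
            [-h, a, -o],       # h & ~a     -> ~o
            [-h, b, -o]]       # h & ~b     -> ~o

def satisfied(cnf, assign):
    """assign: dict var -> bool; checks that every clause has a true literal."""
    return all(any(assign[abs(l)] == (l > 0) for l in c) for c in cnf)

# Variables: 1 = health, 2 and 3 = inputs, 4 = output.
cnf = and_gate_cnf(1, 2, 3, 4)
# With the gate healthy, only rows consistent with o = a AND b survive:
ok = [(a, b, o) for a, b, o in product([False, True], repeat=3)
      if satisfied(cnf, {1: True, 2: a, 3: b, 4: o})]
print(ok == [(a, b, a and b) for a, b in product([False, True], repeat=2)])  # True
```

When the health literal is false, all three clauses are vacuously satisfied, so a faulty gate constrains nothing; conflicts between such clauses and the observations are exactly what consistency-based DAs exploit.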
Algorithm 1 uses a number of auxiliary functions. RandomInputs (line 3) assigns uniformly distributed random values to each input in IN (note that for the generation of observation vectors we partition the observable variables OBS into inputs IN and outputs OUT, and use the input/output information that comes with the original 74XXX/ISCAS85 circuits for simulation). Given the "all healthy" health assignment and the diagnostic system, NominalOutputs (line 4) performs simulation by propagating the input assignment α. The result is an assignment β which contains values for each output variable in OUT.
The loop in lines 7-14 increases the cardinality by greedily flipping the values of the output variables. For each new candidate observation α_n, Alg. 1 uses the diagnostic algorithm Safari to compute a minimal diagnosis of cardinality c (Feldman, Provan, & van Gemund, 2008a). As Safari returns more than one diagnosis (up to N), we use MinCardDiag to choose the one of smallest cardinality. If the cardinality c of this diagnosis increases in comparison to the previous iteration, the observation is added to the list.

By running Alg. 1 we get up to K observations leading to faults of cardinality 1, 2, ..., m, where m is the cardinality of the MFMC diagnosis (Feldman, Provan, & van Gemund, 2008b) for the respective circuit. Alg. 1 clearly exhibits a bootstrapping problem: in order to create potentially "difficult" observations for a DA, we require a DA to solve those "difficult" observations. In our case we have used the anytime Safari. As Safari is a stochastic algorithm, it sometimes returns a minimal diagnosis when we need a minimal-cardinality one. This leads to scenarios with lower cardinalities than intended, but this causes no problems except minor difficulties in the analysis of the DAs' performance.
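The structure of Algorithm 1 can be sketched on a toy system in which brute-force search stands in for Safari. Everything below is illustrative: the inverter model, the function names, and the use of exhaustive search instead of a stochastic anytime diagnoser.

```python
from itertools import product
import random

def simulate(inputs, health):
    """Toy system of independent inverters: output i is NOT input i when
    component i is healthy, and stuck at the input value when faulty."""
    return [(not x) if h else x for x, h in zip(inputs, health)]

def min_card_diagnosis(inputs, obs_out):
    """Brute-force minimal-cardinality diagnosis (the role Safari plays in
    Algorithm 1): smallest number of faulty components consistent with obs."""
    n, best = len(inputs), None
    for health in product([True, False], repeat=n):
        if simulate(inputs, health) == obs_out:
            card = health.count(False)
            if best is None or card < best:
                best = card
    return best

def make_alphas(n, rng):
    """Greedy sketch of Algorithm 1: start from the nominal outputs and flip
    one output at a time, keeping each observation whose minimal diagnosis
    cardinality increases over the previous iteration."""
    inputs = [rng.random() < 0.5 for _ in range(n)]
    outputs = simulate(inputs, [True] * n)     # nominal behaviour
    alphas, best_card = [], 0
    for i in range(n):
        outputs[i] = not outputs[i]            # corrupt one more output
        card = min_card_diagnosis(inputs, outputs)
        if card > best_card:
            best_card = card
            alphas.append((list(inputs), list(outputs), card))
    return alphas

rng = random.Random(42)
alphas = make_alphas(4, rng)
print([card for _, _, card in alphas])   # [1, 2, 3, 4]
```

Because the toy components are independent, each flipped output raises the minimal cardinality by exactly one; in the real circuits the greedy loop can over- or undershoot, which is the bootstrapping problem noted above.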

Experimental Results
We start this section by computing the relevant metrics for this case study: M utl, M cpu, and M mem. The results are shown in Table 12.

It can be seen that Lydia achieved significantly better M cpu and M mem than NGDE and RODON. The M utl of Lydia is slightly worse due to the smaller number of diagnostic candidates computed by this DA. Lydia and RODON showed similar results in the utility metrics.

We have also computed M sat; the results are shown in Table 14. The SAT and UNSAT columns show the number of consistent and inconsistent candidates, respectively. NGDE generated approximately two orders of magnitude more satisfiable candidates than Lydia and RODON. The policy of Lydia has been to compute a small number of candidates, minimizing M mem and M cpu. In order to improve M utl, Lydia maps multiple-cardinality candidates onto single-component failure probabilities. Hence, only single-fault scenarios contribute to the M sat score for Lydia.

DISCUSSION
The primary goal of the empirical evaluation presented in this paper was to demonstrate an end-to-end implementation of DXF and create a foundation for future usage of the framework. As a result, we made several simplifying assumptions. We also ran into several issues during the course of this implementation that could not be addressed. In this section, we present those assumptions and issues, which we hope can be addressed in future implementations.

DXF Data Structures
The system catalog has been intentionally defined as a general XML format to avoid committing to specific modeling or knowledge representations (e.g., equations). It is expected that the sample training data and pointers to additional documentation are sufficient for DA developers to learn the behavior of the system. We will continue to look for ways to extend the system catalog representation to provide as much general information about the system as possible. The diagnosis result format is defined as a set of candidates with a weight associated with each candidate. Each candidate reports faulty modes for zero (all nominal) or more components. Obviously this is a simplistic representation, since it does not allow reporting of intermittent faults or parametric faults, among others. Also, in some cases it may be desirable to report a belief state (a probability distribution over component states) as opposed to a set of candidates.

Run-Time Architecture
For the ADAPT system, the fault signatures were limited to abrupt parametric and discrete types. We plan to introduce other fault types (incipient, intermittent, and noise) in the future. The run-time architecture was defined such that no assumptions were made regarding the actual operational environments in which the diagnostic algorithms may be run. We understand that a true test would simulate the operating conditions of the real system, i.e., the system operates nominally for long periods of time and failures occur periodically following the prior probability-of-failure distribution. In this work, faults were inserted assuming equal probabilities. In the future, we will provide the failure rates of components and use these to evaluate the performance of DAs. It was also assumed that all sensor data were available to the DAs at all time steps. In the future, we would like to relax this assumption and provide only a subset of the sensor data. Additional ideas for future research include giving DAs reduced sensor sets, introducing multi-rate sensor data, injecting transient faults, allowing for autonomous transitions, adding variable loads, and extending the scope and complexity of the physical system.

For the synthetic systems, all the systems were known in advance. This means researchers could optimize for these circuits. In addition, only one observation time was sampled. In the future, we will provide multiple observations. This will evaluate a DA's ability to merge information from multiple times. An important component of troubleshooting is introducing probe points. In the future, we can evaluate the number of probes needed to isolate the fault.

Diagnostic Metrics
The set of metrics we have chosen as primary is based on a literature survey and expert opinion on which measures are important for assessing the effectiveness of DAs. However, we realize that this set is by no means exhaustive. Different sets of metrics may be applicable depending on what the diagnosis results support (abort decisions, ground support, fault-adaptive control, etc.). In addition, there might be a set of weights associated with the metrics depending on their importance (for abort decisions, the fault detection time is of utmost importance). We expect to add more metrics to the list in the future (with support tools to compute those metrics). In addition, since we were dealing with abrupt, persistent, and discrete faults, metrics associated with incipient, intermittent, and/or continuous faults were not considered.
Finally, the metrics listed in this paper do not capture the amount of effort necessary to build models of sufficient fidelity for the diagnosis task at hand. Furthermore, we have not investigated the ease or difficulty of updating models with new or changed system information. The art of building models is an important practical consideration that is not addressed in the current work.
In future work, we would like to determine a set of application-specific use cases (maintenance, autonomous operation, abort decisions, etc.) that the DA supports and select metrics relevant to each use case.

Empirical Evaluation
Some practical issues arose in the execution of experiments. Much effort was put into ensuring stable, uniform conditions on the host machines; however, during the implementation, it was necessary to take measures that may have caused slight variability. One example was the manual examination of ongoing experiment results for quality assurance. Future releases of DXF can address this by being more robust to unexpected DA behavior and by sending notifications in the event of such behavior. Additionally, for Java DAs, significant differences in the resource-usage metrics can be introduced by the Java virtual machine and the host operating system rather than by the algorithm itself.

CONCLUSION
We presented a framework for evaluating and comparing DAs under identical conditions. The framework is general enough to be applied to any system and any kind of DA. The run-time architecture was designed to be as platform-independent as possible. We defined a set of metrics that might be of interest when designing a diagnostic algorithm, and the framework includes tools to compute the metrics by comparing actual scenarios and diagnostic results.
Using the framework, we have experimented with 13 diagnostic algorithms on 16 systems of various sizes and of synthetic and real-world origin. We have, both manually and programmatically, created 1 651 observation scenarios of varying complexity. We have designed 10 metrics for measuring diagnostic performance. This has resulted in the execution of 6 484 scenarios with a total duration of more than 169.7 hours and the computation of 84 292 metrics.
We presented the results of our effort to evaluate the performance of a set of diagnostic algorithms on the ADAPT electrical power system testbed and on a set of synthetic circuits. We learned valuable lessons in trying to complete this effort. One major take-away is that there is still a lot of work and discussion needed to determine a common comparison and evaluation framework for the diagnosis community. The other key observation is that no DA was best in a majority of the metrics. This clearly indicates that the selection of DAs necessarily involves a trade-off analysis between the various performance metrics.
The framework presented is by no means a finished product, and we expect it to evolve over the years. In the paper, we have identified some of its limitations and the expected scope for future expansion. Our sincere hope is that the framework is adopted by a growing number of people and applied to a wide variety of physical systems and diagnosis algorithms from several different research communities. The long-term goal is to create a database of performance evaluation results which will allow system designers to choose the appropriate DA for their system given the constraints and metrics of their application.
One can see that M ia and M err are duals, i.e., M ia = 1 − M err (cf. Eq. (22)). Consider the isolation accuracy m ia of a single diagnostic candidate ω ∈ Ω: m ia "penalizes" a DA for each misclassified component. As is visible from Fig. 13, the penalty is applied linearly. The isolation accuracy metric M ia originates in the automotive industry (Committee E-32, 2008). The Aerospace Recommended Practice (ARP) computes the closely related probability of correct classification in the following way. For each component we compute the square confusion matrix. The probability of correct classification is the sum of the main diagonal divided by the total number of classifications (see the referenced ARP (Committee E-32, 2008) for details and examples).

It can be shown that the probability of correct classification, as defined in the above ARP, is equivalent to M ia if both fault and nominal component modes are used for the computation of the confusion matrices. The probability of correct classification is conditioned on the fault probability, while the probability measured by M ia is not; the latter is purely a metric design consideration. The fact that we use nominal modes for computing M ia leads to a higher correlation of M ia with the detection accuracy metrics defined later in this section.
If more than one predicted mode vector is reported by a DA (meaning that the diagnostic output consists of a set of candidate diagnoses), then the isolation accuracy and the classification errors are calculated for each predicted component mode vector and weighted by the candidate probabilities reported by the DA, as seen in Eq. (20) and Eq. (14). M ia and M err are very useful for single diagnoses, but with multiple candidates they are less intuitive. The metric that follows is loosely based on the concept of "repair effort" and partly remedies this problem.
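The candidate-weighted combination can be sketched as follows. Note that this illustration only classifies components as faulty or nominal, whereas the actual M ia works at the level of fault modes; the component names and weights below are invented for the example.

```python
def weighted_isolation_accuracy(comps, injected, candidates):
    """candidates: list of (weight, faulty_set) pairs.  The per-candidate
    accuracy is the fraction of components whose health status the candidate
    classifies correctly -- a linear penalty per misclassified component.
    Candidates are combined using the weights reported by the DA (assumed
    to sum to 1)."""
    def m_ia(faulty):
        wrong = len(faulty ^ injected)          # symmetric difference = misclassified
        return 1.0 - wrong / len(comps)
    return sum(w * m_ia(f) for w, f in candidates)

comps = {'BAT1', 'EY144', 'ESH275', 'IT240', 'ST516'}
injected = {'ST516'}                            # one injected sensor fault
candidates = [(0.7, {'ST516'}),                 # correct candidate
              (0.3, {'ST516', 'EY144'})]        # one extra false positive
print(round(weighted_isolation_accuracy(comps, injected, candidates), 2))  # 0.94
```

The weighting rewards a DA for placing most of its probability mass on the correct candidate even when lower-ranked candidates contain spurious components.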

A.2 Utilities
In what follows we show the derivations of the three utility metrics: the system repair utility M sru, the diagnosis repair utility M dru, and the utility M utl.

A.2.1 System Repair Utility
Consider an injected fault ω⋆ (ω⋆ is the set of faulty components) and a diagnostic candidate ω (the set of components the DA considers faulty). The number of truly faulty components that are improperly diagnosed by the diagnostic algorithm as healthy (false negatives) is n = |ω⋆ \ ω| (see Fig. 4). In general, a diagnostician has to perform extra work to verify a diagnostic candidate ω, which must be reflected in the system repair utility. We assume that he or she has access to a test oracle that reports whether a component c is healthy or faulty.

We first determine the expected number of tests a diagnostician has to perform to test all components in ω⋆ \ ω (the false negatives) if the diagnostician chooses untested components at random with uniform probability. In the worst case, the diagnostician has to test all the remaining components in COMPS \ ω (the diagnostic algorithm has already determined the state of all components in ω). Consider the average situation. We denote N = |COMPS \ ω|; N is the size of the "population" of components to be tested.
The probability of observing s − 1 successes (faulty components) in k + s − 1 trials (i.e., after k oracle tests of healthy components) is given by direct application of the hypergeometric distribution. Multiplying it by the probability of drawing a faulty component on the next test yields the probability mass of the inverse hypergeometric distribution, which in our case gives the probability of testing k healthy components before we find s faulty components out of the population (no repetitions):

p′(k, s, n, N) = [C(n, s−1) · C(N−n, k) / C(N, k+s−1)] · (n − s + 1)/(N − k − s + 1)   (26)

The expected value E′[k] of p′(k, s, n, N) (from the definition of the first central moment of a random variable) is:

E′[k] = Σ_{k=0}^{N−n} k · p′(k, s, n, N)   (27)

Replacing p′(k, s, n, N) in (27) and simplifying gives us the mean of the inverse hypergeometric distribution²:

E′[k] = s (N − n)/(n + 1)   (28)

As we are interested in finding all s = n faulty components, the expected value E′(n, N) becomes:

E′(n, N) = n (N − n)/(n + 1)   (29)

The expected number of tests E[t] (as opposed to the expected number of tested healthy components E′[k]) then becomes:

E[t] = n + E′(n, N) = n (N + 1)/(n + 1)   (30)

The expected number of tests E[t] is then normalized by the number of components f and flipped along the y axis to give the system repair utility:

m sru = 1 − E[t]/f   (31)

Plotting the system repair utility m sru against a variable number of false negatives is shown in Fig. 14. One can see that, unlike m err, which changes linearly, m sru "penalizes" improperly diagnosed components non-linearly.
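The closed form for the expected number of tests, which simplifies to E[t] = n(N + 1)/(n + 1), can be checked against a Monte Carlo simulation of the oracle-testing process. The code below is an illustration of the derivation, not part of DXF; the trial count and tolerances are arbitrary.

```python
import random

def expected_tests(n, N):
    """Expected number of oracle tests to find all n faulty components in a
    population of N untested components (mean of the inverse hypergeometric
    derivation above): E[t] = n + n*(N - n)/(n + 1) = n*(N + 1)/(n + 1)."""
    return n * (N + 1) / (n + 1)

def m_sru(n, N, f):
    """System repair utility sketch: expected test effort normalized by the
    total number of components f and flipped (1 = no extra repair work)."""
    return 1.0 - expected_tests(n, N) / f

def simulate_tests(n, N, trials, rng):
    """Monte Carlo check: draw components at random without replacement
    until all n faulty ones are found; return the mean number of draws."""
    total = 0
    for _ in range(trials):
        pop = [True] * n + [False] * (N - n)   # True marks a faulty component
        rng.shuffle(pop)
        found = draws = 0
        while found < n:
            found += pop[draws]
            draws += 1
        total += draws
    return total / trials

rng = random.Random(0)
n, N = 3, 20
print(simulate_tests(n, N, 20000, rng), expected_tests(n, N))  # Monte Carlo vs. closed form
```

For n = 3 faulty components in a population of N = 20, the closed form predicts 15.75 expected tests, and the simulated mean converges to the same value, confirming the step from Eq. (29) to Eq. (30).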
The system repair utility for a set of diagnoses is defined as:

M sru = Σ_{ω∈Ω} W(ω) · m sru(ω)   (32)

where W(ω) is the weight of a diagnosis ω such that:

Σ_{ω∈Ω} W(ω) = 1   (33)

All weights W(ω), ω ∈ Ω, are computed by the diagnostic algorithm.
² For a detailed derivation of the negative hypergeometric mean, see (Schuster & Sype, 1987).

The diagnosis repair utility for a set of diagnoses is defined analogously:

M dru = Σ_{ω∈Ω} W(ω) · m dru(ω)

A.4 Utility

The utility metric (per candidate), m utl, is a combination of m sru and m dru; the utility metric (per scenario), M utl, is the corresponding weighted sum over the candidates. Figure 15 plots m utl for varying numbers of false negatives and false positives in a (symmetric) case where the cardinality of the injected fault is half the number of components. Normally, the number of injected faulty components |ω⋆| is small compared to the total number of components.

Commandable components contain an additional entry in the system catalog specifying a command ID and a command value type (analogous to the sensor value type). The command message represents the issuance of a command to the system. In the ADAPT system, for example, the message (EY144_CL, true) signifies that relay EY144 is being commanded to close. "EY144_CL" is the command ID, and "true" is the command value (in this case, a Boolean).

C.2 Diagnosis Result Format
The DA's output (i.e., the estimate of the physical status of the system) is standardized to facilitate the generation of common data sets and the calculation of the benchmarking metrics, which are introduced in Sec. 3.3. The resulting diagnosis message is summarized in

Figure 1: Framework architecture

Figure 4 :
Figure 4: The diagnostic candidate ω and the injected fault ω ⋆ partition COMPS into four sets

Figure 5 :
Figure 5: A schematic overview of the ADAPT EPS

M_fd and M_fi are shown in the bottom-left plot.

Figure 8: Examples of sensor readings

Figure 9 shows (1) M_err by DA (top-left), (2) M_sru and M_dru by DA (top-right), (3) M_fd and M_fi by DA (bottom-left), and (4) M_fn and M_fp by DA (bottom-right). Five of the eight DAs tested were best or second best with respect to at least one of the metrics for ADAPT.

Figure 10: M_da and M_err per fault type for all DAs

Figure 11: M_err per fault cardinality for all DAs (ADAPT)

Algorithm 1: Generation of observation vectors

1: function MakeAlphas(DS, N, K) returns a set of observations
     inputs: DS = ⟨SD, COMPS, OBS⟩
             OBS = IN ∪ OUT, IN ∩ OUT = ∅
             N, integer, number of tries
             K, integer, maximal number of diagnoses per cardinality
     local variables: α, β, α_n, ω, terms
             c, integer, best cardinality so far
             Ω, set of terms, diagnoses
             A, set of terms, result

Figure 13: m_ia as a function of n and N

The probability p(k, s) of then observing a faulty component in the next oracle test is simply the number of remaining false negatives, n − (s − 1), divided by the size of the remaining population, N − (s + k − 1):

p(k, s) = (n − s + 1)/(N − k − s + 1)   (25)

and the probability of testing exactly k healthy components before the s-th faulty one is the product of these two probabilities:

p′(k, s, n, N) = [C(n, s − 1) C(N − n, k) / C(N, k + s − 1)] · (n − s + 1)/(N − k − s + 1)   (26)
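Equations (25)–(26) are easy to check numerically. A short sketch (the helper is ours), which also verifies the mean s(N − n)/(n + 1) quoted in the appendix:

```python
from math import comb

def inv_hypergeom_pmf(k, s, n, N):
    """p'(k, s, n, N): probability of testing exactly k healthy components
    before the s-th faulty one (population N, n faulty, no repetition):
    the hypergeometric term of (26) times p(k, s) from (25)."""
    hyper = comb(n, s - 1) * comb(N - n, k) / comb(N, k + s - 1)
    return hyper * (n - s + 1) / (N - k - s + 1)

n, N, s = 3, 10, 3
pmf = [inv_hypergeom_pmf(k, s, n, N) for k in range(N - n + 1)]
print(round(sum(pmf), 9))   # 1.0 -- a valid probability distribution
mean_k = sum(k * p for k, p in enumerate(pmf))
print(round(mean_k, 9), s * (N - n) / (n + 1))  # both equal 5.25
```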

Figure 14: m_sru as a function of n

Table 1: Scenario execution summary data

Table 2: Metrics summary

(see Table 4 for an overview). In what follows we provide a brief description of each DA.
energy, and information. It performs fast root-cause analysis with linear computational complexity. Its main advantage is that knowledge-engineering a model for it is very efficient. The algorithm has been proven in several commercial applications.

Table 5: ADAPT and ADAPT-Lite differences

Table 6: ADAPT-Lite faults

Table 6 summarizes the types of faults used for ADAPT-Lite.

Table 10: ADAPT metrics correlation matrix (correlations with p-values smaller than 0.03 are shown in bold red)

Table 12: Synthetic systems metrics results

Table 13: Synthetic systems secondary metrics results

Table 14: Synthetic systems satisfiability results

Differences were evident in the peak memory usage metric when run on Linux versus Windows. The problem was mostly bypassed by running all but one Java DA on Linux.
This appendix provides a detailed derivation of the formulae for the technical accuracy metrics. In this appendix we use the notation of Sec. 3.3 (in particular, recall Fig. 4 and Table 3).

A.1 Classification Errors and Isolation Accuracy

Recall the definitions of M_err and M_ia.

Though there are additional message types, the most important messages for the purpose of benchmarking are the sensor data message, the command message, and the diagnosis message, described below.

C.1 Sensor/Command Data

Sensor data are defined broadly as a map of sensor IDs to sensor values (observations). Sensor values can be of any type; currently the framework allows for integer, real, Boolean, and string values. The type of each observation is indicated by the system's XML catalog.
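A sensor data message is thus a map from sensor IDs to typed observations. A minimal sketch (the sensor IDs and readings below are illustrative, not taken from the catalog):

```python
from typing import Dict, Union

# Allowed observation types per the framework: integer, real, boolean, string.
SensorValue = Union[int, float, bool, str]
SensorData = Dict[str, SensorValue]

sample: SensorData = {
    "E242": 24.0,       # a real-valued reading (e.g., a voltage)
    "ESH244A": False,   # a boolean reading (e.g., a relay position)
    "MODE": "NOMINAL",  # a string-valued observation
}
print(sample["E242"])  # 24.0
```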

Table 15: Sensor and command message format

The diagnosis message (Table 16) contains:
• timestamp: a value indicating when the diagnosis was issued by the algorithm.
• candidateSet: a candidate fault set, i.e., a list of candidates the algorithm reports as a diagnosis. A candidate fault set may include a single candidate with a single fault or multiple faults, or multiple candidates, each with a single fault or multiple faults. It is assumed that only one candidate in a candidate fault set can represent the system at any given time.
• detectionSignal: a Boolean value indicating whether the diagnosis system has detected a fault.
• isolationSignal: a Boolean value indicating whether the diagnosis system has isolated a candidate or a set of candidates.

Table 16: Diagnosis message format

In addition, each candidate in the candidate set has an associated weight. Candidate weights are normalized by the framework such that their sum for any given diagnosis is 1.
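Putting the Table 16 fields together, a minimal sketch of the diagnosis message and the framework-side weight normalization (class, field, and component names are ours, chosen to mirror the table, not DXF's actual schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    faults: List[str]  # component IDs this candidate reports as faulty
    weight: float      # DA-assigned weight (any positive value)

@dataclass
class DiagnosisMessage:
    timestamp: float
    candidate_set: List[Candidate]
    detection_signal: bool
    isolation_signal: bool

    def normalize_weights(self):
        """Scale candidate weights so they sum to 1 for this diagnosis."""
        total = sum(c.weight for c in self.candidate_set)
        for c in self.candidate_set:
            c.weight /= total

msg = DiagnosisMessage(
    timestamp=12.5,
    candidate_set=[Candidate(["EY144"], 3.0), Candidate(["EY144", "EY160"], 1.0)],
    detection_signal=True,
    isolation_signal=True,
)
msg.normalize_weights()
print([c.weight for c in msg.candidate_set])  # [0.75, 0.25]
```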