Hierarchical Reasoning about Faults in Cyber-Physical Energy Systems using Temporal Causal Diagrams

Cyber-physical systems are often equipped with specialized fault management systems that observe the state of the system, decide whether there is an anomaly, and then take automated actions to isolate faults. For example, in electrical networks, relays and breakers isolate faults in order to arrest failure propagation and protect the healthy parts of the system. However, due to limited situational awareness and hidden failures, the protection devices themselves, through their operation (or mis-operation), may cause overloading and the disconnection of parts of an otherwise healthy system. Additionally, there can be faults in the fault management system itself, leading to situations where it is difficult to isolate failures. The work presented in this paper addresses this problem by describing the formalism of Temporal Causal Diagrams (TCDs), which augment the failure models of the physical system with discrete-time models of the protection elements, accounting for the complex interactions between the protection devices and the physical plant. We use a case study of the standard Western System Coordinating Council (WSCC) 9-bus system to describe four different fault scenarios and illustrate how our approach can help isolate these failures. Though we use power networks as exemplars in this paper, our approach can be applied to other distributed cyber-physical systems, for example water networks.


INTRODUCTION
Recent advances in sensor networks, embedded systems, and information and communication technology have steered the interest of the scientific community towards the development of cyber-physical systems (CPSs). A cyber-physical system is the integration of physical processes with computation. Tight coupling between physical processes and software is the hallmark of such systems. These ubiquitous engineered systems form the backbone of control infrastructures in modern society. The focus of CPSs is to improve the collaborative link between physical and computational elements, enhancing the autonomy and intelligence of physical systems so that they can plan and modify their actions for evolving environments based on self-awareness.

Ajay Chhokra et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
According to (Reppa et al., 2015b), the key concerns in designing CPSs are safety, reliability and fault tolerance. In order to address these concerns, the cyber ecosystem of many critical systems, such as power systems, is empowered with fault management components for arresting failure propagation. These specialized devices have supervision capabilities for diagnosing faults in the physical system and taking appropriate remedial actions to remove faulty components, as described in (Blanke et al., 2006; Isermann, 2006). Figure 1 shows a network of interconnected CPSs. The cyber system of each CPS includes a specialized fault management component, which consists of an anomaly detector and a reconfiguration controller. The anomaly detector detects discrepancies, resulting from a fault in the physical plant, from the sensor data and informs the reconfiguration controller about the observed anomaly. The reconfiguration controller instructs the actuator to change its state, modifying the operating conditions in a way that can arrest the failure effect in the physical system. Various quantitative and qualitative approaches have been developed over the years to diagnose faults in the physical plant, sensors and actuators; see (Bouamama et al., 2014) for details. In this paper, we limit the scope of the cyber system to fault managers (anomaly detectors and reconfiguration controllers) and the communication amongst these computation elements only.
Apart from sensors and actuators, cyber fault management components such as anomaly detectors and reconfiguration controllers can have faults too. Anomalous behavior can cause inadvertent changes in the physical system that can lead to secondary failures. Moreover, in critical systems, the fault management components are based on reflex healing approaches and have to act on local information in a limited amount of time. These actions are devoid of a system-wide perspective and can cause cascading failures. A similar phenomenon was seen in the 2003 blackouts in the USA, where misoperations of protection devices exacerbated the initial disturbances into cascading outages in other parts of the electric grid (North American Electric Reliability Corporation, 2012).
System of Interest: One of the emerging applications of CPS is the modern power system, also referred to as a Cyber-Physical Energy System (CPES). A CPES is the amalgamation of power grid technology with intelligent control, coordination and communication between the demand and supply sides to deliver electricity efficiently. Physical components in power systems, such as transmission lines, generators and transformers, work in dynamic environments resulting from varying load, changing operational requirements and component degradation. To achieve fault tolerance and the required level of resiliency, a number of fast-acting localized protection mechanisms are used to detect and isolate faults. These protection systems include detection devices, such as fast-acting numerical relays designed to detect abnormal changes in physical properties (current, voltage, impedance), and actuation devices, such as breakers that can be triggered to open the circuit in electrical networks. While these protection devices are effective in detecting and isolating faults in specific regions of a system, their decisions are based on local information. This results in a highly conservative reaction from protection devices, without consideration of the consequences of the control actions. Apart from the lack of a system-wide perspective, these protection devices can have faults of their own. The change in system state due to (mis)operation of the protection devices can eventually increase stress on other parts of the system and thus cause secondary failures. These failures result in the triggering of other protection devices. This domino effect can quickly cascade through the whole system, ultimately leading to a complete system shutdown.
Traditional data- and model-based failure diagnosis approaches, listed in (Sekine et al., 1992), do not fully capture the failure propagation that results from the interactions between faults and their effects across the physical and cyber systems. A new modeling and diagnosis strategy is needed that isolates faults in physical and cyber components and is robust to changes in the underlying physical system, the cyber fault management system, sensors and actuators.
Our Contributions: In this paper we present a diagnosis approach based on Temporal Causal Diagrams (TCDs) that considers 1) discrete and continuous dynamics of the underlying components, 2) faults in the physical components, 3) misoperations or malfunctions of the discrete components (sensors, anomaly detectors, actuators, controllers), and 4) propagation of failure effects in both cyber and physical components. A TCD model includes a fault propagation graph as well as the behavioral models of protection devices under nominal and faulty conditions. It is an extension of our prior work on Timed Failure Propagation Graphs (TFPG) (Abdelwahed & Karsai, 2007; Dubey et al., 2011).
We present a TCD model-based diagnosis scheme that uses local observers and a system-level reasoning engine to diagnose faults. The observers are discrete state machines derived from the behavior models captured in the TCD model. They use the incoming observable events in real time to produce alarms that are then consumed by the reasoning engine, which produces system-level hypotheses consistent with the failure propagation graphs, to identify fault sources and predict impending system-level effects; see Figure 1. A key feature of this technology lies in its ability to model and diagnose not only faults in the physical system but also failures of protection elements or controllers, where the controllers are tightly coupled with the physical components.
We also describe a set of discrete-time behavioral models of widely used power system protection devices.
Finally, we demonstrate the proposed TCD reasoning technique for single- and multi-fault scenarios using the standard Western System Coordinating Council (WSCC) 9-bus system.
Outline: The rest of the paper is organized as follows: Section 2 provides a survey of some of the published work on fault diagnosis for electrical power grids, and section 3 highlights the key aspects of our approach in light of lessons learned from past work. Section 4 describes the TCD modeling formalism in detail. Section 5 gives insight into the various physical and cyber elements associated with a power transmission system and also describes their respective TCD models with the help of a simple example. The failure diagnosis approach, including observers and reasoning logic, is described in section 6. Section 7 presents the case study with diagnosis results for different scenarios. Concluding remarks are provided in section 8.

RELATED RESEARCH
Fault diagnosis in cyber-physical systems is a challenging task due to the inherent heterogeneity and large scale of the physical systems. A number of decentralized and distributed schemes for fault detection have been proposed in the literature. To enhance fault isolation, hierarchical or multi-level diagnosis has also been proposed. The single level of diagnosis is realized by local diagnosers. Specifically, the local diagnosers may exchange estimations (Khalili & Zhang, 2014; Yan & Edwards, 2008; Daigle et al., 2007) or measurements (Ferrari et al., 2012; Boem et al., 2013; Shames et al., 2011) of the interconnected system states, or fault signatures (Daigle et al., 2007). Apart from faults in physical systems, a number of approaches have been proposed to diagnose faults in sensors and actuators (Reppa et al., 2015b; Zhang & Zhang, 2013a,b; Reppa et al., 2013, 2015a). In (Zhang & Zhang, 2013a,b), a distributed architecture is designed to isolate single faults, while (Reppa et al., 2013, 2015a,b) can detect and isolate multiple sensor faults. However, little attention has been given to diagnosing the behavior of anomaly detectors and reconfiguration controllers. In order to correctly isolate faults in interconnected systems, a holistic approach is required that covers components in both physical and cyber systems.
Apart from the distributed and multi-level diagnosis discussed above, there exists a vast literature on methodologies fine-tuned for power systems. Fault diagnosis in power systems is an active area of research; see (Ferreira et al., 2016) for details. Many technical papers have focused on fault segment estimation. The diagnosis approaches can be broadly classified into three categories based on their underlying technique: expert systems (Yongli et al., 1994; Huang, 2002; Cardoso et al., 2008; Jung et al., 2001), artificial neural networks (Cardoso et al., 2004; Mahanty & Gupta, 2004; Thukaram et al., 2005; Bi et al., 2002) and analytical model optimization (Wu et al., 2005; Wen & Chang, 1997; He et al., 2009; Guo et al., 2010). In addition, approaches based on Petri nets (Sun et al., 2004) and cause-effect Bayesian networks (Chen et al., 2001, 2011; Guo et al., 2009; Chen, 2012; Yongli et al., 2006) have also been proposed.
Expert systems are one of the earliest techniques used to solve the failure diagnosis problem in power systems. The diagnosis process in an expert system can be rule based or model based. A comprehensive survey of such knowledge-based approaches is available in (Sekine et al., 1992). Expert systems in general suffer from a number of drawbacks related to the maintenance of the knowledge database and slow response time. These approaches are expected to work well if all the received alarms are correct; missing and incorrect alarms force the diagnosis technique to produce wrong hypotheses.
Artificial neural networks (ANNs) are adaptive systems inspired by biological systems. ANNs model the complex relationships between inputs and outputs without an explicit description of rules to precisely define the power system protection schemes, i.e., based on operational data. The multilayer feedforward perceptron with backpropagation (MPNN) is the most commonly used neural network model for failure diagnosis, as described in (Cardoso et al., 2004). However, this learning methodology suffers from slow training and low inference capability with limited training data. In (Bi et al., 2002; Mahanty & Gupta, 2004), neural networks with radial basis functions (RBF) are presented. The authors in (Thukaram et al., 2005) discuss support vector machines (SVM) in order to avoid the shortcomings of MPNN. ANN-based approaches in general suffer from convergence problems. Further, ANNs have to be retrained whenever there is a change in the network topology, as the weights depend upon the structure of the power system.
A number of model-based analytical methods have been devised over the years for diagnosing failures in power systems; see (Wu et al., 2005; Wen & Chang, 1997; He et al., 2009) for details. Optimization techniques such as genetic algorithms (Wen & Chang, 1997), particle swarm optimization (He et al., 2009) and evolutionary algorithms (Wu et al., 2005) have been used to generate optimal failure hypotheses that best explain all the events/alarms. The analytical model presented in (Guo et al., 2010) not only estimates faults in the physical components but also hypothesizes the state of protection relays and circuit breakers. But these techniques rely heavily on critical and computationally expensive tasks, such as the selection of an objective function and the development of exact mathematical models for system actions and protective schemes, which greatly influence the accuracy of the failure diagnosis.
Cause-effect networks have also been used to diagnose faults in power systems, as mentioned in (Chen et al., 2001, 2011; Guo et al., 2009; Chen, 2012; Yongli et al., 2006). A cause-effect network consists of nodes and edges, where nodes represent failures and relaying system actions, and edges imply the causal relationship between faults and relay actions. The accuracy of the diagnosis approach presented in (Chen et al., 2001, 2011) decreases if there is uncertainty in the behavior of protection relays (PR) and/or circuit breakers (CB). The authors in (Chen, 2012; Yongli et al., 2006) consider the anomalous behavior of PRs and CBs by extending the cause-effect approach with fuzzy digraphs and Bayesian networks. However, these techniques do not provide hypotheses related to the state of PRs and CBs. An on-line alarm analysis approach is presented in (Guo et al., 2009) for diagnosing failure modes in the physical plant as well as in the relaying system based on a temporal causal network. But this approach does not take into account the operating modes and conditions of the system that influence the failure propagation.
The approach described in this paper differs from current methodologies, where fault analysis and mitigation rely on a logic-based approach that depends on hard thresholds and local information, assisted by manual system-level analysis. The causal model presented in this paper is based on the timed failure propagation graph (TFPG) introduced in (Abdelwahed & Karsai, 2006; Padalkar et al., 1991; Abdelwahed & Karsai, 2007), which is conceptually related to the temporal causal network approach presented in (Guo et al., 2009).
We have extended this work to take into account local protection action in a subsystem which could arrest the fault or lead to larger cascading faults.This is primarily done by considering the discrete behavior of the protection devices and incorporating their effects in fault propagation.Our approach can improve the effectiveness of isolating failures in large-scale systems such as Smart Electric Grids, by identifying impending failure propagation which increases the system reliability and reduces the losses accrued due to power failures.

TIMED FAILURE PROPAGATION GRAPHS AND THEIR LIMITATIONS
In the past, we have used Timed Failure Propagation Graph (TFPG) models and reasoning schemes to diagnose faults in physical systems (Abdelwahed & Karsai, 2006) and software systems (Dubey et al., 2011). A timed failure propagation graph is a labeled directed graph where nodes are either failure modes or discrepancies. Discrepancies are the failure effects, some of which may be observable. Edges in a TFPG represent the causality of the failure propagation, and edge labels capture the operating modes in which the failure effect can propagate over the edge, as well as a time interval by which the failure effect can be delayed.
Figure 3 shows a simple failure graph with two failure mode nodes, FM1 and FM2, three observable discrepancies, D1, D2 and D5, and two silent discrepancies, D3 and D4. Alarms A1, A2 and A3 signal the detection of the monitored discrepancies. The failure effect of FM1 reaches D1, then propagates to D3 and finally reaches D5 under operating conditions quantified by modes a and d. The TFPG reasoner accounts for the fault propagation constraints imposed by the operational mode and temporal delays to produce multi-fault hypotheses that consistently explain the observed alarms. For instance, the observation of alarm A1 at time t = t1 triggers the TFPG reasoner to produce a hypothesis stating that the failure mode FM1 was activated during the interval [t1 − 2, t1 − 1], if the current system mode is either a or d. Also, the reasoning engine is robust to alarm faults (false positives or false negatives), which are taken into account while computing the metrics used to rank the hypotheses (Abdelwahed & Karsai, 2007). For example, if the current system mode is either b or d and alarm A3 (which monitors D5) is observed, then the TFPG reasoner will produce two hypotheses: one indicating the presence of fault FM2 along with a missing alarm A2, and a second related to a false alarm A3. The TFPG-based diagnosis scheme has been successfully applied to physical systems including industrial plants (Padalkar et al., 1991) and aerospace systems (Mahadevan et al., 2011).
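The interval computation behind such a hypothesis can be sketched as follows (a minimal Python illustration, not the actual reasoner; the delay values mirror the [t1 − 2, t1 − 1] example above):

```python
# Minimal sketch of backward reasoning over a single TFPG edge: given the
# time at which an alarm fired and the propagation delay bounds [t_min, t_max]
# on the edge from the failure mode to the monitored discrepancy, infer the
# interval in which the failure mode must have been activated.
def activation_interval(alarm_time, t_min, t_max):
    """Activation window consistent with an alarm observed at alarm_time."""
    return (alarm_time - t_max, alarm_time - t_min)

# Alarm A1 at t1 = 10 over an edge with delay [1, 2] places the activation
# of FM1 in [t1 - 2, t1 - 1] = [8, 9].
window = activation_interval(10, 1, 2)
```

Chaining this computation along a path of edges (summing the delay bounds) gives the activation window for failure modes several hops away from the observed alarm.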
However, in certain cyber-physical systems, such as transmission and distribution networks (e.g. power and water), there are protection devices that try to arrest the failure effect if it is detected. These protection devices alter the system topology by instructing breakers (switches) to change their state. These devices can also have faults that alter their response to the effect of the failure and to control commands.
Figure 3 also depicts the abstract models of an anomaly detector, a protection device and two actuators that conjointly try to stop the effects of failures in the physical system discussed in the previous paragraph.
• Anomaly Detector: The detector generates alarms {A1, A2, A3} in response to unobservable events {E1, E2, E3}, where {E1, E2, E3} represent the failure effects modeled by discrepancies {D1, D2, D5}. The anomaly detector may have a failure mode of its own that will cause the detector to miss the failure effect. The activation of this fault is modeled by an unobservable event F1 that pushes the automaton from state S1 to S2.
• Protection Device: The protection device also consists of two states, {S1, S2}. While in state S1, it appropriately responds to the alarms generated by the anomaly detector, emitting commanding events {C1, C2} for alarms {A1, A2}. Similar to the anomaly detector, the protection device also has a missed-detection failure mode, whose activation is represented by the event F2.
• Actuator: The actuator consists of three states, {S1, S2, S3}. In response to the command C1 from the protection device, the actuator changes its state from S1 to S2. This device also has a missed-detection fault that forces the breaker to ignore the commands sent by the protection devices.

As shown in Figure 3, the system consists of two breakers, and the states of the breakers are mapped to system modes. One of the valid traces of the system shown in Figure 3 can be explained as follows: fault FM1 is injected, and after 1.5 s the anomaly detector issues alarm A1. The alarm A1 forces the protection device to emit command C1, which forces the actuator to change state from S1 to S2. The state change modifies the system mode from d to b. The mode change takes place within 1.5 + δ1 + δ2 seconds, where δ1 and δ2 are the maximum communication delays between the anomaly detector and the protection device, and between the protection device and the actuator, respectively.
It can be observed that the TFPG-based approach could correctly isolate the fault source FM1. However, it is difficult to diagnose faults in the cyber infrastructure, which includes the protection devices along with the anomaly detectors and actuators, i.e., F1, F2 and F3. Diagnosing such faults is extremely desirable for cyber-physical systems, where a realistic assessment of fault propagation is not possible without accounting for the behavior of the deployed sensing and actuation components.
A more comprehensive approach is desired, in which the behavioral aspects (including faulty behavior) of local protection elements, anomaly detectors and actuator components can be modeled and tracked in conjunction with the fault propagation graph. It is with this objective that we introduced the Temporal Causal Diagram (TCD) based diagnosis scheme in (Mahadevan et al., 2014), which incorporates the TFPG model and takes into account the problems associated with sensing and actuation elements.
Our initial approach using TCDs relies on modifying the TFPG model to account for the nominal and faulty operation of the cyber components by appending failure graphs with behavior models, forming temporal causal graphs. This quickly complicates a simple TFPG model, as it introduces all the variants from the behavior model into the failure propagation graph, posing challenges when applying the strategy to large-scale examples such as power grids.
Our current approach, as presented in this paper, is a refinement of our earlier work using Temporal Causal Diagrams on electrical power grids and is more modular in nature. The refined approach uses a two-layer hierarchical reasoning engine, where the lower layer includes observers derived from the behavioral models of the protection equipment. The observers reason about the events observed from their respective components and feed their inferences to the higher-level TCD reasoner. The TCD reasoner not only handles the fault propagation model (like the TFPG diagnosis engine), but also deals with the derived alarms (or hypotheses) reported by the observer(s). The reasoner uses the fault propagation model to reason about the derived alarms fed by the observer(s) and computes consistent system-level hypotheses. Figure 1 shows the diagnosis system block diagram, consisting of multiple observers and the TCD reasoning engine. The hierarchical diagnosis system is supplied with events from the cyber-physical system. A key aspect of this approach is that the reasoner implementation is not affected by any change in the system topology or the behavior of the protection devices. The next section formally explains the structure of a Temporal Causal Diagram.
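The two-layer flow described above can be sketched minimally in Python (class names, event names and the derived-alarm label are illustrative, not taken from the paper's implementation):

```python
# Sketch of the two-layer diagnosis architecture: local observers turn raw
# component events into derived alarms; a system-level reasoner collects the
# derived alarms for hypothesis generation.
class Observer:
    def __init__(self, name, alarm_for):
        self.name = name
        self.alarm_for = alarm_for  # maps raw component events to derived alarms

    def process(self, event):
        # Return a derived alarm if the event is meaningful to this observer,
        # otherwise None.
        return self.alarm_for.get(event)

class Reasoner:
    def __init__(self):
        self.alarms = []

    def consume(self, alarm):
        # Only derived alarms (not raw events) reach the system-level reasoner.
        if alarm is not None:
            self.alarms.append(alarm)

obs = Observer("PA1_observer", {"zone1_trip": "d_TL1_PA1"})
reasoner = Reasoner()
reasoner.consume(obs.process("zone1_trip"))   # derived alarm reaches reasoner
reasoner.consume(obs.process("heartbeat"))    # uninteresting event is filtered
```

Because the observers encapsulate the device behavior, swapping a protection device's behavioral model changes only its observer, not the reasoner, which is the modularity property claimed above.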

TEMPORAL CAUSAL DIAGRAM
A temporal causal diagram is a behavior-augmented failure propagation graph. It comprises a directed graph that captures the failure propagation across the whole system in different operating conditions; this propagation is influenced by the behavioral models of the various cyber components (i.e., the protection equipment). The following subsections describe the modeling formalism for capturing the failure dynamics and the model of computation used for representing the cyber components.

Temporal Failure Propagation Graphs
A timed failure propagation graph is a labeled directed graph. In the context of self-correcting cyber-physical systems such as power grids, the system mode, or operating condition, depends upon the state of the sources, the sinks and the topology of the system. Identifying all operating conditions, i.e., all unique system modes, is computationally very expensive. In this paper, we use the system topology dictated by the state of the actuators to map an operating condition (i.e., mode) to the failure propagation. However, while such a constraint imposed by the topology of the system is necessary to identify when a fault will not propagate, it is not sufficient to state that a failure will propagate. So we extend the TFPG language with an additional map that associates uncertainty with failure edges. Formally, the extended TFPG is represented as a tuple {F, D, E, M, ET, EM, ND}, where

• F is a nonempty set of failure nodes. A failure node can be in one of two states: present, denoted by the ON state, or absent, denoted by the OFF state.
• D is a nonempty set of discrepancy nodes representing failure effects.
• E ⊆ V × V is a set of edges over the nodes V = F ∪ D.
• M is a nonempty set of system modes. At each time instance t the system can be in only one mode.
• ET : E → I is a map that associates with every edge in E a time interval [tmin, tmax] ∈ I representing the minimum and maximum time for failure propagation over the edge.
• EM : E → P(M) is a map that associates with every edge in E a set of modes in M in which the edge is active. For any edge e ∈ E that is not mode-dependent (i.e., active in all modes), EM(e) = ∅.
• ND : E → {True, False} is a map that associates with each edge e ∈ E a value True or False, where True implies the propagation along the edge Will happen, whereas False implies the propagation is uncertain and Can happen.
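The extended TFPG tuple can be represented directly as a data structure; a minimal Python sketch follows (node and edge names are illustrative):

```python
# Sketch of the extended TFPG tuple {F, D, E, M, ET, EM, ND} as a data
# structure, with the mode-dependence rule for edge activation.
from dataclasses import dataclass

@dataclass
class TFPG:
    F: set    # failure nodes
    D: set    # discrepancy nodes
    E: set    # edges as (src, dst) pairs
    M: set    # system modes
    ET: dict  # edge -> (t_min, t_max) propagation delay bounds
    EM: dict  # edge -> set of modes in which the edge is active (empty = all)
    ND: dict  # edge -> True ("Will happen") / False ("Can happen")

    def edge_active(self, edge, mode):
        """An edge is active if it is mode-independent (EM(e) = empty set)
        or the current mode belongs to its mode set."""
        modes = self.EM.get(edge, set())
        return not modes or mode in modes

g = TFPG(F={"FM1"}, D={"D1"}, E={("FM1", "D1")}, M={"a", "b", "d"},
         ET={("FM1", "D1"): (1, 2)},
         EM={("FM1", "D1"): {"a", "d"}},
         ND={("FM1", "D1"): True})
```

Here the edge FM1 → D1 is active only in modes a and d, matching the example of Figure 3.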

Discrete Behavior Models
The behavior of discrete devices is modeled using an extended time-triggered automaton (Krčál et al., 2004). The extension includes sets of failure modes and failure mode guards. Mathematically, an extended time-triggered automaton is represented as a tuple (Σ, Q, q0, Qm, Fcyber, Dcyber, M, α(F), Φ, T).
• Event Set: Σ is a finite set of events consisting of observable and unobservable events, partitioned as Σ = Σobs ∪ Σunobs such that Σobs ∩ Σunobs = ∅. Observable events are alarms, commands and messages exchanged between discrete components, whereas unobservable events are related to the introduction of faults in system components.
• Locations: Q is a finite set of locations. q0 ∈ Q is the initial location of the automaton and Qm ⊂ Q is a finite set of marked locations.
• Failure Mode Constraints: α(F) is the set of failure mode constraints, i.e., Boolean expressions built over failure mode atoms δ(f), such as δ(f), ω1 ∧ ω2, ω1 ∨ ω2 and ¬ω1, where f ∈ Fcyber is a failure mode and ω1, ω2 are failure mode constraints. A failure mode constraint is True if its Boolean expression evaluates to True and False otherwise.
• Timing Constraints: Φ is a set of timing constraints defined as Φ = {[n], (n) | n ∈ N+}, where [n] denotes an instantaneous constraint and (n) represents a periodic constraint. The timing constraints specify a pattern of time points at which the automaton checks for events and failure mode constraints. For instance, a periodic constraint (4) on any outgoing transition from the current state forces the automaton to look for the events specified by the edge every 4 units of time, whereas in the case of an instantaneous constraint [4] the automaton checks only once.
• Transitions: T is a finite set of edges, where an edge represents a transition between two locations. The activation conditions of an edge depend upon the timing constraint, the failure mode constraint and an input event. For example, an edge may represent a transition from location q1 to q2 with an instantaneous time constraint of n units of time and a failure mode constraint δ(f1) ∧ ¬δ(f2) ∈ α(Fcyber) defined over the failure modes f1, f2 ∈ Fcyber; σ1 ∈ Σ is the required input event for this transition to be valid, and σ2 ∈ Σ represents the event generated when the transition is taken. Syntactically, a transition is represented as Event(timing constraint){failure constraint}/Event. In case no event is mentioned, the transition is valid only if the failure mode constraint evaluates to true at the time points given by the timing constraint.
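The transition-enabling rule above can be sketched in Python (a simplified illustration under stated assumptions: guards are plain predicates over the set of active failure modes, and time is an integer tick counter):

```python
# Sketch of the activation check for an extended time-triggered automaton
# transition: the timing constraint must admit the current tick, the failure
# mode constraint must hold, and the required input event (if any) must match.
def tick_matches(constraint, t):
    """constraint is ("instant", n) for [n] or ("periodic", n) for (n)."""
    kind, n = constraint
    return t == n if kind == "instant" else t % n == 0

def transition_enabled(t, constraint, guard, required_event, event, faults):
    if not tick_matches(constraint, t):
        return False        # automaton is not checking at this time point
    if not guard(faults):
        return False        # failure mode constraint violated
    return required_event is None or event == required_event

# Guard encoding delta(f1) AND NOT delta(f2): f1 active, f2 inactive.
guard = lambda faults: "f1" in faults and "f2" not in faults
ok = transition_enabled(8, ("periodic", 4), guard, "A1", "A1", {"f1"})
```

With the periodic constraint (4), the check succeeds at tick 8 but would be skipped at tick 7, mirroring the "every 4 units of time" behavior described above.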

POWER TRANSMISSION SYSTEM
Figure 4 shows a simple cyber-physical energy system where a load L1 is fed by two generators, G1 and G2, via transmission lines TL1 and TL2. Buses B1, B2 and B3 act as interface points for the different system elements. The example system also contains four protection assemblies (PA1, PA2, PA3 and PA4). Each protection assembly has a relaying system consisting of a transformer (current and voltage), a protection relay and a breaker assembly. This section briefly describes these components along with their TCD models.

Physical System (Plant)
In the context of power systems, the physical system components can be broadly classified into three categories: A) power conversion elements, B) power delivery elements, and C) buses.
The following subsections present a brief overview of these categories.For more details, please refer to (Dugan, 2016).
Power conversion elements convert energy from other forms into electrical energy; generators and loads are examples. Most of the elements in this category have one multi-phase terminal. For the scope of this paper, the power conversion elements are considered black boxes whose implementation can be of variable fidelity.
Power delivery elements have two or more multi-phase terminals. Their basic function is to transfer energy from one place to another. The most common power delivery elements are transmission lines and transformers.
Buses are the interface points for power conversion and delivery elements.Buses can be considered as N-node containers to which other components are connected.

Cyber System (Protection System)
Cyber systems include components responsible for supervisory control and protection of components in the physical system.In power systems, the cyber components include the protection relays (distance, over-current, differential relays, etc.) and circuit breakers.
Distance relays serve to protect the power grid from faults in transmission lines. A relay can act as a primary protection element for a transmission line and as a backup or secondary protection for lines in its neighborhood. Distance relays work on the principle of the apparent impedance ratio. The reach of a distance relay is marked in terms of zones that are functions of the impedance ratios and the direction in which the relay is configured to operate. Usually the distance relay is configured with zones 1, 2 and 3, defined respectively as 80%, 125% and 200% of the forward impedance of the transmission line to which the relay is attached. When a fault occurs in a configured zone, its effect eventually reaches the relay, at which point the relay sends a trip signal to the breakers to arrest the failure effects. For faults in zone 1, the distance relay serves as the primary protection element and acts without any delay. For faults in the other zones, it serves as a backup and is configured to wait for a certain time (after fault detection) to allow a primary relay to respond to the fault. Typically this delay is in the range [0.08, 0.167] s for zone 2 and [0.250, 1] s for zone 3, as mentioned in (E.O. Schweitzer et al., 2014; Kundur et al., 1994). For the system shown in Figure 4, the distance relays included in PA1 and PA2 act as primary protection elements for faults in line TL1, while PA4 serves as a backup or secondary protection device.
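The zone logic above can be sketched as follows (a Python illustration using the quoted reach settings and typical delay ranges; actual relay settings are installation-specific):

```python
# Sketch of distance-relay zone classification from the measured apparent
# impedance, using the zone reaches quoted above (80% / 125% / 200% of the
# protected line's forward impedance) and the typical backup trip delays.
def zone(apparent_z, line_z):
    """Return (zone number, (min, max) trip delay in seconds), or None if the
    fault lies outside the relay's reach."""
    ratio = apparent_z / line_z
    if ratio <= 0.80:
        return 1, (0.0, 0.0)      # primary protection: trips instantaneously
    if ratio <= 1.25:
        return 2, (0.08, 0.167)   # backup: short coordination delay
    if ratio <= 2.00:
        return 3, (0.250, 1.0)    # remote backup: longer coordination delay
    return None

z, delay = zone(apparent_z=0.5, line_z=1.0)   # well inside zone 1
```

The graded delays give the primary relay a chance to clear the fault first; the backup relay trips only if the fault persists beyond its coordination delay.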
Circuit Breakers can be opened or closed to disconnect or restore power flow in the appropriate segment of the power transmission system.This can be used to stop the flow of failure effect by opening and closing the circuit upon receiving the appropriate command from the protection relays.

TCD Model
This subsection describes the TCD model of an example power system -the two transmission line system in Figure 4.

Failure Graphs
In power systems, protection elements are deployed redundantly to detect and isolate faulty components.The TCD failure graph for power systems is constructed in terms of the faults in the physical system and the effects observed by the protection devices.
The failure graph involving physical faults in a two-transmission-line system is shown in Figure 5. The nodes labeled F TL1 and F TL2 represent failures in transmission lines TL1 and TL2. The effects of these failures are signaled by the alarms raised by distance relays in protection assemblies PA1, PA2, PA3, and PA4. The failure propagation is captured by an edge between the failure node F TLn and the discrepancy d TLn PAk, where F TLn represents a fault in line TLn and d TLn PAk represents an anomaly detected by protection assembly PAk due to a fault in line TLn. The physical effect corresponding to this anomaly is a reduction in impedance that is observed from relay data in the form of zone 1, 2, and 3 alarms (described in the next section).
The failure propagation delay depends on the time taken by the failure effect to reach the bus where the distance relay is installed, plus the time taken to detect the fault conditions. Typically this is close to 30 milliseconds, as mentioned in (E. O. Schweitzer et al., 2014). Failure propagation edge activation conditions are expressed in terms of the states of the breakers in the path between the protection assembly and the generator (source). As shown in Figure 5, in order for PA4 to detect a fault in line TL1, the breakers in assemblies PA4, PA3, and PA2 must be closed. Thus, the operating condition for the effect of a failure to travel from node F TL1 to d TL1 PA4 is captured by the Boolean expression PA4 BR close and PA3 BR close and PA2 BR close.
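The activation condition above is just a conjunction over breaker states, so it can be evaluated as a small predicate. This is a minimal sketch; the function name and the string encoding of breaker states are assumptions for illustration.

```python
def edge_active(condition, breaker_states):
    """Evaluate a failure-propagation edge activation condition.

    condition: list of breaker names that must all be closed for the
    failure effect to propagate along the edge.
    breaker_states: dict mapping breaker name -> "close" or "open".
    """
    return all(breaker_states.get(br) == "close" for br in condition)

# Operating condition for the edge F_TL1 -> d_TL1_PA4 from the text:
cond = ["PA4_BR", "PA3_BR", "PA2_BR"]
states = {"PA4_BR": "close", "PA3_BR": "close", "PA2_BR": "close"}
edge_active(cond, states)   # True: the failure effect can reach PA4
states["PA3_BR"] = "open"
edge_active(cond, states)   # False: the propagation path is cut
```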
The ability of a protection element to detect a fault depends on a number of factors, mainly the location of the fault with respect to the protection assembly, the nature of the power flow (forward or backward), the physical state of the breakers, and the loading conditions. Protection elements located at the remote end are known to over- or under-reach. Hence, the failure propagation links between failure nodes and discrepancies related to remote or backup protection elements are marked uncertain, ND(e) = False, and are represented by dotted lines. As shown in Figure 5, PA4 acts as a backup protection device for faults in line TL1; thus the link between F TL1 and d TL1 PA4 is marked uncertain.
We further classify the discrepancies associated with faults in each transmission line as primary and secondary discrepancies: primary discrepancies are associated with the primary protection devices for faults in the transmission line, while secondary discrepancies are related to backup protection devices (described in the next section).

Discrete Behavioral Model: Distance Relay
Figure 6 shows the discrete model of a typical relaying system containing a distance relay (protection relay) and a breaker assembly. The distance relay model consists of three zones of protection. Table 1 summarizes the failure modes and events (observable and unobservable) considered in the distance relay model.
The automaton consists of 9 states, which are described in Table 2. Initially the automaton is in the idle state; it looks for fault conditions, i.e. the events E1, E2, and E3, and checks the status of the failure modes every R seconds. If the distance relay detects zone 1 fault conditions (modeled by the presence of the event E1), it moves to the tripped state, issues a Z1 alarm, and commands the breaker to open (cmd open). For zone 2 and zone 3 fault conditions (E2, E3), the protection relay does not immediately issue an open command after moving to the chkZ2 or chkZ3 state. The state machine waits for a predefined time, zn2wt, zn3wt ∈ R+, and checks again for the presence of the fault conditions. If the fault is still present, the relay commands the breaker to open. Additionally, distance relays may be configured with permissive overreach transfer trip protocols. In this case, the primary relays associated with a transmission line send permissive trip signals to each other in order to avoid the zone 2 wait time.
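The zone 2/3 re-check behavior of the relay automaton can be sketched as a step function over states and events. State and event names (idle, chkZ2, chkZ3, tripped, E1/E2/E3, cmd open) follow the text; the function signature, the deadline bookkeeping, and the default wait times are assumptions for illustration, not the authors' implementation.

```python
def relay_step(state, events, t, trip_deadline=None, zn2wt=0.1, zn3wt=0.5):
    """One step of a simplified distance-relay automaton.

    Returns (new_state, command, new_deadline). `events` is the set of
    fault-condition events (E1/E2/E3) currently present at time t.
    """
    if state == "idle":
        if "E1" in events:
            return "tripped", "cmd_open", None       # zone 1: no delay
        if "E2" in events:
            return "chkZ2", None, t + zn2wt          # re-check after zn2wt
        if "E3" in events:
            return "chkZ3", None, t + zn3wt          # re-check after zn3wt
        return "idle", None, None
    if state in ("chkZ2", "chkZ3") and trip_deadline is not None and t >= trip_deadline:
        ev = "E2" if state == "chkZ2" else "E3"
        if ev in events:                              # fault still present
            return "tripped", "cmd_open", None
        return "idle", None, None                     # fault has cleared
    return state, None, trip_deadline
```

A zone 2 condition that persists past the wait time thus produces a trip, while a transient condition returns the relay to idle.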
In the presence of internal faults, the distance relay may fail to detect physical faults. This is modeled by the missed detection fault F de1, upon which the relay jumps to the detErr1 state and no longer detects any physical faults. In certain situations the distance relay may have internal faults related to spurious detection (F de2 z1, F de2 z2, F de2 z3). In such cases, as modeled in the automaton, it incorrectly reports zone 1, zone 2, or zone 3 faults by moving to detErr2 or detErr3 and instructs the breaker to open. In this model, the faults (F de1, F de2 z1, F de2 z2, F de2 z3) are assumed to be mutually exclusive.
The state close is the initial state of the breaker automaton; every R seconds, the automaton checks for the cmd open event and the presence of the F stuck close failure mode. If the failure mode is not present, the breaker state machine moves to the opening state. In the opening state, the state machine waits for t3 units of time before transitioning to the open state. t3 is a parameter of the behavior model that captures the lag due to the mechanical nature of the breaker and is in the range [0, 50] milliseconds, as mentioned in (E. O. Schweitzer et al., 2014). Similarly, in the open state, the automaton looks for the cmd close event and the status of the F stuck open failure mode. The automaton then moves to the closing state and after t3 seconds moves to the close state.
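The breaker side of the model, with the mechanical lag t3 and the stuck-close hidden failure, can be sketched the same way. The step-function shape and default t3 value are assumptions; the states and failure mode follow the text.

```python
def breaker_step(state, cmd, t, t_open=None, stuck_close=False, t3=0.03):
    """One step of a simplified breaker automaton (close -> opening -> open).

    t3 models the mechanical lag, in [0, 50] ms per the text; 30 ms is
    an illustrative choice. A stuck-close failure silently masks the
    open command, which is what makes it a hidden failure.
    """
    if state == "close" and cmd == "cmd_open":
        if stuck_close:
            return "close", None        # command ignored: hidden failure
        return "opening", t + t3        # begin the mechanical transition
    if state == "opening" and t_open is not None and t >= t_open:
        return "open", None             # transition completes after t3
    return state, t_open
```

Note that an observer watching only the breaker's emitted state changes cannot distinguish "stuck close" from "still opening" until the expected transition time has safely elapsed, which motivates the timed observers described later.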

Discrete Behavioral Model: Circuit Breaker
The TFPG model shown in Figure 5 and multiple copies of the behavioral models shown in Figure 6 constitute the system TCD model for the two-transmission-line system. A valid sample trace of such a system is as follows: a 3-phase-to-ground fault is introduced in the middle of the line at t = 0.5 secs. This causes zone 1 fault conditions for the primary relays in assemblies PA1 and PA2 and zone 3 conditions for the backup PA4. All the relays detect the fault at t = 0.501 secs and instruct the breakers to open. The breakers change their mode and isolate the fault at t = 1.502 secs.

DIAGNOSIS SYSTEM
The TCD-based diagnosis system employs a hierarchical framework, as shown in Figure 1. The lower layer includes observers that track the operation of the cyber components (distance relays and circuit breakers) to detect and locally diagnose faults in the physical and protection systems. The observers feed their results to the reasoning engine. The TCD reasoning engine produces a set of hypotheses that explain the current system state, as reported by the various observers, by traversing the failure propagation graph. The traversal is constrained by the state of the protection system as predicted by the observers tracking it. The following subsections provide a detailed description of the models and operation of the observers and the TCD reasoner.

Observers
Observers are responsible for detecting and diagnosing faults in the cyber components (protection equipment in electric grids) by tracking their behavior. The observers monitor the observable events generated by the cyber components. The timed events produced by the various observers fall into two categories: estimations of state changes in the discrete components, and discrepancy detections. The detected anomalies and the local estimates of the states of the different components in the plant and protection layers are passed by the observers to the next layer for system-level diagnosis. The observer models for the distance relay and the circuit breaker are described below.

Observer: Distance Relay
The TTA model of the distance relay observer is shown in Figure 7. The state machine has 8 states, with idle being the initial state. The events attributed to the distance relay observer machine are summarized in Table 1 (last two rows). The observer remains in the idle state until zone fault conditions are reported by the corresponding distance relay.
Once the distance relay fires a Z1 event, the observer machine jumps to the chkZ1 state and waits t2 seconds for the open command (cmd open event). If the command is received, the observer moves to the tripped state; otherwise it transitions back to the idle state. t2 is a parameter of the distance relay observer machine that models propagation delay and relay frequency. Note that the transition from the chkZ1 state back to the idle state implies a communication channel fault, but such faults are not considered in this paper.
Similarly, the observer machine moves to the chkZ2 state when the distance relay reports a Z2 event after detecting zone 2 fault conditions. Upon confirmation of the zone 2 fault, the observer waits t3 seconds for the arrival of the cmd open command, where t3 is a parameter equal to the sum of the zone 2 wait time and t2. If the cmd open event is not observed within t3 seconds, the automaton moves back to the idle state and concludes that the zone 2 fault condition has disappeared. The observer machine moves from the chkZ2 state to the chkZ2 Z1 state if the event TripRec occurs, and then waits for the cmd open event. The distance relay observer diagnoses zone 3 faults in a similar fashion. The observer layer generates time-stamped h Z1, h Z2, and h Z3 events to signal the TCD reasoner with the local diagnosis of physical faults (zone 1, zone 2, zone 3), and emits h Z1', h Z2', and h Z3' events to signal the disappearance of the zone 1, 2, and 3 fault conditions. From the tripped state the observer moves back to the idle state when a reset signal is observed, and updates the physical component to be fault-free by issuing the h Z1', h Z2', h Z3' events.
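The zone 2 branch of the observer can be sketched as a check over a timed event trace: after a Z2 report it waits t3 = zone 2 wait time + t2 for the cmd open command, and concludes the fault condition disappeared if the command never arrives. The event-list encoding and the parameter values are illustrative assumptions.

```python
def observer_z2(events, zn2wt=0.1, t2=0.01):
    """Simplified zone-2 branch of the distance-relay observer.

    events: list of (time, name) pairs from the relay, in time order.
    Returns "tripped" if cmd_open arrives within t3 = zn2wt + t2 of the
    Z2 report, else "idle" (fault condition assumed to have cleared).
    """
    t3 = zn2wt + t2
    t_z2 = None
    for t, name in events:
        if name == "Z2" and t_z2 is None:
            t_z2 = t                                  # enter chkZ2
        elif name == "cmd_open" and t_z2 is not None:
            if t - t_z2 <= t3:
                return "tripped"                      # local diagnosis: h_Z2
    return "idle"
```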

Observer: Circuit Breaker
The breaker observer model is shown on the right side of Figure 7, and its events are listed in Table 3. A number of approaches exist for generating discrete diagnosers for dynamic systems (Sampath et al., 1995; Tripakis, 2002). The various observers in the TCD diagnosis system consume the input events from the discrete components and generate alarms for the higher-level TCD reasoner. The TFPG includes mappings between the observable discrepancies related to faults in the physical plant and the observer alarms. These mappings keep the reasoning engine independent of changes in the behavioral models, while allowing the events to be consumed by both the observer and the reasoning engine. The resultant TFPG for physical faults in the two-transmission-line system is listed in Table 4.
One more failure graph is created for linking cyber faults with the derived alarms produced by the various observers. These cyber faults are summarized in Table 5.

TCD Reasoner
This section discusses the model-based reasoning engine, focusing on the graph-based diagnosis approach, the diagnosis inputs, the hypothesis structure, and the ranking metrics. Based on the TCD model of the system, the diagnosis engine tries to explain the observed events from the protection system (relay and breaker observers) in terms of faults associated with the physical and/or cyber components of the protection system, taking into account the operating mode of the system.

System States and Maps
The diagnosis engine hypothesizes on the state of the nodes in the failure graph based on the outputs of the observer models.
The state of a node in a failure propagation graph can be categorized as a Physical (actual), Observed, or Hypothetical state (Abdelwahed & Karsai, 2006).
• The physical state corresponds to the actual state of the nodes and edges. At any time t, the physical state of a node is given by the map PNode t : V → {ON, OFF} × R. HSet t is the set of all hypotheses generated by the TCD reasoner; every hypothesis h f in HSet t has its own HNode t map. The structure of a hypothesis is defined in the following subsection.
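The state maps can be encoded minimally as dictionaries: PNode carries the actual ON/OFF state plus a timestamp per node, and each hypothesis carries its own HNode map with the estimated activation interval. The container types and the example values are assumptions for illustration.

```python
# Physical state map PNode_t: node -> (state, time of last state change).
PNode = {
    "F_TL1": ("ON", 0.500),        # fault active since t = 0.5 s
    "d_TL1_PA1": ("ON", 0.501),    # failure effect reached PA1's relay
}

# Hypothetical state map HNode_t: node -> (state, terl, tlat), i.e. the
# estimated earliest/latest times of the state change. One such map is
# kept per hypothesis in HSet_t.
HNode = {
    "F_TL1": ("ON", 0.470, 0.501), # fault estimated to have activated
                                    # somewhere in [0.470, 0.501]
}
```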

Reasoner Hypothesis
A hypothesis is a tuple whose elements are related through temporal consistency. Formally, a hypothesis is h f = {f, terl, tlat, S, C, I, M, E, U}, where: • f ∈ F is the physical failure mode projected by the hypothesis h f, and F is the set of physical failure modes defined in Section 4.1. We use single-physical-fault hypotheses, which list only one fault per element of the physical system along with multiple faults in the protection system.
• S ⊆ F cyber is the set of cyber faults active in the system. These faults are related to components in the protection system layer, as defined in Section 4.2.
• The interval [terl, tlat] gives the estimated earliest and latest times during which the failure mode f could have been activated. Time estimates for protection-layer faults are not supported in the current implementation.
• C ⊆ D is the set of discrepancies that are consistent with the hypothesis h f, where D is the set of physical discrepancies described in Section 4.1. These discrepancies are referred to as consistent discrepancies. We partition the set C into two disjoint subsets C1 and C2, where C1 consists of primary discrepancies and C2 contains secondary discrepancies. A discrepancy d is called primary w.r.t. the hypothesis h f if the failure propagation link leading to d is certain; otherwise it is termed secondary, as defined in Section 5.3.1.
• E ⊆ D is the set of discrepancies that are expected to be activated in the future according to h f. This set is also partitioned into E1 and E2, containing primary and secondary discrepancies respectively.
• M ⊆ D is the set of discrepancies that are missing according to the hypothesis h f, i.e. alarms related to these discrepancies should have been signaled but were not. This set is likewise composed of two disjoint sets M1 and M2 based on primary and secondary discrepancies.
• I ⊆ D is the set of discrepancies that are inconsistent with the hypothesis h f. These are discrepancies that are in the domain of f but cannot be explained in the current mode.
• U ⊆ D is the set of discrepancies that are not explained by the hypothesis h f: there is no failure propagation link between d ∈ U and any s ∈ {f} ∪ S ∪ C, i.e. the discrepancy is not in the domain of f.
For every scenario, the reasoner creates one special (conservative) hypothesis, H0, that associates a spurious detection fault with each of the triggered alarms.
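The hypothesis tuple above maps naturally onto a record type. The sketch below mirrors the field names from the text (with C, E, M split into their primary/secondary subsets); it is an illustrative encoding, not the authors' data structure.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """Encoding of h_f = {f, terl, tlat, S, C, I, M, E, U}."""
    f: str                                   # single physical failure mode
    terl: float                              # earliest activation estimate
    tlat: float                              # latest activation estimate
    S: set = field(default_factory=set)      # active cyber (protection) faults
    C1: set = field(default_factory=set)     # consistent primary discrepancies
    C2: set = field(default_factory=set)     # consistent secondary discrepancies
    E1: set = field(default_factory=set)     # expected primary discrepancies
    E2: set = field(default_factory=set)     # expected secondary discrepancies
    M1: set = field(default_factory=set)     # missing primary discrepancies
    M2: set = field(default_factory=set)     # missing secondary discrepancies
    I: set = field(default_factory=set)      # inconsistent discrepancies
    U: set = field(default_factory=set)      # unexplained discrepancies
```

Using `default_factory` keeps each hypothesis's sets independent, which matters because the reasoner maintains many hypotheses in HSet t simultaneously.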

Temporal Consistency:
The estimated states in a hypothesis need to be temporally consistent with the estimated states of the other nodes. Temporal consistency is a node-pair relationship that can be applied to any child-parent pair in the failure propagation graph (Abdelwahed & Karsai, 2006). Formally, a discrepancy d is temporally consistent with respect to a hypothesis h f if a set of constraints over the HNode t maps of d and its parent nodes all hold; the constraints relate the estimated activation intervals [terl, tlat] of d and its parents through the edge propagation times.

Hypothesis Ranking:
The quality of the generated hypotheses is measured using three metrics, Plausibility, Robustness, and Failure Mode Count, as explained in (Mahadevan et al., 2014). We extend this list with a new criterion, called Rank. The complete metric list is as follows: • Plausibility: a measure of the degree to which a given hypothesis explains the current fault and its failure signature. • Robustness: a measure of the degree to which a given hypothesis will remain constant as further alarms arrive. • Rank: a measure of the degree to which a given hypothesis (a single physical fault along with multiple cyber faults) completely explains the observed system events. • Failure Mode Count: a measure of how many failure modes are listed by the hypothesis. The reasoner gives preference to hypotheses that explain the alarm events with a limited number of failure modes (parsimony principle). This metric plays an important role in pruning H0 from the final hypothesis report.
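The metric equations are not reproduced in this version of the text, so the definitions below are assumptions in the spirit of TFPG-based reasoning (Mahadevan et al., 2014): plausibility rewards explained alarms, robustness measures how much of the signature is already resolved, and failure mode count implements the parsimony principle. Treat these as a sketch, not the authors' exact formulas.

```python
def plausibility(h):
    """Assumed: fraction of the hypothesis's discrepancies that are
    actually supported by observed, consistent alarms."""
    supported = len(h["C"])
    hypothesized = len(h["C"]) + len(h["M"]) + len(h["I"]) + len(h["U"])
    return supported / hypothesized if hypothesized else 0.0

def robustness(h):
    """Assumed: fraction of the failure signature already resolved
    (consistent or missing), i.e. not still pending in the expected set."""
    total = len(h["C"]) + len(h["M"]) + len(h["E"])
    return (len(h["C"]) + len(h["M"])) / total if total else 0.0

def failure_mode_count(h):
    """One physical fault plus however many cyber faults are listed."""
    return 1 + len(h["S"])
```

Under these definitions a hypothesis with many consistent discrepancies and few cyber faults dominates H0, which is exactly how H0 gets pruned from the final report.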

Reasoner Input Events
There are three types of events that cause the reasoner to update its hypotheses. The first two are external events: a change in the physical state of a monitored discrepancy, and a change in the system mode. The third is an internal timeout event that corresponds to the expectation of an alarm. A physical event is formally defined as a tuple e = (x, t), where x ∈ D obs ∪ M is either an observable discrepancy or a system mode. The timeout event is described as a tuple e = (h f, d, t), which states that, according to hypothesis h f, an alarm related to discrepancy d should have been signaled by time t.

Reasoner Response
This section describes in detail the behavior of the TCD reasoner by explaining the underlying algorithms that handle both internal and external events. The algorithm HandleDiscrepancyStateChangeEvent is invoked to update the appropriate hypotheses in HSet t. If none of the hypotheses can explain an event, a new hypothesis is created, as described by the algorithm CreateNewHypothesis. The mode change and timeout events are handled by HandleModeChangeEvent and HandleTimeOutEvent, respectively. The following subsections discuss these algorithms in more detail.
CreateNewHypothesis(d, t, m): Algorithm 1 deals with the creation of new hypotheses to explain the change in state of a discrepancy d. This procedure is triggered by the reasoner when the new state of the discrepancy d is not consistent with any of the existing hypotheses in HSet t. A new hypothesis is created (lines 2-3) for each failure mode with which the discrepancy d is temporally consistent. Further, for each hypothesis the sets of consistent (lines 4-5), expected (lines 6-7), missing (line 8), inconsistent (line 9), and unrelated (line 10) discrepancies are identified. Appropriate timeout events are added to the global event queue for every discrepancy in the expected set (lines 15-18).
HandleDiscrepancyStateChangeEvent(e, m): Algorithm 2 deals with updating every hypothesis in the set HSet t when a change is observed in the state of a discrepancy d. The change in discrepancy state is signaled by the event (d, t). For every hypothesis in HSet t, the temporal consistency of discrepancy d is checked by the routine TConsis t () (line 9), based on the constraints described in Section 6.2.2.
If the new state of d is ON and is temporally consistent with the hypothesis, then the discrepancy is moved from the expected sets (E1 or E2) to the consistent sets (C1 or C2) (lines 9-20). Further, new discrepancies are added to the expected sets (E1, E2) based on the failure propagation from discrepancy d (lines 21-31). Timeout events are also created for each new discrepancy added to the expected set, based on the maximum propagation time listed in the ET map (lines 23-29).
If the new state of d is OFF and it is temporally consistent, then the discrepancy is removed from the consistent sets (C1, C2) and the corresponding child discrepancies are deleted from the expected sets (E1, E2) (lines 32-49).
If the discrepancy d is not temporally consistent in the current system mode, then it is moved to the inconsistent set (lines 50-51), based on whether the observed state of the discrepancy is ON or OFF. The discrepancy d is added to the unrelated set when d is ON but not in the domain of f (lines 52-53). The above steps are bypassed if the discrepancy is associated with cyber faults; in that case, the parent failure mode is added to the secondary failure mode set of every hypothesis in HSet t.
HandleModeChangeEvent(e, m): Algorithm 3 updates the hypotheses in HSet t after every mode change. A mode change is reported to the reasoner when any of the underlying observers detects a change in the system mode. The expected set of each hypothesis is updated using the routine MConsis t () to include only those discrepancies that are reachable from the nodes in f ∪ C in the current system mode (lines 3-16). The timeout events are updated accordingly, based on the changes to the expected set (lines 17-33).
HandleTimeOutEvent(e): Algorithm 4 updates the hypothesis h f for a timeout event (h f, d a, t), which is triggered when the observed state of the discrepancy d a does not change to ON by time t. The discrepancy d a, listed in the expected set E1 (E2), is moved to the missing set M1 (M2). Additionally, a protection relay missed detection failure mode, i.e. F PAn DR de1, is added to the set h f .S if d a is a primary discrepancy associated with the protection relay PAn DR.
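The timeout handler is the simplest of the four algorithms, and its core update can be sketched directly from the description above: move the discrepancy from expected to missing, and hypothesize a missed-detection fault when a primary alarm never arrived. The dict-based hypothesis layout and the fault-naming convention are assumptions for illustration.

```python
def handle_timeout(h, d_a, relay_of):
    """Core of HandleTimeOutEvent for one hypothesis h (a dict of sets).

    d_a: the discrepancy whose expected alarm never arrived.
    relay_of: map from primary discrepancy -> responsible relay name
    (illustrative helper, not from the paper).
    """
    if d_a in h["E1"]:
        h["E1"].discard(d_a)
        h["M1"].add(d_a)
        # Primary alarm missing: hypothesize a missed-detection fault
        # (F_<relay>_de1) in the responsible protection relay.
        h["S"].add(f"F_{relay_of[d_a]}_de1")
    elif d_a in h["E2"]:
        h["E2"].discard(d_a)
        h["M2"].add(d_a)    # secondary: missing, but no fault is inferred
    return h
```

The asymmetry between E1 and E2 reflects the uncertain edges in the failure graph: a missing secondary (backup) alarm is explainable by over/under-reach, so it does not implicate the relay.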

CASE STUDY
The effectiveness of the reasoning approach is tested on the standard 9-bus system (Kundur et al., 1994). This system approximates the Western System Coordinating Council (WSCC) network with an equivalent system containing 9 buses and 3 generators. Figure 8 shows the one-line diagram of the 9-bus system. Table 6 lists the failure signatures for the transmission line faults. The failure graph related to cyber faults is similar to Table 5 and is not shown due to lack of space. The four scenarios considered in this paper are: • Scenario 1: A 3-phase-to-ground fault is introduced in the transmission line labeled TL 7 8, located between buses 7 and 8.
• Scenario 2: A zone 3 spurious detection fault is introduced in the relay PA4 DR, forcing the breaker PA4 BR to open.
• Scenario 3: A 3-phase-to-ground fault is introduced in the line TL 7 8, located between buses 7 and 8, and a stuck close fault is injected in the breaker assembly PA4 BR.
• Scenario 4: A 3-phase-to-ground fault is introduced in the line TL 7 8, located between buses 7 and 8. A missed detection fault in relay PA4 DR and a stuck close fault in breaker PA2 BR are introduced in the protection assemblies.
The following subsections present the simulation and diagnosis results.

Event Generation
Simulink's Simscape and Stateflow toolboxes (Simscape Power Systems User's Guide, 2017) are used to model and simulate the cyber-physical system under study and to produce the appropriate events that are fed to the diagnosis system. The simulation is carried out using a fixed-step discrete solver with a step size of 1 ms in phasor mode.
In scenario 1, a three-phase-to-ground fault is injected in the line at t = 0.5 secs, and both the primary protection elements (PA3 DR, PA4 DR) along with the secondary backup (PA2 DR) detect the fault by issuing Z1, Z2, and Z3 events at t = 0.501 secs. PA3 DR sends trip signals to relay PA4 DR and breaker PA3 BR at t = 0.501 secs. The trip signal received by relay PA4 DR reduces the zone wait time and forces the relay to issue a trip signal to PA4 BR at t = 0.502 secs.
The breaker assemblies PA3 BR and PA4 BR change their state from close to open at t = 0.532 and t = 0.533 secs, respectively, to isolate the fault.
In scenario 2, a spurious detection fault F de2 z3 is injected in the relay PA2 DR at t = 0.3 secs. This failure mode forces the relay to issue a Z3 event even in the absence of any transmission line fault. After waiting for the zone 3 wait time (1 sec), the relay issues a trip signal to breaker PA2 BR. The state of the breaker changes at t = 1.331 secs.
In scenario 3, a three-phase-to-ground fault is injected in the line at t = 0.5 secs and a stuck close fault is activated in breaker PA4 BR. As in scenario 1, PA3 DR, PA4 DR, and PA2 DR all detect the fault conditions and issue Z1, Z2, and Z3 events, followed by trip signals from PA3 DR to PA4 DR and PA3 BR. The breaker assemblies PA3 BR and PA4 BR receive trip commands at t = 0.501 and t = 0.502 secs. PA3 BR changes its state to open at t = 0.5332 secs. However, due to the stuck close fault in PA4 BR, the trip request is ignored and PA4 BR remains in the closed position. At t = 1.502 secs, the zone 3 wait time expires and PA2 DR checks for the fault condition again. Since the fault has not been cleared from the B8 side, PA2 DR detects the fault and sends a trip signal to breaker PA2 BR.
The breaker clears the fault by taking line TL 8 9 out of service at t = 1.533 secs.
In scenario 4, along with the three-phase transmission line fault, a missed detection fault in PA4 DR and a breaker stuck close fault in PA2 BR are injected at t = 0.5 secs. PA3 DR and PA2 DR detect the fault conditions and issue Z1 and Z3 events at t = 0.501 secs, while, due to the missed detection fault, PA4 DR fails to detect the fault. PA3 DR and PA2 DR issue trip signals to their respective breakers at t = 0.501 and t = 1.502 secs. The state of breaker PA3 BR changes at t = 0.532 secs, but PA2 BR remains in the closed state due to the stuck close fault.

Diagnosis Results
Figures 13, 14, 15 and 16 show the output of various observers and the TCD reasoning engine for the fault scenarios discussed in the previous section.
In scenario 1, a persistent transmission fault is introduced at t = 0.5 sec. The distance relays PA3 DR, PA4 DR, and PA2 DR detect the fault and report Z1, Z2, and Z3 events. The corresponding observers acknowledge these events and generate h Z1, h Z2, and h Z3 alarms, which are fed to the TCD reasoner. These alarms activate the d TL7 8 PA3, d TL7 8 PA4, d TL7 8 PA2, and d TL8 9 PA4 discrepancies and invoke the discrepancy state change event. These discrepancies produce three hypotheses, labeled H0, H1, and H2. H0 is the special hypothesis that blames a spurious detection fault in all the relays. H1 points towards a 3-phase-to-ground fault in TL 7 8 with three consistent discrepancies, whereas H2 lists a fault in TL 8 9 with one consistent discrepancy. At t = 0.531 sec, a timeout event occurs which removes the remaining discrepancies from the expected sets.
In scenario 3, a transmission fault in TL 7 8 and a stuck close fault in the breaker assembly are injected at t = 0.5 sec. The hypothesis set evolves in a similar fashion as described in scenario 1 until t = 0.532 secs. However, the observer PA4 BR OBS does not report a mode change and waits until t = 0.552 secs. At t = 0.553, the observer concludes a stuck close fault in the breaker and issues an alarm h stuck close, which is transformed into a cyber fault and added to every hypothesis in the hypothesis set.
Scenario 4 involves three faults: a transmission line fault in TL 7 8 along with a stuck fault in PA2 BR and a missed detection fault in PA4 DR. At t = 0.501, PA3 DR OBS and PA2 DR OBS report h Z1 and h Z3 alarms. These alarms produce two hypotheses, H0 and H1. H1 lists a fault in line TL 7 8 with two consistent discrepancies and expects a zone alarm from PA4 DR OBS. At t = 0.531, a timeout forces the expected discrepancy to shift to the missing set. H1 and H0 both point towards two failure modes: H1 lists the physical fault associated with line TL 7 8 along with a missed detection fault in PA4 DR, whereas H0 blames both distance relays for having spurious detection faults. At t = 1.552, PA2 BR OBS concludes a stuck fault in breaker PA2 BR after failing to receive a state change event. Both hypotheses are updated to reflect the breaker fault. Hypothesis H1 is given preference over H0, as the probability of two cyber faults is lower than that of one physical and one cyber fault (E. Schweitzer et al., 1997).

CONCLUSION
In this paper we presented a new approach to diagnosing faults in cyber-physical systems while considering possible faults in the controllers that can change the mode of behavior of the system. This approach, called Temporal Causal Diagrams, extends our prior work on temporal failure propagation graphs by capturing the interaction between failure propagation graphs and discrete, timed behavior models that capture the controller semantics.
The TFPG definition is extended to include uncertain edges. However, this uncertainty leads to an inherent limitation: missed detection faults in secondary protection devices cannot be diagnosed.
Finally, we demonstrated the extended diagnostic procedure on the WSCC 9-bus power transmission system. We are currently working on extending the diagnostic technique to provide a holistic solution that predicts imminent failure modes and presents fault mitigation strategies. We are also interested in the automatic synthesis of TCD models from the system topology. However, writing such transformations is domain dependent and requires a good understanding of the underlying domain.

ACKNOWLEDGMENT
This work is funded in part by the National Science Foundation under award number CNS-1329803. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF. The authors would like to thank Rishabh Jain, Srdjan Lukic, Scott Eisele, and Amogh Kulkarni for their help and discussions related to the work presented here.

Figure 1. An interconnected system consisting of a network of physical processes, sensors, actuators, and fault managers
Figure 2. TCD Diagnosis System

Figure 3. A sample temporal failure graph along with the behavior automata of different cyber components in both faulty and nominal modes

Figure 4. A simple two transmission line system

Figure 6 also shows an abstract model of a single-phase breaker. The different failure modes and events associated with the breaker behavioral model are summarized in Table 3. The breaker observer model is shown on the right side of Figure 7; it consists of 4 states labeled open, close, opening, and closing, which correlate directly to the 4 states of the breaker automaton.

Figure 7. Protection System Observer Models: Distance Relay Observer Model (left); Breaker Observer Model (right)
The failure mode F PAn DR de1 embodies a missed detection fault in the protection relay PAn DR; the associated discrepancy is d PAn DR de1. F PAn DR de2 zk implies a zone k spurious detection fault in the PAn DR protection relay. These two families of cyber faults are not linked to any alarms, as they are inferred by the TCD reasoner from the system failure propagation graph. The faults F PAn BR SC and F PAn BR SO imply stuck close and stuck open faults in the breaker PAn BR. These are linked to the discrepancies d PAn BR SC and d PAn BR SO, which are signaled by the alarms h stuck close and h stuck open raised by their respective observers.

Figure 8. WSCC 9 Bus System One Line Diagram

Figure 16. Diagnosis results for scenario 4
• Discrepancy Set: D cyber is a finite set of discrepancies associated with the component behavior, partitioned into observable and unobservable discrepancies.
• Failure Mode Set: F cyber is a finite set of unobservable failure modes associated with the component. As with a failure node in a TFPG, a failure mode has ON and OFF states. δ t is a function defined over F cyber × R+ that maps a failure mode f ∈ F cyber at time t ∈ R+ to True if the state of the failure mode is ON and to False if it is OFF.
• Failure Mode Constraints: α(F cyber) represents the set of all constraints defined over members of the set F cyber. An individual failure mode constraint ω t ∈ α(F cyber) is a Boolean expression defined inductively.

Table 1. TCD language elements (failure modes and events) associated with the distance relay behavior and observer models

Table 2. Different states of the distance relay model

The breaker state machine consists of 4 states:
• open: the physical state of the breaker is open.
• close: the physical state of the breaker is close.
• opening: due to the mechanical nature of the breaker assembly and zero-crossing detection, the transition from the close state to the open state is not instantaneous. The opening state is the intermediate state in which the breaker has received the command to open but its physical state is not yet open.
• closing: similarly, closing is an intermediate state in which the breaker assembly has received a closing command but its status is not yet closed.

Table 3. Language elements (failure modes and events) for the breaker behavior and observer models
Table 3 lists all the events associated with the breaker observer model. Initially the state machine is in the close state and jumps to the opening state after observing a cmd open event. The breaker observer transitions to the open state if it receives an st open event from the breaker assembly within t4 seconds. t4 is a model parameter that is equal to the sum of the propagation time and the maximum time required to open the breaker. If the event is observed within the time limit, the observer concludes that the physical state of the breaker is open; otherwise it hypothesizes that the breaker has a stuck fault. The fault is signaled by generating an h stuck close event. Similarly, when the breaker is in the open state it has the same timed behavior, and an h stuck open event is generated if an st close event is not observed within t4 seconds of receiving the cmd close event. The above-mentioned observers (local diagnosers) are created manually by merging edges and states that do not have observable events associated with them.
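The breaker observer's timed check can be sketched as a scan over a timed event trace: after cmd open it waits t4 for st open and raises h stuck close if the confirmation never arrives (and symmetrically for cmd close / st close / h stuck open). The event-list encoding and the default t4 value are illustrative assumptions.

```python
def breaker_observer(events, t4=0.06):
    """Simplified breaker observer over a trace of (time, name) events.

    Returns the list of stuck-fault alarms emitted. t4 = propagation
    time + maximum breaker transition time (value assumed here).
    """
    alarms = []
    pending = None                       # (deadline, expected event, alarm)
    for t, name in sorted(events):
        if pending and t > pending[0]:
            alarms.append(pending[2])    # deadline passed: declare stuck
            pending = None
        if name == "cmd_open":
            pending = (t + t4, "st_open", "h_stuck_close")
        elif name == "cmd_close":
            pending = (t + t4, "st_close", "h_stuck_open")
        elif pending and name == pending[1]:
            pending = None               # state change confirmed in time
    if pending:
        alarms.append(pending[2])        # trace ended with no confirmation
    return alarms
```

This matches the scenario 3 trace above: PA4 BR never emits st open after its trip command, so its observer raises h stuck close once t4 has elapsed.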

Table 4. Failure propagation graph for the 2-transmission-line system (Physical Failures)

Table 5. Failure propagation graph for the 2-transmission-line system (Cyber Failures)

• An observed state is the same as the physical state except when there are sensor/alarm failures. The observed state at time t is represented as a function defined over the observable discrepancies, ONode_t : D_obs → {ON, OFF} × R, where D_obs ⊂ D is the set of observable discrepancies.
• A hypothetical state is an estimate of the node's physical state and of the time since the last state change happened. Formally, the hypothetical state at time t is defined as a map HNode_t : V → {ON, OFF} × R × R, where V = D ∪ F is the set of failure and discrepancy nodes. The hypothetical state is defined for both discrepancies and failure modes. HNode_t(v).terl and HNode_t(v).tlat denote the earliest and latest time estimates for the state change of node v, i.e., from ON to OFF or vice versa.

An ON state for a failure node implies the presence of the fault; otherwise the node is in an OFF state. For discrepancy nodes, an ON state implies that the failure effect has reached that node. Similarly, for edges the function PEdge_t : E → {ON, OFF} × R gives the physical state of an edge at time t; the ON (OFF) state implies the edge is active (inactive). PNode_t(v).state and PEdge_t(e).state represent the state of node v and edge e at time t, while PNode_t(v).time and PEdge_t(e).time denote the last time the states of the nodes and edges were updated.

Algorithm 1 CreateNewHypothesis(d, t, m): Algorithm for creating a new hypothesis
1: Input: d ∈ D, t ∈ R+, m (current system mode)
2: for all f ∈ Parent(d) ∩ F do
3:   if PEdge(f, d).state = ON and ET(f, d).tmin ≤ (t − PEdge(f, d).time) ≤ ET(f, d).tmax and m ∈ EM(f, d) then
…

Algorithm 2 HandleDiscrepancyStateChangeEvent(e, m): Algorithm for handling a discrepancy state change event
1: Input: e = (d, t), where d ∈ D, t ∈ R+; m (current mode)
2: isExplained = FALSE
3: for all h ∈ HSet_t do
4:   if … ∈ h.C1 ∪ h.C2 ∪ h.E1 ∪ h.E2 ∪ h.M1 ∪ h.M2 and m ∈ (d.d1).EM then
…

1: Input: e = (h_f, d_a, t), where h_f ∈ HSet_t, d_a ∈ D, t ∈ R+
2: if d_a ∈ h_f.E1 then
…
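The hypothesis-creation step of Algorithm 1 can be sketched directly from the definitions above: a parent failure mode f explains discrepancy d at time t only if the edge (f, d) is active, the elapsed time lies within the temporal window [ET(f, d).tmin, ET(f, d).tmax], and the current mode m enables the edge. This is a hedged sketch, not the paper's implementation: the Edge dataclass and the `parents`/`edges` containers are assumptions used to stand in for the Parent, ET, EM, and PEdge maps of the text.

```python
from dataclasses import dataclass

ON, OFF = "ON", "OFF"

@dataclass
class Edge:
    """Stand-in for a failure-propagation edge (PEdge/ET/EM in the text)."""
    state: str          # ON: the edge is currently active
    time: float         # last time the edge's state was updated
    tmin: float         # ET(f, d).tmin: earliest propagation delay
    tmax: float         # ET(f, d).tmax: latest propagation delay
    modes: frozenset    # EM(f, d): system modes in which the edge is enabled

def create_new_hypotheses(d, t, m, parents, edges):
    """Sketch of Algorithm 1 (CreateNewHypothesis): return the parent failure
    modes f in Parent(d) ∩ F that could explain discrepancy d observed at
    time t in system mode m."""
    candidates = []
    for f in parents.get(d, []):
        e = edges[(f, d)]
        # Edge active, elapsed time inside [tmin, tmax], mode enables the edge.
        if e.state == ON and e.tmin <= (t - e.time) <= e.tmax and m in e.modes:
            candidates.append(f)
    return candidates
```

With a single active edge from a transmission-line fault F_TL1 to its discrepancy, the function returns the fault as a candidate hypothesis when the observation time falls inside the propagation window and rejects it otherwise.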

Table 6. Temporal Failure Propagation Graph for WSCC 9 Bus System

D_cyber: Finite set of discrepancies associated with cyber failure modes.
d_PAn_BR_SC: Discrepancy associated with a stuck-closed fault in breaker PAn_BR.
d_PAn_BR_SO: Discrepancy associated with a stuck-open fault in breaker PAn_BR.
d_PAn_DR_de1: Discrepancy associated with a missed-detection fault in distance relay PAn_DR.
d_PAn_DR_de2_zk: Discrepancy associated with a zone-k spurious-detection fault in distance relay PAn_DR.
d_TLn_PAk: Discrepancy related to a fault in component TLn, signaled by the distance relay in protection assembly PAk.
D: Nonempty set of discrepancy nodes related to faults in physical components.
F_cyber: Finite set of failure modes associated with cyber components.
F_PAn_BR_SC: Stuck-closed fault in breaker PAn_BR.
F_PAn_BR_SO: Stuck-open fault in breaker PAn_BR.
F_PAn_DR_de1_z1: Missed-detection fault associated with distance relay PAn_DR.
F_PAn_DR_de2_zk: Zone-k spurious-detection fault associated with distance relay PAn_DR.
F_TLn: Failure in transmission line TLn.
F: Nonempty set of failure nodes in physical components.
h_f: Hypothesis related to physical fault f.
HNode_t(n): Map that defines the hypothetical state of node n in the failure graph at time t.
HSet_t: Set of all hypotheses at time t.
ONode_t(n): Map that defines the observed state of node n in the failure graph at time t.
PAn_BR: Circuit breaker in protection assembly PAn.
PAn_DR: Distance relay in protection assembly PAn.
PAn: Protection assembly labeled PAn.
PEdge_t(e): Map that defines the physical state of edge e in the failure graph at time t.
PNode_t(n): Map that defines the physical state of node n in the failure graph at time t.