Fmdtools: A Fault Propagation Toolkit for Resilience Assessment in Early Design

Incorporating resilience in design is important for the longterm viability of complex engineered systems. Complex aerospace systems, for example, must ensure safety in the event of hazards resulting from part failures and external circumstances while maintaining efficient operations. Traditionally, mitigating hazards in early design has involved experts manually creating hazard analyses in a time-consuming process that hinders one’s ability to compare designs. Furthermore, as opposed to reliability-based design, resilience-based design requires using models to determine the dynamic effects of faults to compare recovery schemes. Models also provide design opportunities, since models can be parameterized and optimized and because the resulting hazard analyses can be updated iteratively. While many theoretical frameworks have been presented for early hazard assessment, most currentlyavailable modelling tools are meant for the later stages of design. Given the wide adoption of Python in the broader research community, there is an opportunity to create an environment for researchers to study the resilience of different PHM technologies in the early phases of design. This paper describes fmdtools, an attempt to realize this opportunity with a set of modules which may be used to construct different design models, simulate system behaviors over a set of fault scenarios and analyze the resilience of the resulting simulation Daniel Hulse et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. results. This approach is demonstrated in the hazard analysis and architecture design of a multi-rotor drone, showing how the toolkit enables a large number of analyses to be performed on a relatively simple model as it progresses through the early design process.


INTRODUCTION
Resilience, the ability to prevent and mitigate hazards, is a key consideration in the design of complex engineered systems (Cottam et al., 2019;Yodo & Wang, 2016a). In the aerospace industry, for example, it is important that aircraft adapt and recover from hazards (Choi, Atkins, & Yi, 2010), airports reconfigure runways in the event of damage (Faturechi, Levenberg, & Miller-Hooks, 2014), and supply chains that mitigate disruptions (Treuner, Hübner, Baur, & Wagner, 2014). Because resilience factors heavily into the risk, safety, and functional reliability of a system, it is important to proactively consider resilience in the early design phase, when there is the most freedom to consider alternatives and allocate PHM features (Yodo & Wang, 2016b;Youn, Hu, & Wang, 2011). It is often helpful to support this consideration with a cost-benefit analysis that can then be used to provide a coherent business case for PHM system development (Banks, Reichard, Crow, & Nickell, 2009).
Considering hazards in early design involves representing the system in a high-level function structure to identify hazards and develop resulting design requirements (Stone, Tumer, & Van Wie, 2005). In the design of aircraft, this process is . fmdtools is intended specifically to provide fault analysis methods that enable the consideration of risk in early conceptual design processes called functional hazard assessment and follows the ARP4761 guideline (ARP, 1996;Allenby & Kelly, 2001). Model-based functional hazard assessment has been an active research area (Krus & Lough, 2009;Kurtoglu & Tumer, 2008;Noh, Jun, Lee, Lee, & Suh, 2011), with many new methods focusing on how to model different aspects of the system, such as human errors (Irshad, Ahmed, Demirel, & Tumer, 2019), dynamic behaviors, new flows resulting from failures (Jensen, Tumer, & Kurtoglu, 2009), and operational decision-making (Short, 2016). However, there has been less demonstration of how to use this information to compare design alternatives, and the research codes underlying these methods have not been shared within the research community. To enable this resiliencebased design, then, there is an opportunity to develop a tool that enables one to assess the resilience of a model without reimplementing underlying data structures and fault propagation methods.
The main contribution of this work is the development and study of a computational environment for performing resilience-based hazard assessment early in the design process. This paper presents this contribution by first describing the fmdtools project, an open-source python-based toolkit for the high-level modelling, simulation, and analysis of resilience. It then demonstrates the usefulness of this toolkit in early design by showing how it can support the analysis of models as a design increases in fidelity. In doing so, it seeks to enable the practical application of early model-based functional hazard assessment frameworks while explicitly enabling the consideration of resilience. An approach like this has a number of potential applications to considering PHM in aerospace systems by supporting cost-benefit analysis (e.g., (Holzel, Schilling, & Gollnick, 2014)), which can used to allocate resources for PHM (Youn et al., 2011) and assess de-sign alternatives (e.g. prevention or recovery schemes) (Hulse, Hoyle, Goebel, & Tumer, 2019b). While most traditional risk assessment methods focus on how faults lead to system-level failures, the resilience-based approach used in this work represents the full set of dynamic effects that result from a given fault scenario, so that one can compare the effect of different recovery actions.
The modelling framework described here additionally represents an advancement on current modelling paradigms used in early design. One major difficulty in modelling failure propagation is representing the system in a way that captures the full set of effects which would be caused by a fault (Hönig, Lunde, & Holzapfel, 2017). While a number of formalisms have been developed that enable this (e.g. modelica (Bunus, Isaksson, Frey, & Münker, 2009), simulink/lustre (Joshi & Heimdahl, 2007), etc), these models are often computationally expensive and are more applicable for later design when one has the complete set of system behaviors. As a result, previously-presented simulation-based functional hazard assessment methods have often been defined directly in code in a base language (e.g. MATLAB/Python). In this setting, it is convenient to express the system using a procedural, directedgraph representation of system behaviors where each function is defined in a method and where the output flows of each function are used as inputs to the next function in the graph. The main problem with this representation is that it can only represent flow propagation that occurs in a single directionfrom the source functions where flows originate to the sink functions where flows terminate. Defining a model in a base language also makes it difficult to then structure the model in a way that logically organizes faults, states, and behaviors of functions to be be easily understood, visualized, and modified. To improve on this approach, fmdtools uses a object-oriented, undirected-graph representation of system behaviors, where each function's states, behaviors, and faults are defined in a function class and a model class is used to connect adjacent functions so that behaviors can propagate in any direction through the model graph and one can easily parse the structure of the system model.
The rest of this paper is organized as follows: Section 2 discusses and compares the approach used by fmdtools with existing fault propagation toolkits to show how it compares with the state-of-the-art as well as uses of fault injection in other fields. Section 3 the underlying model representation, fault propagation, and visualization algorithms provided in the toolkit. Section 4 provides an example of using fmdtools to model on a multirotor drone case study with increasing fidelity in the design process and using it to inform decision making. Finally, Section 5 concludes with some assessment of the usefulness of the tool and directions for future work.

BACKGROUND
Fault propagation is widely used in a number of domains to assess the safety and resilience of a system of interest. As shown in Figure 1, while most current tools focus on modelling the system in the later design stages and in the verification and validation process, when there are detailed models of the system, fmdtools is meant to support early design processes. To do this, it provides analyses that support each phase of the function-behavior-structure design process commonly used in engineering design (Howard, Culley, & Dekoninck, 2008).
In this process, one creates a functional decomposition of the tasks the system is to perform, finds solution principles to achieve those tasks, and then synthesizes those principles into a realized design concept. These design processes are supported by static failure-logic functional hazard assessment and network models, dynamic behavioral models, and hierarchical fault models, respectively.
A number of modelling formalisms have been developed to assess the risk of hazards in engineered systems, including fault trees, bayesian networks, and stochastic petri nets (a type of discrete event simulation) (Chemweno, Pintelon, Muchiri, & Van Horenbeek, 2018). While the simpler methods (fault trees, bayesian networks) enable stronger proofs of system dependability (Chemweno et al., 2018), they do not express the system resilience-the behavior over time resulting from a fault. In these situations, a discrete event simulation or continuous dynamic model is used. Discrete event simulation (e.g. (Matloff, 2008)) has been used to assess and design resilience into systems, including maintenance (Wang, Cui, & Shi, 2015) and recovery aspects (Miles, 2018). Similarly, Monte Carlo techniques are often used with continuous simulation models to quantify risk of hazardous states (Hu, 2005). While these approaches describe the underlying general approaches to simulating faulty behaviors in a system, the following sections describe specific toolkits that use these approaches to quantify risk and resilience.

Related Work
As shown in Table 1, there have been some prior efforts to develop generally-applicable toolkits for the design of resilience in a model-based engineering process. One major difference between toolkits is the way they represent causality in the system to determine the effects and propagation of faults, which has a number of potential aspects, including the types of failure paths able to be represented (through the system behavior, failure logic, state transitions or a hybrid) (Hönig et al., 2017), and the nature of causality (probabilistic or deterministic). To integrate with design activities, currently-available frameworks are built around either a standardized modelling language (e.g. AADL), or an existing systems modelling tool such as Simulink or Modelica. While this enables one to use a single unified model for multiple analyses through the process, it can also be limiting if certain aspects of fault propagation are difficult to represent in the given formalism. Finally, while each of these toolkits are claimed to support different design processes, the design being performed is later stage embodiment design-not the early conceptual design shown in Figure  1. The main exception to this is Hip-Hops (Hierarchically Performed Hazard Origin and Propagation Studies) and IBFM (Inherent Behaviors of Functional Models), which both target early design processes. As a result, the implementation of early functional risk assessment tools has been cited as a necessity to bridge the gap between model-based hazard assessment methods in literature and their adoption in industry (Grigoleit et al., 2016).
Development of the fmdtools toolkit was initiated to address limitations with the previously-developed IBFM simulator (McIntire et al., 2016). While IBFM was developed for the early, high-level functional design of engineered systems and was able to determine the end-states of large sets of system faults, the behavioral representation was limited to finite states (Zero to Highest), a set of given possible operations on input and output flows, event-based dynamic propagation, and a given syntax for specifying failure logic. This limited its ability to express all possible behaviors that could occur in a system as a result of a given fault. Additionally, the text-based model representation made it difficult to parameterize models and simulate them iteratively in an optimization algorithm. In this work, these limitations are addressed by defining the model as a set of interacting Python classes with possible faults and (nominal or faulty) behaviors to simulate over a set of discrete time-steps. With this model representation, one can easily express and optimize the dynamic behavior of the system without being limited by expressiveness of the underlying modelling tool. fmdtools has additionally since expanded in scope to include not just fault propagation tools but the necessary analysis and visualization tools needed to interpret Toolkit Causality Representation Model Format(s) Availability Use in design HiP-Hops (Papadopoulos & McDermid, 1999) Dynamic simulation with failure logic Simulink, Simula-tionX, AADL, etc.

MATLAB code
Open Source Resilient, reliable, robust design under uncertainty (Patelli & Broggi, 2015) IBFM (McIntire et al., 2016) Bond graph with failure logic Text files, Python Open Source Functional decomposition and design optimization (Hulse et al., 2019a) fmdtools Dynamic un-directed behavioral propagation with failure logic

Python subclasses Open Source
Early resilience-based design, analysis, visualization and optimization

Other Fault Modelling Tools
In addition to the fault propagation toolkits mentioned above, there are additionally a number of toolkits used for fault propagation in safety assessment and for application-specific resilience assessment.
Safety assessment toolkits are used to verify that a given design meets desired safety and reliability criteria (Hönig et al., 2017;Joshi et al., 2006). Typically, these tools take a design model specified in a formal language (e.g. Simulink (Joshi & Heimdahl, 2005), Lustre (Joshi & Heimdahl, 2007), Modelica (van der Linden, 2014), AADL (Stewart, Liu, Whalen, Cofer, & Peterson, 2018), UML (Combemale, Crégut, Giacometti, Michel, & Pantel, 2008)) and apply a model checker to the system description to find cases where the system does not work as intended. While this process can be performed in a nominal system model, model-based safety assessment tools extend this process to assess safety by including failure modes and faulty behaviors in the system description (Joshi & Heimdahl, 2005, 2007. However, because the intended purpose of these tools is to verify safety requirements post-design (see the right side of the v-model in Figure 1), they typically rely on detailed, fully-specified models of the design that are not available early in the design process (Grigoleit et al., 2016). Furthermore, these tools are not intended to assess resilience-the dynamic effects of failures due to faults-but instead assess the safety or reliability of the system (i.e. an overall fault tree or failure probability). As a result, they are not applicable to the design of resilience in the early design process when the system has not been fully specified and the designer is interested in assessing the system's dynamic response to faults.
Fault injection is used widely in software and hardware engineering to assess a computer system's ability to manage the different types of faults. Existing simulators vary by domain (e.g. distributed systems (Martins et al., 2013), servers (Kollárová, 2014), stand-alone systems) and type of faults (hardware-originated (Georgakoudis, Laguna, Vandierendonck, Nikolopoulos, & Schulz, 2019;Schirmeier et al., 2015) or software components/algorithms (Goldstein et al., 2020;Arribas, Nikova, & Rijmen, 2018)), though generic simulators also exist (Winter et al., 2015). One of the advantages of software (as opposed to hardware/systems) engineering is that this testing can be iteratively performed on a running prototype of the system at a low computational cost, and as a result, these tools do not operate on models of the system as is needed in engineering design.
There are additionally a number of application-specific resilience assessment toolkits that interface with specific system simulators to model the resilience-related attributes for that domain. For example, integrated circuits have well-defined properties that have led to a number of specialized fault simulation algorithms, which have since been implemented in software (May & Stechele, 2012;Niermann, Cheng, & Patel, 1992). Infrastructure often has specific hazards to assess, which has resulted in tools to consider natural disasters for cities (Fraser et al., 2016;McKenna, 2011) and cyber-physical threats in smart grids (Wadhawan & Neuman, 2017). Finally, assessing the hazards of autonomous vehicles in a real system is both costly and hazardous (Gambi, Müller, & Fraser, 2019), which has led to the development of a number of simulators that enable one to try different policies to approaching haz-

Design
Simulation Analysis Figure 2. The fmdtools design, simulation, and analysis environment. ards which the vehicle will encounter (Jha, Banerjee, Cyriac, Kalbarczyk, & Iyer, 2018;Jha et al., 2019). While these toolkits can assess resilience in specific design contexts with high fidelity, they are not applicable to a generic systems design context.

METHODS AND ALGORITHMS
The fmdtools toolkit aims to provide a design, analysis, and simulation environment that enables the design of resilience into a system. To accomplish this, it provides a number of tools to represent, simulate, and analyze the system as it progresses through the design process, as shown in Figure 2. As a result, it can accommodate a number of modelling and analysis usecases to progress from the early, abstract representations of the design (e.g. network and static propagation models) to more detailed representations of the system structure (e.g. dynamic and hierarchical propagation models) and behaviors in the same modelling environment, as shown in Figure 3.
The full implementation of this work is provided in a publiclyavailable repository, along with examples and documentation at github.com/DesignEngrLab/fmdtools or (Hulse, Walsh, Biswas, & Zhang, 2021). While an exhaustive description of every method and class is out of the scope of this paper, this section will discuss some of the underlying concepts and structure of the toolkit. The fmdtools toolkit is organized into different modules used for model representation, simulation, and analysis. The modeldef module provides the classes to define a system model from functions, flows, and components as well as a fault sampling approach for resilience quantification. The faultsim module then provides methods for propagating faults in a model and quantify network metrics. Finally, the resultdisp module provides a number of methods to process and visualize simulations of the model, including behaviors over time, FMEA tables, fault graphs, and heatmaps.

Model Representation
In this work a system consists of functions (modules that perform a task), components (specific solutions to a function), and flows (variables) that are connected with each other in a bipartite graph. In a model (itself defined by a user-defined class), functions, flows, and components are represented by ob-  Figure 3. Fault model types and analyses enabled by fmdtools. Note that since model types build on each other, the analyses from less detailed model types (e.g. static propagation models, network models) can apply to more detailed model types (e.g. dynamic propagation models, hierarchical propagation models).  jects instantiated from user-defined classes that are connected by a graph, as shown in Figure 4 for a model of a wire. In this representation, each function (e.g., Transport EE) consists of its associated flow objects (e.g., EE 1, EE 2), internal state variables, set of faults, behavior methods, and constituent component objects (if a component representation is used). Flows are in turn defined by dictionaries of states with corresponding values and components are defined by internal state variables, a set of faults, and behavior methods. Model objects are then composed of their constituent functions, flows, and components as well as a graph object used to track the connections between functions and flows and a classification method used to quantify the costs of a fault scenario propagated in the model. This model is constructed in fmdtools by defining a subclass that extends the Model class to represent a system of interest. This model class is constructed by defining the simulation parameters (e.g. units, timesteps, starting and ending simulation times, design parameters), adding each flow in the model, adding each each function in the model and connecting them with flows to construct the model graph, and creating a classification method which determines the rate, cost, and expected cost of a given scenario given the results and parameters for that simulation-the end-state of functions and flows (e.g. values and faults present), the scenario properties (e.g. rates, faults, etc.), and the history of model states (values for flows over time). By instantiating this class, one can then use the resulting object to model the evolution of system states over time for a given set of defining parameters.

Functions
In design, functions represent the high-level tasks performed by the system (Pahl & Beitz, 2013). In fmdtools, the FxnBlock class is used to represent these functions, which constitute the main building block of the system model defining faults and system behaviors. To use this class, users define a subclass for each function which uses inherited FxnBlock attributes to represent the properties of the function. At its most basic, a user-defined function class is defined by the flows going in and out of the function, a behaviour method which relates values of input flows to values of output flows, and a set of faults which modify the input-output relationship defined in the behavior method. However, more attributes can be defined, including states (internal variables of the function tracked in the model history), conditional fault methods defining input/output flows or function states result in faults, timers that can be used to express delays in behaviors, and components, described in Section 3.1.3.
To enable expected resilience quantification, each fault to inject in the function can be associated with a probability model that expresses the likelihood of the fault occurring. This probability model is defined by a function-wide rate, the proportion (or rate/probability) of faults resulting from the mode, an opportunity vector expressing the relative likelihood of a fault occurring at a specific phase of operation, and a specification of whether these values are a rate (with units) or a probability. Each fault can additionally be given a cost of repair that can be used to calculate the costs of fault scenarios.

Flows
Flows represent the states (e.g. energy, material, or signal) with which the functions interact to achieve the overall goal of the system. In fmdtools, the Flow class is used to represent these states, which can either be extended in a user-defined subclass (if there are special properties the designer wishes to represent) or instantiated using a dictionary with the name of the flow, name of the values characterizing the state of the flow, and initial quantity for each value. Because of the undirected nature of the model graph and associative relationships between functions and flows, flows are accessible (i.e. values can be changed) in all connected functions.

Components
While functions represent the task a system performs, components can be used to represent realizations of that function. Often in risk or resilience-based design, one is interested in designing component architectures which will fulfill an overall function, even when an individual component fails (Youn et al., 2011). To represent this, fmdtools uses the Component class, which shares many attributes with the Function class, including internal states, a behavior method, and a set of fault modes. Similarly, to use the Component class in a model, one must define a subclass for the modelled component with its own states, fault modes, and behavior method. However, unlike the FxnBlock behavior method (which acts on FxnBlock attributes and does not return anything), Component behavior methods explicitly take the inputs of the component behavior as input and return outputs of the behavior as output. This is done to enable the designer to relate the inputs and outputs of the components in the architecture fulfilling a function with each other and function states in the behavior methods of its corresponding function.

Resilience Simulation
fmdtools has two main approaches for assessing the resilience of a model: quantifying network metrics of the system architecture and propogating faults in the system model. Network metric quantification, described in Section 3.2.1 and implemented in the networks submodule enables some consideration of the structural resilience of the system before specifying the fault logic in the system. Fault propagation, described in Sections 3.2.2 and 3.2.3 can then be used to assess the resilience of a model given the model has faults to propagate in each function (or the function of interest) and the functions each have the necessary fault logic and/or behaviors.

Network Metric Quantification
The network metric quantification tools in fmdtools enable the early assessment of the failure tolerance and resilience of a design. Network analysis is an emerging early design methodology for predicting the likely failure tolerance of a design without the need for high fidelity models (Haley, Dong, & Tumer, 2016;Mehrpouyan, Haley, Dong, Tumer, & Hoyle, 2013). In this approach, engineered systems are represented as networks in which nodes represent functions, parameters, or components depending on the specific network formalism. In short, network analysis enables analysis of the topology that emerges from the connectivity between system elements. The representation of connectivity between system elements is similar to that of a design structure matrix (DSM), and network theory enables visualization and powerful analyses of emergent network properties. Networks are used for both a priori assessment of resilience properties as well as for quantifying the network's response to simulated faults.
Resilience Properties of Networks Known relationships between structure and failure tolerance (Boccaletti, Latora, Moreno, Chavez, & Hwang, 2006) in complex networks enable the a priori assessment of a network's resilience properties prior to simulation. The structure of a network is intrinsically related to its functionality; structural vulnerabilities are therefore relatable to loss of functionality. In networks, failure tolerance is typically studied by attacking or removing nodes and measuring the resultant degradation of the network structure. That is, much as loss of functionality in one component in an engineered system leads to decreased overall performance, degradation or removal of one node in a network leads to an alteration of the network's topology. In this way, a network's resistance to failure is relatable to an engineered system's resistance to failure. Certain networks are more prone to degradation due to node removal than others. Likewise, certain nodes are more prone to causing degradation than others. In component networks, failure or removal of a component node implies loss of functionality in that component, and the analysis therefore relates to the effects of the failure of that function. Network analysis of the effects of function node failure is complementary to, for example, FMEA in that it enables an assessment of the effects of a failure and its severity.
A common method for characterizing a network is studying its degree distribution. In an undirected network, the degree of a node is its the number of connections (edges). Nodes with high degree (hubs) tend to be more critical in retaining a network's functionality. A network's degree distribution is a histogram of the degrees of all nodes in the network. Degree distributions that follow a bell-curve tend to be more vulnerable to random node removal, whereas degree distributions that follow a power law tend to be more vulnerable to targeted node removal (Barabási, 2009) (i.e. removal of a hub). High degree nodes are identifiable using find high degree nodes. Degree distributions are provided using degree dist.
The modularity and community structure of a network also have significant bearing on the network's failure tolerance. Modules, or communities, are tightly coupled groups of nodes. The modularity of a network, the degree to which a network exhibits a modular structure, is typically measured using Qmodularity (Newman, 2010). Nodes that connect modulesbridging nodes -are functionally important to a network's failure tolerance (Walsh, Dong, & Tumer, 2018). This is comparable to, for example, identifying high severity failures in FMEA. Networks with high modularity have been found to be less robust overall (Walsh, Dong, & Tumer, 2019). find bridging nodes is provided to identify bridging nodes in the network model. calc modularity is provided to compute the modularity of the network model.
Average shortest path length (ASPL) is a measure of the efficiency of the spread of information through a network. Under attack, networks with low ASPL are more likely to retain short to moderate length paths between any given pair of nodes, whereas networks with high ASPL are more likely to disintegrate significantly under attack. ASPL is defined as the mean of the sum of all edge weights along the shortest path between each pair of nodes in a network and is available as calc aspl. Simulation-Based Analysis of Network Resilience Rather than using a network's structure to predict its response to failure, it is possible to use simulation-based approaches to network analysis. First, robustness coefficient simulates the effect of failure in the network. This approach, implemented as calc robustness coefficient, measures the changing size of the largest connected component of a network during successive node removal (Viana, Tanck, Beletti, & da Fontoura Costa, 2009). A second simulation-based approach is an SFF (susceptible-failed-fixed) epidemic spreading model, which is able to explicitly represent node recovery (Mehrpouyan et al., 2013). Rather than attacking or removing nodes as in the robustness coefficient, this model considers nodes to be in a susceptible, failed, or fixed state. Failed nodes may cause susceptible nodes to fail, similarly to infected persons spreading an epidemic. After a node is fixed, it is unable to fail again by the same cause (immunity). The SFF model is available as sff model.

Fault Propagation
Propagation of faults in a model has two major aspects: static fault propagation and dynamic fault propagation. Static fault propagation is the process of determining the immediate effects of a fault in a system, as illustrated in Figure 5. First, all of the behavior methods are run. If a new value occurs in one of the functions (e.g., because of fault injection) or its associated flows, that function and functions adjacent to the changed flows are added to a list of functions to update. The behavior functions for those functions are then run and new functions to update are again added if they receive new input values. This process is run iteratively until the system reaches a stable end-state, if a stable end-state is possible. Thus, one necessary property for fault models is stable fault behavior-behaviors in one function should not change behaviors in other functions that will in turn change the original behaviors in the original function repeatedly, indefinitely.
As shown in Figure 4, functions have associative relationship with flows, meaning functions have full access to (and can change the values of) the states of both "input" and "output" flows. Because the propagation of system states is undirected, functions have the ability to propagate new system states to any other functions in the graph-not just the function that would be placed "next" in the sequence of tasks to perform. However, because of the undirected system representation, conflicts between function behaviors can occur when different functions specify different values for the same flow state, resulting in a non-convergent system state at a given time-step. This must be avoided in model setup, which can be achieved by representing flows with states that propagate forward through the model graph (i.e. "effort" variables such as voltage or potential in a bond graph representation) and states that propagate backwards through the model graph (i.e. "flow" variables such as current or rate).
Dynamic fault propagation is the evolution of states in the model over time necessary to quantify resilience as a timedependent property of a system. The implementation of dynamic fault propagation used here is illustrated in Figure 6. As shown, the static propagation procedure is iteratively run over a set of time-steps from fault injection time to the end of the simulation time. To fully assess resilience, a history is kept of all of the states of the model (flow values, function state values, faults in functions, etc.) over the set of time-steps. Based on a simulation like this, one can then quantify metrics for the simulation such as dynamic costs, recovery time, or worst state over time.

Fault Injection
Depending on the scope of the analysis, one might be interested in injecting faults in different ways. Before injecting faults, it is important to determine that the model performs as expected by simulating the system without any faults. Then, while setting up faults and fault behaviors (and in systems with single faults), one can propagate a single fault at a given simulation time to verify that the simulated behavior matches the expected behavior for that fault. Once faults are encoded, the list of faults can be propagated in the system at times defined in the model. While this approach lets one see the consequences of faults injected at set times, it may not be for calculating expected resilience metrics, since it neglects joint fault scenarios and when in time faults are most probable.
To quantify the mathematical expectation of fault-injection based resilience models, the SampleApproach class can be used to define the set of fault scenarios to propagate in the model, as illustrated in Figure 7. This class uses the dynamic probability model set up in the model, functions, and components, along with user-defined parameters to create a list of fault scenarios which will be used to represent the statistical expectation of the defined faults. This approach can be defined over the set of faults to include, the number of joint faults to inject and the probability model for the joint faults (e.g. assuming independence or a conditional probability), and the times to inject the faults at. The set of injection times is determined by two main parameters: the phases defined in the model (and opportunity vectors for each fault in the probability model), and the set of times within each phase. Within each phase, these times can be specified as every discrete timestep, an evenly-spaced approach with a set number of points, a randomly-spaced approach with a set number of points, or an approach using a given quadrature defined in the quadpy software package (Schlömer, Papior, Ancellin, & Arnold, 2020). Additionally, Sample Approaches can be refined post-hoc based on a set of simulation results to represent the behaviors with a small set of sample points. Using these approaches, one quantify expected metrics iteratively with as few fault simulations as possible.

Resilience Analysis and Visualization
Using the models defined in Section 3.1 and simulations in Section 3.2, one can then perform analyses on the results. fmdtools provides a number of different methods using existing Python libraries, including matplotlib (Hunter, 2007), networkx (Hagberg, Swart, & S Chult, 2008), and pandas (pandas development team, 2020) to makes sense of the fault behaviors modelled in the system. To assist with this analysis and visualization, the results of the simulations are processed to summarize the state of different aspects of the system as nominal or faulty. This process results in three categorizations for functions, flows, and components: nominal, when the data structure behaves as it does in the nominal state; degraded, when the data structure has a different behavior than it does in the nominal state; and faulty, when a component or function has a fault. This approach to result processing enables highlevel visualization of the status of model structures without the user defining bounds or conditions for each variable to be listed as faulty or degraded. Using this representation, one can make a number of plots of the model graph structure, system behaviors over time, and tabular summaries of results.

Graph Plots
To visualize the propagation of faults in the system, fmdtools provides a number of methods that display a graph view of the system using matplotlib (Hunter, 2007), networkx (Hagberg et al., 2008), and netgraph in the graph submodule. Graph views of the system enable one to see the structure of the model as well as desired states or properties of the functions and flows. Two main graph representations can be plotted: a default graph representation where the flows are plotted as edges between function (which are nodes) and a bipartite graph representation where both functions and flows are nodes and edges are the associative relationships between them. To visualize the model graph in an intuitive layout, methods calling netgraph's InteractiveGraph are provided which enable one to place nodes manually (rather than relying on an algorithmic layout). While the default representation is more intuitive to interpret-especially for simple systems-the bipartite representation often makes a better use of space-especially when a flow is connected to more than two functions-because there is less more freedom to ensure that edges do not overlap.
Using the graph view, one can then visualize graph metrics as shown in Figures 8, the state of the model at the end-state or a particular time (or set of times) in the model history as shown in Figure 11, and various model run statistics defined by heatmaps (e.g. expected degradation time, maximum number of faults, etc.). These visualizations give one a view of how faults and behaviors propagate at the system level.

Dynamic plots
When modelling a dynamic system, it is often important and necessary to plot particular states over time in order to see how the behavior evolves over time. Methods in the plot sub-module use matplotlib (Hunter, 2007) to show the evolution of chosen states of the model over time, with (if the simulation was a run of a fault scenario) faulty states overlaid over the nominal states over time to aid assessment of the fault-induced behavior, as shown in Figure 12. In addition to modelling system behavior in a particular modelling scenario, time-based plots also have the ability to visualize the cost responses given by the simulations at each injection time. This plot, shown in Figure 13, can be used visualize how the sample approach defined in Section 3.2.3 approximates the expected resilience costs by showing the rate over time for a particular fault, as well as the modelled cost and quadrature weight for each sample point. Since these plots are plotted in matplotlib, well-known commands and interfaces can be used to edit and save the plots.

Tables
Finally, it is often helpful to be able to view the results of fault simulations in tabular form. While simulation results are typically returned as nested dictionaries, fmdtools provides methods to view this information as a pandas (pandas development team, 2020) table to enable results processing and display. Based on the processed results, one can also make tabular summaries of simulations, such as the number of functions and flows degraded over time or in a particular simulation. Tables are most helpful for summarizing the results of a set of simulations, where they can provide an FMEA-style assessment of the functions and flows affected as well as the rate, cost, and expected cost of each fault simulation, as shown in Table 3, which can be generated to delineate between or summarize fault effects over each phase. Since these tables are implemented in pandas, existing interfaces can then be used to display and/or save results (e.g. as a .csv).

EXAMPLE -DRONE MODEL
This section illustrates how one can use the fmdtools software package to aid the conceptual design of a real system by providing analyses that increase in fidelity and detail with the design process. A number of examples are provided in the repository, including a conceptual model of a pump, a dynamic modelling of virus spreading, a human-operated tank system, and a static model of an electric power system.
This example considers the design of a multi-rotor drone which must fly to a given location and return to its destination. The functional model of this system is shown in Figure 11, which ASPL Modularity Robustness Coefficient 1.44 0.12 95.85 Table 2. Network metrics for default (function) network representation of example drone model. Figure 8. Visualization of high degree nodes in default (function) network representation of example drone model.
includes the rotor lines, structure, electrical power source and distribution, and path planning of the system. In this example, we first quantify metrics about the system structure, then use a static representation to generate a high-level FMEA and visualize fault propagation, then create a dynamic version of the model to visualize fault behaviors over time and quantify the effects of injecting faults at different simulation times, and finally use a hierarchical model to compare the dynamic fault responses and resulting resilience of different component architectures.

Network Representation and Analysis
First, the model is analyzed using network metrics and algorithms. In fmdtools, it is possible to analyze various network representations of the model, although only one network representation will be shown in each step. Each network analysis function has options to analyze the default (function) network, the bipartite (function-flow) network, the parameter network, or the component network. The bipartite network is treated as a unipartite-like network for analysis (Haley et al., 2016). Analysis of the various network perspectives provides a more complete understanding of the model's resistance to failure.
The network metrics for the function network are given in Table 2. Low (close to zero) modularity indicates a high degree of interconnectivity and has been correlated with high robustness (Walsh et al., 2019). High robustness coefficient and low ASPL indicate high robustness to random node failure. High degree nodes are highlighted in red in Fig. 8. Nodes with high degree, or hubs, are more likely to cause significant performance degradation if in a fault state. Based on this analysis of Figure 9. Degree distribution of default (function) network representation of example drone model. Figure 10. SFF model applied to function network representation of example drone model.
the system topology, we can conclude that the functions with the most opportunity to affect other functions at a topological level are path planning, control systems, and the structure of the drone, since these functions have the most connections with other subsystems. Identification of these vulnerabilities is similar to the identification of high severity failures within an FMEA. The degree distribution of the function network is presented in Fig. 9. The relatively homogeneous degree distribution in Fig. 9 indicates that the network is not particularly robust to failure of critical nodes. The SFF model for the function network is provided in Fig. 10. This model demonstrates the system's resilience (based on parameter topology) to a cascading failure, given a user defined failure rate, fix rate, and start node (first node to fail). If various design alternatives are being considered, their relative resilience can be compared using the SFF model.

Static Representation and Analysis
To identify how faults lead to failures and begin to quantify risk in the system, the model is elaborated with flow attributes, function states, and failure logic, creating a static propagation model. As modelled in this system, deviations in the input of Figure 11. Static fault effects to the motor breaking: the drone crashes. Figure 12. Dynamic behaviors of a motor breaking the Trajectory function block lead to a crash, which in turn propagates faults through the structure to initiate faults in the rest of the functions. Using this model (and an underlying fault probability model), one can create a FMEA-like table of fault effects, rates, and costs, as shown in Table 3, to show which faults have the highest impact on the design. As shown, most faults in this model lead to a crash, making the chosen rate the driving factor of expected cost. One of the faults under consideration in this model is a mechanical break causing the AffectDOF function to lose the ability to control the degrees of freedom of the drone. This fault is visualized in Figure 11, showing how fault leads to a crash and in turn faults in the other functions. This fault will be used to motivate analysis and design through the rest of this example.

Dynamic Representation and Analysis
In the dynamic model, the drone is given dynamic states and behaviors which iterate over time-in this case the position, velocity, charge, and perceived location of the system. This can then be used to model how the system behaves in faulty Figure 13. Modelled cost over time of the motor breaking and nominal scenarios as shown in Figure 12 for the mechanical break fault. In the nominal case, the drone flies out to a location and returns to land, while in the faulty case the system crashes soon after the fault is injected, leaving it far from the landing location.
While this single-fault injection can help one understand how the system behaves, a fault injection approach can be used to quantify the expected effects of a fault which could occur at different points in the simulation. This is shown in Figure  13. As shown, the cost is high in the ascent and forward flight phase ($134K-$184K) since the fault then leads to a crash, and low in the descent phase ($500), since the drone has already landed. Additionally, the cost increases in the middle of the forward flight phase since the system crashes far from its landing location, which incurs additional cost in the model. While these results are consistent with the results of the static model at the point in time considered in the model (forward flight), they also show how a higher-fidelity dynamic model model can elicit a more nuanced consideration of fault effects.

Hierarchical Representation and Analysis
Given the effects of failures in this system, it is important to consider how they can be mitigated through the component architecture. In the drone model, for example, one has the ability to consider whether to realize the AffectDOF function with a quadrotor, hexarotor, or octorotor architecture. This example considers the quadrotor and octorotor architectures.
To compare how each architecture adapts to faults, the dynamic behaviors over time can be plotted. When considering the break fault, the quadrotor architecture reacts identically to the fault as in Figure 6, since losing one motor causes the system to lose stability. However, when the drone has an octorotor architecture, the system behaves as shown in Figure  14, faltering due to lost thrust but ultimately recovering and landing in the desired location. Thus the octorotor architecture is more resilient to this fault scenario.  Table 3. Automatically-Generated Scenario-Based Static FMEA from model Figure 14. Fault behavior of octorotor However, to make a decision about component architectures on the basis of resilience, the expected cost of all fault scenarios for both architectures must be compared and weighted against the operational and implementation costs. To quantify this cost, each of the faults associated with the realized function (AffectDOF) are injected in the model according to a sampling approach. In this case, while the octorotor component architecture mitigates a number of scenarios due to the increased system redundancy, it does not mitigate every fault (e.g. control errors), and in fact increases the chance of other faults (e.g. electrical problems that propagate to the battery) because the larger number of components provides more opportunities for the fault to occur. Statistics from this fault approach are shown in Table 4. As shown, while the number of scenarios increases in the octorotor architecture, the number of scenarios which lead to a crash (and overall crash rate) is much lower, resulting in a lower overall resilience cost. This shows how fmdtools can be used to assist resilient design decision-making in the early design process.

CONCLUSIONS
Computational environments for resilience-based design can enable the tractable modelling, simulation, and analysis of resilience in early design that can directly inform early design decisions. As presented in Section 3, the fmdtools project realizes this goal by providing a set of model classes, simulation methods, and analyses that reduce the complexity of the resilience modelling task while keeping a high level of expressiveness needed to fully represent system fault behaviors. This is because analyses and simulation methods are already implemented in the environment in a way that integrates with the underlying model representation, meaning that, to analyze resilience, the user only needs to define a model and run the corresponding method. As demonstrated in Section 4, this approach supports the analysis of the system as the design progresses from a low-detail network representation to a higher-fidelity dynamic component model. In this demonstration, the automated generation of network visualizations and metrics, behavior-over-time plots, cost-based FMEA, and expected cost quantification informed the understanding of the resilience of the system and enabled its consideration in early architectural design decisions. Because this work was implemented in the open-source Python code, this general design environment can be easily modified and extended as needed to fulfill future research needs.
While the usefulness of this work is apparent from the demonstrations shown here, there are a number of possible extensions that would increase its practicality in design. First, safety is an important aspect of resilience that has specific requirements not explicitly taken into account in this work. That is, while this work is concerned with quantifying the costs of faults, safety procedures require one to quantify the overall cost of the entire set of failure scenarios, which often requires taking a deductive approach (ARP, 1996). Future work should address this by providing an approach to identify top-level failure events and quantify the risk of those events. Additionally, while the models used in this work are deterministic, failures can often have probabilistic effects that must be taken into account to accurately quantify overall risk. Future work should incorporate probabilistic state transitions into the model to enable non-deterministic fault propagation. Finally, while defining models directly in Python code increases model expressiveness, it forces one to use a stand-alone model and may make the toolkit difficult to use without the relevant programming knowledge.Future work should provide an interface for defining these models in an existing modelling tool-chain or model formalism (e.g. AADL) so the same model used by other design and analysis processes can be used to model resilience.

ACKNOWLEDGMENT
This material is based upon work supported by the National Science Foundation under Grant No. NSF CMMI #1562027. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
The fmdtools package and scripts used to produce this work are archived at (Hulse et al., 2021) under open licenses. Her research focuses on the design and analysis of complex engineered systems. Specifically, her approach draws from network theory, systems theory, and machine learning for risk and reliability analysis in early design.

Journal of Mechanical
Dr. Andy Dong is head of the Oregon State University School of Mechanical, Industrial, and Manufacturing Engineering and a professor of mechanical engineering. His research addresses strategy in the design and innovation of engineered products and systems. His research aims to explain the impact of design strategy on productivity and the betterment potential of new products. His background in artificial intelligence in design has led him to collaborative work across a wide range of top-ics in behavioral economics, cognition, and computational fabrication. Considered around the world as an expert in design strategy, he has provided advice to major international telecommunications, financial services, and civil aviation companies.