Automated Contingency Management for Water Recycling System

To enable effective management, planning, and operations for future missions that involve a crewed space habitat, operational support must be migrated from Earth to the habitat. Intelligent System Health Management (ISHM) technologies promise to increase the safety and mission success of future space habitats while minimizing operational risks. In this paper, the Water Recycling System (WRS) deployed at NASA Ames Research Center's Sustainability Base is used for verification and validation of the proposed solution. Our work includes the development of a WRS simulation model based on its dynamic physical characteristics and the design of an Automated Contingency Management (ACM) framework that integrates fault diagnosis and optimization. For WRS modeling, a nominal model with fault injectors is developed. Fault detection and isolation techniques are then developed to isolate the causes and identify the severity of faults. Dynamic Programming (DP) based fault mitigation strategies are designed to accommodate faults in the system. A series of simulations with different fault modes is presented, and the results indicate that the proposed ACM system can mitigate faults in the WRS optimally with respect to energy consumption and the effects of the fault.

Figure 1. NASA Ames Sustainability Base (Poolla et al., 2015).
Several reported works use data-driven approaches to analyze data from the NASA Ames Sustainability Base, shown in Fig. 1, as a testbed for Deep Space Habitats. Poolla et al. (Poolla et al., 2015) trained an artificial neural network on on-site sensor data from the photovoltaic (PV) system. Basak et al. (Basak, Hosein, Mengshoel, and Martin, 2016) integrated dimensionality reduction and Bayesian network structure learning with a MATLAB-based adverse condition detection tool called ACCEPT to detect thermal discomfort of occupants. Iverson et al. (Iverson et al., 2012) used a distance-based anomaly detection method to monitor parameter values in space operations, including International Space Station flight control, satellite vehicle system health management, launch vehicle ground operations, and fleet supportability. Martin et al. (Martin, Schwabacher, & Oza, 2007) compared several unsupervised anomaly detection algorithms on Space Shuttle Main Engine (SSME) data.
There are also several model-based diagnosis and prognosis approaches designed for the Environmental Control and Life Support System (ECLSS). Roychoudhury, Hafiychuk, and Goebel (2013) model the WRS deployed at NASA Ames Research Center's Sustainability Base and design a diagnosis and prognosis approach for it. The limitation of this work is that it focuses only on diagnosis and prognosis when a fault happens. After a fault is diagnosed, no fault-tolerant control method is provided to accommodate it.
This research aims to develop advanced ACM for Life Support Systems (LSS) in a deep space habitat. The WRS, which collects condensate from the air and used water and recycles them into drinkable and usable water, is one of the critical subsystems in LSS. In this research, the WRS in Sustainability Base is employed as a reference to build the WRS model. To accommodate faults in the WRS, an automated contingency management framework is developed. Different fault modes, both discrete and continuous, are injected into the WRS. Faults are detected by a Lebesgue sampling based Extended Kalman filter (LS-EKF) approach (Yan & Zhang, 2014). With the fault state estimate, Dynamic Programming is used to optimize energy consumption and maintain the WRS in a degraded but acceptable operating condition. A series of simulations is conducted to demonstrate the effectiveness of the proposed method. The ACM strategy developed in this research is application agnostic and can be applied more generally to other LSS subsystems, such as the power subsystem, waste processing, and biomass processing, and to other NASA systems for outer space missions.
The paper is organized as follows. Section 3 describes the dynamic physical characteristics of the WRS and its modeling. Section 4 describes the ACM framework and its functions. Section 5 illustrates a case study that uses Dynamic Programming as an optimal control method to mitigate the effect of a fault. Section 6 provides concluding remarks and future work.

WATER RECYCLING SYSTEM MODELING
The ECLSS (Wieland, 1998) includes atmospheric resource management; airborne particulate matter removal and disposal; water recovery systems; waste management; fire protection systems; and environmental monitoring. The WRS plays a critical role in the ECLSS.
The WRS in the Sustainability Base collects wastewater from sinks and showers and recycles it into clean water. Fig. 2 shows a schematic diagram of the WRS, which consists of tanks, pumps, pipes, filters, and forward osmosis (FO) and reverse osmosis (RO) modules. The WRS consists of two primary subsystems, namely the FO system and the RO system. For outer space missions, a WRS can reduce water consumption and extend the duration of NASA missions.
Figure 2. Diagram of Water Recycling System (Roychoudhury, Hafiychuk, and Goebel, 2013).
During service, components such as filters, pumps, and pipes degrade, for example through clogging of filters, corrosion of pumps, and fatigue, fracture, and cracking of pipes. These degradations result in system performance degradation and, if not detected or maintained, eventually lead to system failure (breakage of a pipe or pump, or complete clogging). Therefore, diagnosis and contingency management of the WRS is of vital importance to the efficiency, reliability, and safety of the ECLSS. In the past, diagnosis results were often used in maintenance as part of a reactive strategy that utilizes only the current fault information. As a result, a health management method developed this way may not be optimal over a long period of time. By integrating real-time prognosis, which predicts the fault state at future times and estimates the remaining useful life (RUL), the reactive strategy can be upgraded to a proactive one, which leads to long-term optimization and more reliable and economical maintenance activities.
When a fault is detected, it is desirable that the system has fault-tolerant capabilities to alleviate the fault or extend the life of the system. For outer space missions, where maintenance is not available, capabilities such as automatic reconfiguration, fault-tolerant control, and health management are significant to the safety of the WRS and the crew.
A model has been established for the WRS shown in Fig. 2. Since modeling is not the focus of this paper, the model is omitted here; more details can be found in (Tang et al., 2018) and (Roychoudhury et al., 2013).

AUTOMATED CONTINGENCY MANAGEMENT
At each discrete time step $k$, the healthy nominal system model and the faulty system (with a fault injected) share the same input $u(k)$. The measurement, denoted by $y(k)$ for the healthy system model and $\tilde{y}(k)$ for the faulty system, will differ. The fault detection algorithm uses the difference between $y(k)$ and $\tilde{y}(k)$, known as the residual $r(k) = y(k) - \tilde{y}(k)$, to detect whether a fault has occurred in the system. Once a residual reaches the threshold, the fault isolation algorithm distinguishes a sensor fault from a component fault by the number of residuals that reach the threshold. If only one residual reaches the threshold, a sensor fault is declared. The faulty sensor is taken offline and a replacement sensor is brought online.
If several residuals reach the threshold at almost the same time, a component fault is declared. In the component fault case, the fault mitigation algorithm is executed to generate a new control signal that accommodates the fault and reduces its impact.
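The isolation rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `isolate_fault`, the per-channel thresholds, and the numeric values are hypothetical, and "at almost the same time" is simplified to "within the same time step".

```python
from typing import Sequence


def isolate_fault(healthy: Sequence[float], faulty: Sequence[float],
                  thresholds: Sequence[float]) -> str:
    """Classify one time step as 'none', 'sensor', or 'component'.

    healthy/faulty hold the measurements of the nominal model and of the
    monitored system; thresholds are hypothetical per-channel limits.
    """
    residuals = [y - y_f for y, y_f in zip(healthy, faulty)]
    exceeded = sum(abs(r) >= th for r, th in zip(residuals, thresholds))
    if exceeded == 0:
        return "none"
    if exceeded == 1:
        return "sensor"       # a single residual crossed its threshold
    return "component"        # several residuals crossed together


nominal = [45.0, 120.0, 5.0]  # hypothetical nominal measurements
limits = [1.0, 2.0, 0.5]      # hypothetical residual thresholds
print(isolate_fault(nominal, [45.2, 120.5, 5.1], limits))  # none
print(isolate_fault(nominal, [45.2, 126.0, 5.1], limits))  # sensor
print(isolate_fault(nominal, [42.0, 126.0, 5.1], limits))  # component
```

A practical implementation would likely count threshold crossings over a short time window rather than a single step, so that residuals crossing a few samples apart are still treated as simultaneous.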
Two fault mitigation methods are proposed in this paper. The first uses Dynamic Programming to minimize the control effort while keeping every state within predefined constraints. The second uses Lebesgue sampling based diagnosis to estimate the severity of the fault and then a proportional-integral-derivative (PID) controller to mitigate it. Fig. 3 shows the preliminary framework of automated contingency management. Details of the modeling and the structure can be found in (Tang et al., 2018) and (Roychoudhury et al., 2013).

Fault Detection
Diagnosis aims to monitor the health state of a component and to detect faults or anomalies from its measurements. In Bayesian estimation theory, states are described by probability density functions (pdfs). Diagnosis is conducted by comparing the baseline distribution with the real-time fault state estimation distribution, as illustrated in Fig. 4 (Yan, Zhang, Wang, Dou, & Wang, 2016). The baseline distribution (the green distribution) is obtained from measurements taken when the component is in healthy condition.
For a false alarm rate of 5%, the detection threshold is defined by the 5% tail of the baseline distribution. While the diagnostic algorithm is executing, it takes a measurement and computes a real-time estimate of the fault state distribution (the red distribution). If 90% (the predefined confidence level of detection) of the real-time pdf lies beyond the detection threshold, given by the vertical blue line, a fault is declared with a 5% false alarm rate and 90% confidence.
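The detection criterion can be illustrated with a short sketch. It assumes that both distributions are Gaussian and that the fault indicator decreases under a fault, so "beyond the threshold" means "below" it; the function name and all numeric values are hypothetical.

```python
from statistics import NormalDist


def fault_detected(baseline: NormalDist, realtime: NormalDist,
                   false_alarm: float = 0.05, confidence: float = 0.90):
    """Return (decision, threshold, mass of the real-time pdf beyond the threshold)."""
    # Threshold: the lower tail of the healthy baseline holds `false_alarm` mass.
    threshold = baseline.inv_cdf(false_alarm)
    # Share of the real-time fault-state pdf that lies below the threshold.
    mass_beyond = realtime.cdf(threshold)
    return mass_beyond >= confidence, threshold, mass_beyond


baseline = NormalDist(mu=1.0, sigma=0.01)   # hypothetical healthy indicator
early = NormalDist(mu=0.995, sigma=0.01)    # slight deviation
late = NormalDist(mu=0.96, sigma=0.01)      # clear deviation
print(fault_detected(baseline, early)[0])   # False
print(fault_detected(baseline, late)[0])    # True
```

For an indicator that increases under a fault, the same logic would apply with the upper tail of the baseline and `1 - realtime.cdf(threshold)`.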

Lebesgue Sampling based Fault Identification
The objective of fault diagnosis is to estimate the severity of the fault, which can be used to alert the crew and the automatic control system. Since computational resources can be limited on a Deep Space Habitat, the Lebesgue sampling (LS) based fault diagnosis method is applied.
Different from the traditional Riemann sampling (RS) framework, the LS method divides the state axis by predefined states (also called Lebesgue states). These Lebesgue states are determined by the number of Lebesgue states and the range of the feature value. The diagnosis is triggered only when the fault state, which is reflected by the fault indicator extracted from the raw measurement, changes from one Lebesgue state to another, or when an event happens (Yan et al., 2016). To illustrate the concept of LS, a degradation curve is shown in Fig. 5. With this consideration, in the slowly degrading range R1, more computational resources can be assigned to other tasks while only a few are needed for diagnosis. In the faster degrading range R2, more resources are assigned to diagnosis tasks so that the fault dimension can be tracked accurately.
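The event-triggered nature of LS can be sketched as follows. The degradation curve below is a hypothetical piecewise-linear stand-in for the curve in Fig. 5 (slow growth up to cycle 780, fast growth afterwards), and the function name and the number of Lebesgue states are illustrative assumptions.

```python
def ls_trigger_cycles(indicator, low, high, num_states):
    """Cycles at which the indicator enters a different Lebesgue state."""
    length = (high - low) / num_states        # fixed Lebesgue state length
    triggers, previous = [], None
    for cycle, value in enumerate(indicator, start=1):
        cell = min(num_states - 1, max(0, int((value - low) / length)))
        if previous is not None and cell != previous:
            triggers.append(cycle)            # diagnosis would run here
        previous = cell
    return triggers


# Hypothetical fault dimension over 1000 cycles.
indicator = [0.1 * c / 780 if c <= 780 else 0.1 + 0.9 * (c - 780) / 220
             for c in range(1, 1001)]
triggers = ls_trigger_cycles(indicator, low=0.0, high=1.0, num_states=10)
in_r1 = sum(1 for c in triggers if c < 780)
in_r2 = len(triggers) - in_r1
print("RS executions:", len(indicator))
print("LS executions in R1:", in_r1, "and in R2:", in_r2)
```

With RS, the diagnosis would run at each of the 1000 cycles, whereas the LS trigger fires only a handful of times, and those executions are concentrated in the range where the fault grows quickly.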

Dynamic Programming
Dynamic Programming (DP) is an algorithmic paradigm that solves a complex problem by breaking it into sub-problems and storing the results of the sub-problems to avoid computing the same results again. DP applies to problems that have two main properties: overlapping sub-problems and optimal sub-structure.
Regarding the first property, DP is mainly used when solutions of the same sub-problems are needed repeatedly. In DP, computed solutions to sub-problems are stored in a table so that they can be reused directly when the same sub-problems recur. Therefore, DP is not useful when there are no common (overlapping) sub-problems. Regarding the second property, a problem has optimal sub-structure if an optimal solution of the problem can be obtained from optimal solutions of its sub-problems.
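Both properties can be seen in a toy staged-effort problem, which is not the WRS model: a level must be raised from 0 to a target in a fixed number of stages, and each stage adds 0, 1, or 2 units at a hypothetical cost. The sketch uses memoization to store sub-problem results; all names and cost values are illustrative assumptions.

```python
from functools import lru_cache

STEP_COST = {0: 0.0, 1: 1.0, 2: 4.0}  # hypothetical effort per increment


def min_effort(stages: int, target: int):
    """Return (minimum total effort, number of stored sub-problems)."""

    @lru_cache(maxsize=None)
    def best(stage: int, level: int) -> float:
        # Sub-problem: cheapest way to finish from (stage, level).
        if stage == stages:
            return 0.0 if level == target else float("inf")
        # Optimal sub-structure: best choice now plus best continuation.
        return min(cost + best(stage + 1, level + inc)
                   for inc, cost in STEP_COST.items())

    result = best(0, 0)
    return result, best.cache_info().currsize


effort, stored = min_effort(stages=10, target=12)
print(effort, stored)  # 16.0 121
```

Enumerating every control sequence would require 3^10 = 59049 evaluations, while the table holds only 121 distinct sub-problems, which is the saving that overlapping sub-problems make possible.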
In our ACM system, the ACM optimization needs to be solved recursively with the estimated fault state for a given fault mode. Once an optimal ACM strategy is generated, the optimization keeps the same structure until the strategy is changed. Therefore, DP suits the ACM optimization well. It is also worth mentioning that there are many similar components in the LSS, such as pumps, filters, and motors. When a fault happens to these components, the optimization problem is defined similarly, which indicates that DP can also be used in high-level optimization.
However, because the computation cost of DP increases exponentially with the number of states in the optimization, DP is limited to subsystems of the WRS. The same strategy can be applied to other subsystems so that overall optimization can be achieved. The WRS simulation model has 11 states, and the states related to a pump-filter subsystem are used for the case study. The states are selected based on the criterion that, when a component fault happens, only a few states deviate from the nominal condition while the rest remain the same over a short period.

CASE STUDY I: DYNAMIC PROGRAMMING BASED FAULT MITIGATION
As mentioned earlier, a pump-filter subsystem is used in this section to illustrate and demonstrate the proposed ACM system and DP optimization.

Fault Mode
In this case study, the Filter2 clogging fault scenario is studied. As shown in Fig. 6, the parameter that indicates the hydraulic resistance to flow through Filter2 is directly related to the health condition of Filter2 and is used as the health indicator. According to its dynamics, it is assumed that, when clogging happens, the value of this health indicator decreases as a quadratic function of time.
As for the LS-EKF method, a state model and a measurement model are used for fault detection and identification (Tang et al., 2018). The EKF is described here to make this section self-contained. Suppose the fault dynamics are described by the following nonlinear model:
$$x_k = f(x_{k-1}, u_k) + w_k$$
where $x_k$ is the state to be estimated, $f(\cdot)$ is the nonlinear function of the states, $u_k$ is the input at time $k$, and $w_k$ is zero-mean Gaussian noise with covariance matrix $Q_k$.
The observation model that describes the relationship between the state $x_k$ and the measurement $y_k$ is given by:
$$y_k = h(x_k) + v_k \quad (2)$$
where $h(\cdot)$ is the measurement function of the state and $v_k$ is zero-mean Gaussian noise with covariance matrix $R_k$.
For the EKF, the Jacobians of $f(\cdot)$ and $h(\cdot)$ need to be calculated, which are given by:
$$F_k = \left.\frac{\partial f}{\partial x}\right|_{\hat{x}_{k-1|k-1},\,u_k}, \qquad H_k = \left.\frac{\partial h}{\partial x}\right|_{\hat{x}_{k|k-1}} \quad (3)$$
Then, the prediction step calculates the mean and covariance of the prior distribution by using the following equations:
$$\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_k), \qquad P_{k|k-1} = F_k P_{k-1|k-1} F_k^{T} + Q_k$$
where $\hat{x}_{k|k-1}$ is the mean of the prior distribution and $P_{k|k-1}$ is the covariance matrix of the predicted state.
When the measurement becomes available, the correction step uses it to calculate the posterior distribution with the following equations:
$$e_k = y_k - h(\hat{x}_{k|k-1}), \qquad S_k = H_k P_{k|k-1} H_k^{T} + R_k, \qquad K_k = P_{k|k-1} H_k^{T} S_k^{-1}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k e_k, \qquad P_{k|k} = (I - K_k H_k) P_{k|k-1}$$
where $e_k$ is the measurement residual, $S_k$ is the residual covariance, $K_k$ is the near-optimal Kalman gain, $R_k$ is the covariance matrix of the observation noise, $P_{k|k}$ is the updated covariance estimate, $\hat{x}_{k|k}$ is the updated state estimate, and $I$ is the identity matrix.
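The prediction and correction steps can be sketched for a scalar state as follows. This is a generic textbook EKF step, not the LS-EKF implementation of the paper; the drift model `f`, the quadratic measurement function `h`, and all numeric values are hypothetical, and the measurement is kept noise-free so that the demo is deterministic.

```python
def ekf_step(x_est, p_est, u, y, f, f_jac, h, h_jac, q, r):
    """One scalar EKF iteration: prediction followed by correction."""
    # Prediction: propagate the mean and covariance through the state model.
    x_pred = f(x_est, u)
    big_f = f_jac(x_est, u)
    p_pred = big_f * p_est * big_f + q
    # Correction: use the measurement residual to update the prediction.
    big_h = h_jac(x_pred)
    residual = y - h(x_pred)
    s = big_h * p_pred * big_h + r           # residual covariance
    gain = p_pred * big_h / s                # Kalman gain
    x_new = x_pred + gain * residual
    p_new = (1.0 - gain * big_h) * p_pred
    return x_new, p_new


# Hypothetical models: the state drifts by the input, the sensor sees its square.
f = lambda x, u: x + u
f_jac = lambda x, u: 1.0
h = lambda x: x * x
h_jac = lambda x: 2.0 * x

x_true, x_est, p_est = 1.0, 0.9, 0.1
for _ in range(200):
    x_true = f(x_true, -0.001)
    y = h(x_true)                            # noise-free measurement
    x_est, p_est = ekf_step(x_est, p_est, -0.001, y, f, f_jac, h, h_jac,
                            q=1e-6, r=1e-4)
print(abs(x_est - x_true))
```

In the LS framework, a step like this would be executed only when the fault indicator moves to a different Lebesgue state rather than at every sample.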
The first task is to detect whether the filter has clogged. This is done by comparing the baseline distribution of the health indicator (obtained from simulation) against its real-time distribution, as shown in Fig. 7, which shows the real-time estimation pdfs at two different time instants. To make the description clear, the health indicator without clogging is normalized to 1. With the baseline distribution and a 5% false alarm rate, the fault detection threshold is 0.9984, which is indicated by the blue vertical line. The mean value of the real-time distribution at the 601st min is 0.9852 and its 95% confidence interval is [0.9695, 1.0009]. The probability of detection is set to 0.95, and more than 95% of the real-time distribution is below the blue line, so a fault is declared at the 601st min. Note that the results shown in Fig. 7 are available at every time instant to reveal the fault state pdf, which can be used in ACM.
Figure 7. The fault detection results.
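As a rough plausibility check of the reported numbers, and assuming that the real-time distribution at the 601st min is Gaussian (an assumption made here only for illustration), the standard deviation can be recovered from the 95% confidence interval and the mass below the threshold evaluated with the standard normal cdf $\Phi$:

```latex
\sigma \approx \frac{1.0009 - 0.9695}{2 \times 1.96} \approx 0.0080, \qquad
z = \frac{0.9984 - 0.9852}{\sigma} \approx 1.65, \qquad
P(x < 0.9984) = \Phi(z) \approx 0.950
```

Under this assumption the mass below the threshold sits right at the 0.95 detection level, which is consistent with the fault being declared at this time instant.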

Dynamic Programming for Optimal Control
The objective of DP is to minimize the control effort while maintaining each state within its constraints. In this paper, DP is implemented and evaluated qualitatively.
The pump-filter subsystem used in DP is shown in Fig. 8. The benefit of DP based fault mitigation is that the constraints on each state and on the final state can be adjusted based on the physical limitations (thresholds) of the component or system and the needs of the crew.

Problem Definition
When the filter clogging fault is isolated, the objective is to minimize the control effort, or energy cost, of Pump4. DP is used for optimal control of this subsystem.
When a clogging fault occurs in Filter2, the pressure of Filter2 increases. As a result, the outflow rate at Pump4 and the outflow rate of Filter2 decrease. The model also includes the water transferred from the FO Module to Feed Tank 2.
Therefore, based on these measurements and states, the filtering subsystem has three states and three measurements. The optimal control problem is to find an admissible control sequence $u_k$, $k = 0, 1, \ldots, N$ (where $N$ indicates the final time instance), such that the cost function is minimized and the constraints are satisfied (Elbert, Ebbesen, & Guzzella, 2013), as shown in Eq. (6).
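For orientation, a discrete-time optimal control problem of this kind is commonly written in the following generic form. This is a sketch of the standard formulation rather than a transcription of the paper's equation; the symbols are chosen to match the quantities described with Fig. 8 (final cost $g_N$, stage cost $g_k$, system function $F_k$, target set $T$, state grid $X_k$, input set $U_k$), and both the initial-state symbol $x_{\mathrm{init}}$ and the convention that the stage index runs to $N-1$ are assumptions.

```latex
\min_{u_0, \ldots, u_{N-1}} \; J = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k)
\quad \text{subject to} \quad
x_{k+1} = F_k(x_k, u_k), \;\;
x_0 = x_{\mathrm{init}}, \;\;
x_N \in T, \;\;
x_k \in X_k, \;\;
u_k \in U_k
```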

Dynamic model of the filtering system
To implement the DP algorithm efficiently, the filtering system in the FO module shown in Fig. 8 is used as a case study. The filtering subsystem is described by ordinary differential equations (ODEs), with one state constrained to [0, 10], the pressure of Pipe2 constrained to [0, 160] psi, the pressure of FO Module 1 constrained to [40, 50] psi, and the normalized Pump4 input constrained to [1.0, 1.2]. The constraint values for the optimization problem are selected based on our understanding of the system and can be adjusted according to the real system. The selection of these values, however, does not affect the implementation of the proposed solution.
Note that the constraint on the pressure of FO Module 1 is based on the assumption that, in this range, the water flow into FO Module 1 is considered to be accessible for the crew and other components. The constraints associated with Pipe2 are based on the assumption that Pipe2 can operate at pressures below 160 psi. Note also that the pressure of Feed Tank2 will not change significantly, because before Pump4 starts to work and Filter2 becomes clogged, much more water is transferred into Feed Tank2 than out of it. To simplify the problem, it is assumed that the increase of pressure in Feed Tank2 is mainly caused by the water transferred into Feed Tank2. Therefore, the state model for the Feed Tank2 pressure is simplified, Eq. (10), so that its rate of change is determined by the water transferred into Feed Tank2. The cost functional J to be minimized, Eq. (11), is the time integral of the Pump4 control effort.

Simulation Result
When the continuous Filter2 clogging fault is injected into the model, the diagnosis algorithm detects whether there is a fault, as shown in Fig. 7. Once the fault is detected, the DP based fault mitigation numerically evaluates the admissible control space over the optimization horizon and selects the control sequence with minimum cost as the control signal for fault mitigation.
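The backward evaluation and forward selection can be sketched on a deliberately simplified one-state problem. Everything specific below is a hypothetical stand-in for the pump-filter model: the pressure update in `step`, the grid sizes, the horizon, and the stage cost are assumptions made only for illustration, while the bounds of 40 to 50 psi and 1.0 to 1.2 follow the constraints quoted in the text.

```python
import math


def solve_dp(p0, n_steps, p_grid, u_grid, step, stage_cost, eps=1e-9):
    """Backward DP on a uniform state grid, then a forward pass from p0."""
    n = len(p_grid)
    lo, hi = p_grid[0], p_grid[-1]
    spacing = (hi - lo) / (n - 1)

    def nearest(p):
        # Index of the closest grid point (nearest-neighbour lookup).
        return min(n - 1, max(0, round((p - lo) / spacing)))

    cost_to_go = [0.0] * n                    # final cost term taken as zero
    policy = [None] * n_steps
    for k in reversed(range(n_steps)):        # backward process
        new_cost = [math.inf] * n
        best_u = [None] * n
        for i, p in enumerate(p_grid):
            for u in u_grid:
                p_next = step(p, u, k)
                if p_next < lo - eps or p_next > hi + eps:
                    continue                  # violates the state constraint
                total = stage_cost(u) + cost_to_go[nearest(p_next)]
                if total < new_cost[i]:
                    new_cost[i], best_u[i] = total, u
        cost_to_go, policy[k] = new_cost, best_u

    states, controls = [p0], []
    for k in range(n_steps):                  # forward process
        u = policy[k][nearest(states[-1])]
        if u is None:
            raise ValueError("no admissible control from this state")
        controls.append(u)
        states.append(step(states[-1], u, k))
    return states, controls


def step(p, u, k):
    # Hypothetical model: extra pump input raises pressure, clogging lowers it.
    return p + 10.0 * (u - 1.0) - 0.1 * k


p_grid = [40.0 + 0.1 * i for i in range(101)]   # 40 to 50 psi
u_grid = [1.0 + 0.02 * i for i in range(11)]    # normalized input 1.0 to 1.2
states, controls = solve_dp(45.0, 20, p_grid, u_grid, step,
                            stage_cost=lambda u: u)
print(min(states), max(states), sum(controls))
```

In this toy setting the minimum-effort policy lets the pressure drift toward the lower bound and spends only as much pump input as is needed to stay above it. A more careful implementation would interpolate the cost-to-go between grid points instead of using the nearest point.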

Resolution
The state space must be discretized for the DP algorithm. The resolution of the state-space discretization is a critical factor for DP. As the resolution increases, the accuracy of the solution improves, but more computation effort is required.
Therefore, a study is carried out to quantify the accuracy of the solution obtained by DP for the pump-filter subsystem in the WRS. Fig. 10 shows the deviation of the optimal solution evaluated with DP for different state-space discretization densities. From the simulation result, we can conclude that when the state-space discretization density increases, the cost of the control input increases.
Figure 10. Cost consumption deviation with different state-space discretization.

CASE STUDY II: PID CONTROLLER BASED MULTI-STAGE FAULT MITIGATION
PID based fault mitigation is a static optimization strategy. Based on the severity of the fault, the fault mitigation method with a PID controller is divided into three stages.
1) At the first stage, when the fault is not severe, the control objective is to bring the outflow rate of Filter2 back to its normal condition.
2) At the second stage, when the fault becomes more severe, the control objective is to bring the outflow rate of Filter2 to a degraded performance level. In this study, 95% of the normal outflow rate is used as the reference.
3) At the third stage, when the fault becomes even more severe, the control objective is to bring the outflow rate to a further degraded performance level. In this study, 85% of the normal outflow rate is used as the reference. At this stage, the relief valve is opened to keep the pressure of Filter2 within an accessible range.
When the fault happens at the 500th min, the pressure of Filter2 begins deviating from the nominal system pressure. The outflow rate of Filter2 starts decreasing, which means that water production begins to drop. As a result, the outflow rate from Pump4 starts dropping too. Lebesgue sampling based diagnosis detects filter clogging at the 601st min. At this time, fault mitigation Stage I starts, and the PID controller follows the reference signal and brings the outflow rate of Filter2 back to the nominal condition. At the 803rd min, the reference signal can no longer be maintained because of the physical limitation of the pump (in this study, the limitation is assumed to be 115% of the operating voltage), and 95% of the normal outflow rate of Filter2 is used as the new reference signal at Stage II. The pressure of Filter2, however, is still increasing because of the degradation of Filter2. When the pressure becomes higher than 88 psi, which is the limit set to keep the WRS safe, the relief valve is opened to keep the pressure of Pump4 below this safety threshold. At this time, 85% of the outflow rate in normal condition is used as the new reference signal at Stage III.
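The stage-switching logic described above can be sketched as a small supervisory function that sits on top of the PID loop. The function and variable names are hypothetical; the switching conditions (115% of operating voltage, 88 psi) and the references (100%, 95%, 85% of the normal outflow rate) are taken from the text, and the assumption that stages only escalate is made here for illustration.

```python
REFERENCE = {1: 1.00, 2: 0.95, 3: 0.85}   # outflow reference per stage
PUMP_LIMIT = 1.15                         # fraction of operating voltage
PRESSURE_LIMIT_PSI = 88.0                 # safety limit on pressure


def next_stage(current_stage, pump_voltage_ratio, pressure_psi):
    """Escalate the mitigation stage; stages never move back down."""
    stage = current_stage
    if pump_voltage_ratio >= PUMP_LIMIT:
        stage = max(stage, 2)             # pump saturated: relax the reference
    if pressure_psi > PRESSURE_LIMIT_PSI:
        stage = max(stage, 3)             # pressure too high: relax further
    return stage


def mitigation_command(stage):
    """Return (outflow reference as a fraction of nominal, relief valve open)."""
    return REFERENCE[stage], stage == 3


stage = 1
for voltage, pressure in [(1.05, 70.0), (1.15, 80.0), (1.10, 85.0), (1.12, 89.0)]:
    stage = next_stage(stage, voltage, pressure)
    print(stage, mitigation_command(stage))
```

The reference returned by `mitigation_command` would be passed to the PID controller as its setpoint, and the boolean flag would drive the relief valve.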

COMPARISON OF DP AND PID-BASED ACM
The simulation results of the two approaches are compared, and their advantages and disadvantages are summarized in Table 1.

CONCLUSION
In this work, an automated contingency management solution is developed, and the WRS in NASA Ames Sustainability Base is used as a testbed for verification and validation. Lebesgue sampling based diagnosis is used for fault diagnosis. Dynamic Programming and a PID based fault mitigation strategy are introduced in the proposed ACM system for comparison studies.
For future work, we will seek access to real data sets of the photovoltaic and WRS systems for ACM verification. Meanwhile, the physical degradation characteristics of components in the WRS can be derived from regression methods. The existing NASA anomaly detection toolbox will also be integrated into the proposed system to introduce a data-driven approach for fault detection.
He has mechanical, computer and electrical engineering experience in robotics design, robotic technology, sensor fusion, A.I., and robotic automation system integration and development. Additional skills include development of computerized manufacturing specifications, new product refinement, device failure analysis and strong analytical and statistical reasoning skills in broad multidisciplinary system design, control, and production.
Dr. Martin is a senior researcher in the Intelligent Systems Division and acts as the Deputy Data Sciences Group Lead. Over the course of his 12+ years at NASA Ames Research Center, he has worked in the application areas of robotics, data mining for aviation safety and space propulsion, among other areas. He acts as the research lead and facility support manager for Sustainability Base, the agency's flagship LEED-platinum certified green building that acts as a living laboratory for testing advanced information and sustainable technologies. Life support and other safety critical systems that will be essential for missions in support of future advanced and fully sustainable exploration habitats can directly benefit from these efforts.

Figure 3. The framework of Automated Contingency Management.

Figure 4. Fault detection criteria and fault detection process.

Figure 5. Illustration of LS. (a) RS with a fixed time interval; (b) LS with fixed Lebesgue state length.
It is clear that the degradation in the range R1 = [1, 780] cycles is smaller than that in the range R2 = [780, 1000] cycles. Using the RS method with a fixed time interval, as shown in Fig. 5(a), the diagnosis algorithm is executed at each cycle regardless of whether it is necessary. The setting of the fixed time interval, ...
Fig. 6 illustrates the fault scenario when filter clogging occurs at t = 500 min.

Figure 8. Simplified diagram of subsystem of Water Recycling System (Roychoudhury et al., 2013).
Eq. (6) holds for all $k = 0, 1, \ldots, N$, where $g_N(x_N)$ is the final cost term and $g_k(x_k, u_k)$ is the stage cost, $x_0$ and $x_N$ are the initial state and final state, respectively, and $x_N$ is constrained by a target set $T$. Additionally, the input signals are constrained by the time-variant set $U_k$. The functions $F_k$ and $g_k$ are discrete-time representations of the dynamic system and the stage-cost function. At time $k$, the state space is discretized to the set $X_k = \{x_k^1, x_k^2, \ldots, x_k^q\}$, where the superscript $i$ in $x_k^i$ denotes the state variable in the discretized state-time space with time index $k$ and state index $i$. The control space is represented by the discrete set $U_k = \{u_k^1, u_k^2, \ldots, u_k^p\}$. When a fault happens, DP starts and the control space is evaluated by the DP backward process. The control sequence with minimum cost is then used by the DP forward process for fault mitigation.

Fig. 9 illustrates the simulation results of this case study, which include the pressure of Pipe2, the pressure of FO Module 1, and the input signal of Pump4. When Filter2 clogging occurs, the pressure of FO Module 1 decreases, which means that less water is being transferred into FO Module 1. The Filter2 clogging will affect other modules after FO Module 1 in the long run. Therefore, the pressure of FO Module 1 is constrained between 40 psi and 50 psi (represented by the red line in Fig. 9(b)), assuming that this range is appropriate for the crew and other systems.

Fig. 9(a) shows that, when the fault at Filter2 happens, the pressure at Pipe2 decreases (represented by the black line). With DP based fault mitigation, the pressure of Pipe2 increases within its constraint. In this case study, we assume that Pipe2 can operate at pressures below 160 psi.

Figure 9. Simulation results of Dynamic Programming based fault mitigation. (a) The pressure of Pipe2; (b) The pressure of Forward Osmosis Module 1; (c) The input control signal.
As shown in Fig. 9(b), when the fault happens, the pressure of FO Module 1 decreases quadratically (represented by the black line), which indicates that much less water is transferred into the FO Module than in the normal condition (represented by the green line). The magenta line represents the results after DP based fault mitigation is implemented, with the system operated in a degraded but accessible condition.

Fig. 9(c) shows the control signal for Pump4. In this case study, the magnitude of the input signal for Pump4 is normalized to 1, and 1.2 represents the maximum input for Pump4 (restricted by its physical limitation).

Figure 11. Simulation result of PID controller based multi-stage fault mitigation.

Fig. 11 shows the results of fault mitigation. In these figures, the first red vertical line indicates that the fault happens at t = 500 min. The second red vertical line indicates that the fault is detected at t = 601 min. At this time, the PID controller starts working, which indicates that fault mitigation Stage I starts. The third red vertical line indicates that fault mitigation Stage II begins at t = 804 min. The fourth red vertical line indicates that fault mitigation Stage III begins at t = 900 min. Fig. 11(a)-(d) show the outflow rate of Filter2, the outflow rate of Pump4, the pressure of Filter2, and the input to Pump4, respectively.
Dr. Moore is a member of the Integrated Systems Health Management and Automation branch which is part of the Flight Mechanics and Analysis Division at Marshall Space Flight Center in Huntsville Alabama.For the last 12 years he has supported the branch in its development of ISHM software for the Ares and SLS rockets.For the previous 17 years he has worked in material science as part of the Microgravity Division of the Science Directorate.His work involved theoretical chemistry calculations on novel semiconductor and nonlinear optical materials.He received his M.S. and Ph.D. degrees in Physical Chemistry from the University of Cincinnati and a B.S. in Chemistry from North Dakota State University.

Table 1. Comparison between DP based fault mitigation and Multi-Stage PID based fault mitigation.
Ash Thakker received his PhD in Engineering Science from Virginia Tech in 1974 and MBA from Florida Inst. of Technology in 1980. Dr. Thakker is CEO of Global Technology Connection, Inc., in Atlanta, Georgia. Global Technology develops and commercializes Predictive Analytical Solutions specifically for equipment and machinery asset monitoring, such as aircraft, ground vehicles, generators, chillers, battery systems etc. Dr. Thakker has held leadership positions in high tech Fortune 500 companies.