Complementary Meta-Reinforcement Learning for Fault-Adaptive Control

Faults are endemic to all systems. Adaptive fault-tolerant control maintains degraded but acceptable performance when faults occur, instead of allowing unsafe conditions or catastrophic events. In systems with abrupt faults and strict time constraints, it is imperative for control to adapt quickly to system changes to maintain system operations. We present a meta-reinforcement learning approach that quickly adapts its control policy to changing conditions. The approach builds upon model-agnostic meta learning (MAML). The controller maintains a complement of prior policies learned under system faults. This "library" is evaluated on the system after a new fault to initialize the new policy. This contrasts with MAML, where the controller derives intermediate policies anew, sampled from a distribution of similar systems, to initialize a new policy. Our approach improves the sample efficiency of the reinforcement learning process. We evaluate our approach on an aircraft fuel transfer system under abrupt faults.


INTRODUCTION
No physical system is immune to degradation, changing environments, and faults. Since such situations can occur during operation, it is important that the system respond to these changes in a way that allows it to continue operating, even if in a degraded manner. This ensures safety and cost-effectiveness through less downtime. Fault-tolerant control (FTC) (Blanke, Kinnaert, Lunze, Staroswiecki, & Schröder, 2006) seeks to keep a faulty system operating, but within an acceptable margin of sub-optimal performance. This relaxes the constraint on designers to make a system completely fail-safe, and allows for a tradeoff between design and operating costs.
Data-driven approaches to FTC (MacGregor & Cinar, 2012; Hong, Tian-You, Jin-Liang, & Brown, 2009) exploit the preponderance of data collected from system operations. They generate models that avoid the need for time-consuming and accurate physics-based simulations of system dynamics to analyze and respond to different situations that may occur in the system. However, such methods depend on the data to span the breadth of operating conditions, and the model has to contain sufficient detail to capture multiple operating modes. This represents another compromise between design and operating costs.
In many cases, systems are complex, the number of possible faults is large, and faults that have not been seen before can occur during operations. There may be no precedent in the data to model such behaviors. A data-driven control approach will then lack "ground truth" from which to learn and recall a sufficiently good control policy. Reinforcement learning (RL) is a semi-supervised approach to machine learning. It forfeits dependence on labelled ground truth and instead relies on accumulated feedback (i.e. experience gained) from a sequence of actions to determine a globally optimal policy. This ability to learn during operations alleviates design-time effort and costs.
RL relies on gathered experience to accurately evaluate actions. This can be represented as a dynamic programming problem (Bellman, 1966) that typically has a closed-form solution but, for large systems, suffers from the curse of dimensionality. Advancements to RL have used function approximations of values to overcome the computational intractability of the problem (Boyan & Moore, 1995; Baird, 1995). However, the dependence on data to learn such approximations limits how quickly and how accurately an RL-based controller can accommodate faults.
In our past work (Ahmed, Quiñones-Grueiro, & Biswas, 2020), we developed data-driven models to supplement experience with the real environment and simulate faults. In this work, we employ meta-RL for faster adaptation of the RL algorithm to collected data samples. Our approach does not depend on the time-consuming step of first learning a data-driven model, although one can be used. The next section provides a background on RL and meta-RL. Section 3 describes our approach, and Section 4 evaluates it on a simulation of a fuel-transfer system. Finally, Section 5 places our work in the context of extant research.

PRELIMINARIES
This section briefly introduces the RL approach and then discusses Model-Agnostic Meta Learning in the context of RL-based control.

Reinforcement Learning
Reinforcement Learning (RL) is a semi-supervised approach to machine learning. An RL problem consists of a controller interacting with its environment. The environment can be modeled as a single Markov Decision Process (MDP) sampled from a population of available processes, p ∼ P. At time t, the controller perceives the environment's state x_t ∈ X, and uses its policy π : X → U to take an action u_t ∈ U. The environment transitions into a new state x_{t+1}, governed by its transition function T : X × U → X, and emits a reward signal r_t ∈ R, defined by the function R : X × U × X → R. The combination (X, U, T, R) constitutes an MDP, p ∈ P.
The goal of RL is to maximize the return, J^π(x, u), which represents the total discounted cumulative reward for an action from each state when a policy π is followed. A discount factor γ ∈ [0, 1] is used to weigh immediate rewards over delayed rewards and to ensure convergence of the discounted reward function. The maximum future discounted reward for an action from a state is its value V : X × U → R.

Policy gradient algorithms (Sutton, McAllester, Singh, & Mansour, 2000) in RL parametrize π with parameters θ, i.e. π_θ. The parameters θ may be the weights of a model representing the policy, e.g. a neural network. During training, they directly learn π_θ by implicitly optimizing for V, using gradient ascent on the gain function G ← E[J^π(x, u)]. Gradient ascent produces iterative updates to θ, the size of which is determined by the learning rate α ∈ [0, 1].
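In standard notation, a restatement consistent with the definitions above (not necessarily the paper's exact equations), the return, value, and gradient-ascent update take the form:

```latex
% Return under policy pi, value of a state-action pair, and the gradient-ascent
% update on the gain (standard forms; the paper's own equations may differ).
J^{\pi}(x_t, u_t) = \mathbb{E}\Big[\textstyle\sum_{k \ge 0} \gamma^{k} r_{t+k}\Big], \qquad
V(x, u) = \max_{\pi} J^{\pi}(x, u), \qquad
\theta \leftarrow \theta + \alpha \,\nabla_{\theta} G, \quad G = \mathbb{E}\big[J^{\pi_{\theta}}(x, u)\big]
```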
Parameter updates at each iteration are dependent on experienced rewards under the latest policy. This is known as on-policy RL. This approach is sample inefficient because new trajectories of interactions need to be obtained for each version of θ. A way around this is to use importance sampling in the gain function. By modeling the policy as a stochastic function over actions, π_θ(u | x), the relative probabilities, known as importance ratios, of the same trajectory under different policies can be obtained. Thus, the gain function can reuse the same batch of experiences to update the current iteration of parameters θ by weighting cumulative rewards. Equation 2 shows how importance sampling reuses experiences collected under θ_k for the next iterations of policy parameters θ_{k+i}, i ≥ 0. The learning rate is α.
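A plausible form of the importance-sampled gain of Equation 2, written in the notation above (this exact form is an assumption):

```latex
% Gain for parameters theta_{k+i}, re-using trajectories collected under theta_k.
G(\theta_{k+i}) = \mathbb{E}_{(x,u) \sim \pi_{\theta_k}}
  \left[\frac{\pi_{\theta_{k+i}}(u \mid x)}{\pi_{\theta_k}(u \mid x)}\; J^{\pi_{\theta_k}}(x, u)\right],
\qquad
\theta_{k+i+1} \leftarrow \theta_{k+i} + \alpha \,\nabla_{\theta_{k+i}} G(\theta_{k+i})
```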
Large gradient updates may cause the next iteration of π_θ to overshoot and miss the optimum, causing the learning process to diverge altogether. Proximal Policy Optimization (PPO) (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017) clips the size of gradient updates by restricting the importance ratios between iterations, so that a policy does not drastically change between updates. We use PPO in this work to learn the control policy under fault conditions.
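For reference, the clipped surrogate objective from the cited PPO paper, with importance ratio ρ_t(θ), advantage estimate Â_t, and clipping parameter ε:

```latex
% PPO clipped surrogate objective (Schulman et al., 2017).
\rho_t(\theta) = \frac{\pi_{\theta}(u_t \mid x_t)}{\pi_{\theta_{\mathrm{old}}}(u_t \mid x_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]
```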

Model-Agnostic Meta Learning
Meta-learning seeks to speed up a machine learning process through introspection. Essentially, it learns how to learn. In an RL context, meta-learning seeks to quickly adapt a policy trained on one process to another.
Model-agnostic Meta Learning (MAML) (Finn, Abbeel, & Levine, 2017) speeds up the optimization of any model learned through gradient updates. It does so by running an inner introspective loop within each iteration of a gradient update to the model's parameters, which is designated the outer loop. In the inner loop, variants of the process are sampled as p_i ∼ P. The current model parameters θ are then optimized by training for several interactions on each p_i using gradient ascent, yielding θ_i. At the end of the inner loop, gradients on a test set of interactions are computed. In the outer loop, the update to θ is a weighted aggregate of the test gradients from the inner loop. That is, the training step for the outer loop is based on the test step of the inner loop.
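As an illustration, a minimal first-order sketch of one MAML outer-loop update; the helper grad_fn, the toy tasks, and the default learning rates are assumptions for demonstration, not MAML's reference implementation:

```python
import numpy as np

def maml_outer_step(theta, tasks, grad_fn, alpha_in=0.01, alpha_out=0.001, k_in=1):
    """One outer-loop update: adapt theta separately on each sampled task,
    then aggregate the post-adaptation ("test") gradients into one step."""
    test_grads = []
    for task in tasks:
        theta_i = theta.copy()
        for _ in range(k_in):                       # inner loop: adapt to task p_i
            theta_i = theta_i + alpha_in * grad_fn(theta_i, task)
        test_grads.append(grad_fn(theta_i, task))   # first-order MAML: test gradient
    return theta + alpha_out * np.mean(test_grads, axis=0)  # outer update of theta

# Toy usage: each "task" is a target vector and the gradient pulls theta toward it.
tasks = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
theta = maml_outer_step(np.zeros(2), tasks, grad_fn=lambda th, tgt: tgt - th)
```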

Problem Formulation
The controller's problem is thus to exploit its past experiences with different processes, along with sparse interactions under the new process dynamics p′, to quickly converge to a locally optimal policy. The proposed approach for adaptive control operates under the framework depicted in Figure 1. The adaptation pipeline can either be preempted by fault detection, or happen periodically.

Algorithm 1: Model-agnostic meta-learning. Input: parameters θ_k, MDPs P, learning rates α_in, α_out.
The adaptation step begins with a fault. The fault is abrupt, causing a discontinuous change in process dynamics p → p′. The MDP representing the system has changed. In the aftermath of a fault, the controller continues to interact with p′ and records states, actions, and rewards in a memory buffer M using its current policy parameters θ_k. Once sufficient interactions t_update have been buffered, the controller attempts to initialize new parameters θ′ from its memory, and then fine-tunes them to θ_{k+1} by interacting with the new process. Once learning is complete, the controller consolidates the newly learned policy with its prior policies. Thus, when a new fault occurs, it is able to exploit its past experience and adapt faster.
The learning phase consists of two stages: the meta-update using the memory, followed by iterations of any choice of gradient-based reinforcement learning algorithm on the new process. During the meta-update, the controller uses its consolidated prior experience to initialize new policy parameters. The controller can also generate a data-driven model of the system to mitigate the sample inefficiency of RL. After that, the parameters are iteratively updated by the RL algorithm through interactions with the actual system.
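A minimal sketch of this two-stage adaptation loop; the environment interface and the helper callables (act, meta_update, rl_update, update_complement) are assumptions for illustration, not the authors' implementation:

```python
from collections import deque

def adapt_to_fault(env, theta_k, complement, act, meta_update, rl_update,
                   update_complement, t_update=1000, k_rl=50):
    """Buffer interactions under the current policy, meta-initialize new
    parameters from the complement, then fine-tune with a gradient-based
    RL algorithm (e.g. PPO) on the faulty process."""
    memory = deque(maxlen=t_update)
    x = env.reset()
    for _ in range(t_update):                   # keep operating while buffering
        u = act(theta_k, x)                     # sample an action from pi_theta_k
        x_next, r, done = env.step(u)
        memory.append((x, u, r))
        x = env.reset() if done else x_next
    theta_new = meta_update(theta_k, complement, memory)   # stage 1: meta-update
    for _ in range(k_rl):                       # stage 2: fine-tune on the new process
        theta_new = rl_update(theta_new, env)
    complement = update_complement(complement, theta_new, memory)  # consolidate
    return theta_new, complement
```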
Consolidation of knowledge happens via maintaining a complement of prior policies C = {θ | π_θ}. The set of policies is periodically pruned to ensure that they capture diverse behavior but are small enough to evaluate within time constraints. As illustrated in Figure 2, the meta-update initializes policy parameters closer to an optimum, after which RL converges faster to a solution. The meta-update depends on the aggregate gradients of policies in the complement. The gradients are calculated from samples from a data-driven process model updated from a buffer of recent experiences, or from the buffer itself. The meta-update step from θ_k to θ′ is described in Algorithm 2.

Policy meta-update
Our approach mirrors MAML in that there is an outer update loop for the main policy parameters. It depends on the gradients of the test error from the inner loop. We diverge in our formulation of the inner loop. In MAML, the inner loop samples random processes from a population p_i ∼ P defining the MDP. It uses those samples to derive intermediate parameters θ_i from the single starting parameter θ_k. We forego sampling processes anew to derive such intermediate parameters, and instead exploit the history of the controller's experience. In other words, MAML evaluates multiple processes on a single set of parameters. We propose to evaluate a single process on multiple sets of parameters.
Prior to the meta-update, a memory M of interactions under the new process is buffered. The meta-update step assumes a complement C = {θ | π_θ} of prior policies trained on the system under different faults. This foregoes the need to sample an altogether new set of processes for the meta-update. The complement of policies is then trained for a few steps K_in to yield an updated set of meta-parameters. Finally, the test error of the meta-parameters on the process is used to update the outer loop's policy parameters.
Optionally, as a guard against a sub-optimal initialization θ_k → θ′, θ_k is also concurrently updated using standard RL without meta-learning to a baseline parameter θ_b for each iteration k_out of the outer update loop. Finally, the meta-learned parameters and baseline parameters are evaluated on a provided process model p_m. Whichever performs better is returned as the new initialization θ′.
Evaluating policies θ_i ∈ C necessitates new interactions with the changed process p′. This can be achieved by learning a data-driven model p_m of the process using M. However, this introduces an additional computational load on the meta-update step. An alternative approach, already inherent in PPO, is to forego a model altogether and instead use importance sampling (Equation 2) to adjust the gain with respect to θ_i. With importance sampling, the returns already calculated on p′ under θ_k and stored in M can be weighted by the relative probabilities of actions under θ_i. This process is delineated in Algorithm 2 and Figure 3.
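A minimal numpy sketch of this complement-based meta-update using importance-weighted returns from the buffer; the helpers logprob and grad_gain and their signatures are assumptions, and details such as the baseline-guard comparison from the previous paragraph are omitted:

```python
import numpy as np

def meta_update(theta_k, complement, memory, logprob, grad_gain,
                alpha_in=1e-3, alpha_out=1e-3, k_in=4):
    """Initialize new parameters theta' from prior fault policies and the
    buffer of interactions collected under theta_k on the changed process."""
    xs, us, returns = memory                      # trajectories gathered under pi_theta_k
    batch = (xs, us, returns)
    test_grads = []
    for theta_i in complement:                    # inner loop over prior policies
        theta = theta_i.copy()
        for _ in range(k_in):
            # importance ratios re-weight returns collected under theta_k
            w = np.exp(logprob(theta, xs, us) - logprob(theta_k, xs, us))
            theta = theta + alpha_in * grad_gain(theta, batch, w)
        w = np.exp(logprob(theta, xs, us) - logprob(theta_k, xs, us))
        test_grads.append(grad_gain(theta, batch, w))    # test gradient per policy
    # outer step: aggregate the test gradients into the new initialization
    return theta_k + alpha_out * np.mean(test_grads, axis=0)
```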

Population of complement
The final step of the approach is to store the newly learned parameters for future reference. The complement of policies should be populated such that it maximally spans the parameter space. Policies should be different enough that the meta-update has a greater likelihood of adapting to novel faults. The difference between policies is evaluated on the memory of interactions collected by the controller. Each policy in C generates a probability for the actions stored in M. The KL-divergence between these probabilities is used as a metric of difference. The total divergence of each policy from the rest of the complement becomes a score of that policy's uniqueness. Given a complement size |C| ← s, the s most unique policies are kept as the new members of C. Algorithm 3 goes through the process of selecting between the existing and newly learned policies to update C.
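A minimal sketch of this pruning step, assuming discrete action distributions and a hypothetical helper action_probs that returns π_θ(· | x) for each buffered state:

```python
import numpy as np

def update_complement(complement, theta_new, xs, action_probs, size=3):
    """Keep the `size` policies that are most divergent from the rest,
    measured by KL divergence of their action distributions on buffered states."""
    candidates = list(complement) + [theta_new]
    dists = [action_probs(th, xs) for th in candidates]   # [T, |U|] array per policy

    def kl(p, q):
        return float(np.mean(np.sum(p * np.log((p + 1e-8) / (q + 1e-8)), axis=-1)))

    # uniqueness score: total divergence of each policy from all the others
    scores = [sum(kl(dists[i], dists[j]) for j in range(len(candidates)) if j != i)
              for i in range(len(candidates))]
    keep = np.argsort(scores)[-size:]                     # the `size` most unique policies
    return [candidates[i] for i in keep]
```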

EXPERIMENTS
The algorithm was evaluated on a simulation of the fuel transfer system of an aircraft. The system is defined in greater detail in (Ahmed et al., 2020). The objective is to maintain the center of gravity, minimize variance in the fuel distribution, and keep valves closed to avoid unnecessary mass transfer. Faults can include increased valve resistances leading to low flow rates, and increased fuel consumption due to engine faults.
A controller was first trained for 50,000 steps on the nominal system. At the beginning of a trial, a random fault occurred and the controller accumulated experience in memory M. The controller then employed the meta-update step in Algorithm 2 to initialize new policy parameters. Following that, the RL algorithm continued to learn on the new system. As a baseline, an RL controller was trained for |C| × K_in × K_out iterations on p_m, when p_m was provided, followed by learning on the new system p′. For all experiments, a first-order approximation of the gradients ∇_θ G_i, as documented in (Finn et al., 2017), was used.

First, the controller was tested with an empty complement of policies. Second, a complement of 3 policies under simulated faults on the system was generated. The complement was trained on faults in tanks 1, 3, and 5, with no engine faults. In both cases, the controller was tested on the system under random novel faults. The controller was allowed to adapt solely from buffered experiences after a fault, without learning a new environment model.

Figure 5 shows performance with C = ∅. Episodic rewards start off lower than, but comparable to, the baseline. They quickly recover and match the baseline throughout. Of note is the low variance in episode rewards compared to the baseline. Figure 6 shows performance with a complement of 3 policies. The controller starts off with performance similar to the baseline, but quickly pulls ahead and converges to an optimum. The initialization using a populated complement allows the controller to converge to a solution faster.
Additional experiments with different values of learning rates and loop iterations are documented in section 6.2.

RELATED WORK
Reinforcement learning has been explored for control systems. (Lewis, Vrabie, & Vamvoudakis, 2012) surveys RL approaches for feedback control. (Liu, Wang, & Zhang, 2016) attempts to speed up learning of neural network policies for controlling systems by manipulating the parameter update rule.
Approaches besides RL are prevalent in the field of FTC. (Jiang & Zhang, 2006; Zhang & Jiang, 2003) use performance-degraded reference models to generate a library of the system under various conditions. Control is transferred to the policy learned for the most similar model in the library.
Meta-RL for FTC is a nascent field. Recently, (Nagabandi et al., 2018) used model-based RL to quickly adapt control to changed system dynamics. They used MAML and a recurrent network as two approaches to develop a meta-update rule for the environment model parameters. In our case, however, we apply MAML towards updating the policy parameters. Alternatively, (Saemundsson, Hofmann, & Deisenroth, 2018) train a model to predict a latent representation of the environment. The latent variable is fed to the agent as a conditioning variable to represent changed dynamics. Other work uses a recurrent neural network to train a controller on a population of related environments. The controller, being recurrent, has memory of this experience, and therefore learns an internal function to transition between environments as they change.

CONCLUSION
We have proposed a meta-RL algorithm which exploits a controller's past experience under faults to initialize parameters for a new policy under a novel abrupt fault. The meta-update can optionally use a data-driven model to mitigate sample inefficiency, or it can fall back to using importance sampling on buffered experiences to evaluate the complement under current conditions. The newly derived parameters are added to the complement if they are divergent enough from the members of the set, thus ensuring a diverse library of behaviors for faster adaptation to new faults.
MAML can be sensitive to the choice of model architecture, task, and hyperparameters (Antoniou, Edwards, & Storkey, 2018). This merits further investigation into guarantees of convergence and optimality under faults. MAML can be incorporated further into our approach by using meta-learning to update the data-driven model itself. This should further reduce the time taken to learn an updated model and the dependence on the size of the buffered data.

Figure 8. Due to a higher outer learning rate α_out and iteration count K_out, the meta-update shows a larger change in performance. With the help of a full complement, the parameter update is moderated in a direction such that performance variance remains low and shows a higher rate of change.

Figure 9. |C| = 3, α_in = 0.001, α_out = 0.001, K_in = 4, K_out = 2, p_m = ∅. Fault in tank 6. In some faults, the initialization from the meta-update starts at a local optimum.