Aircraft Troubleshooting Optimization Using Case-based Reasoning and Decision Analysis


 
 
The Fault Isolation Manual (FIM) can be seen as a specialist system that carries the expectations and expertise of engineers and technical teams concerning the operation of aircraft components and systems. It is essentially a manual that tells maintainers which actions to perform in given situations to properly isolate a fault. Although the FIM is the most common tool assisting maintainers in the troubleshooting process today, it does not adequately consider field experience and it does not address situations where the maintenance operator has limited resources, such as a lack of tools and equipment. These drawbacks are essentially caused by the lack of flexibility or adaptability of this method, since it is a static manual. Several dynamic methods have been studied in the field of system troubleshooting and aircraft maintenance, such as Artificial Neural Networks, Support Vector Machines, K-Nearest Neighbors and many other machine learning algorithms. These techniques are considered very powerful and useful; however, training such data-driven strategies requires a large amount of data to provide a reliable result.
In this context, the present work proposes a combination of data-driven and legacy knowledge-based approaches. The following techniques are employed to integrate these concepts: decision trees, which capture the legacy knowledge by basing their topology on the FIM; truth tables and decision analysis, which uses Bayes' rule to assist the decision-making process; and case-based reasoning, a technique that enables learning from field experience.
 
 



INTRODUCTION
According to Kinnison and Siddiqui (2013), maintenance is the process of ensuring that a system continually performs its intended function at its designed-in level of reliability and safety. Besides that, maintenance of an aircraft has a significant impact on the direct operating costs for an airline.
Complex systems such as aircraft and power plants are subject to failures whose root cause cannot be identified without detailed maintenance investigation. The process of identifying the faulty components and repairing them is called troubleshooting.
Some troubleshooting strategies consist of a combination of actions and checks. In such cases, each possible answer to a check question may lead to a different set of troubleshooting actions (or a different sequence of troubleshooting actions). In many applications, the set of all possible actions and questions is known. The troubleshooting problem can then be expressed as finding the optimal sequence of actions and questions (Vianna, 2016).
Decision analysis is a formalized approach to making optimal choices under conditions of risk or uncertainty. In situations where the decision-maker faces no uncertainty or risk, the possible alternatives can be represented by a deterministic model. Under risk or uncertainty, the best decision or alternative becomes, depending on the variables involved, much less evident. In these cases, a probabilistic model is one way to represent a complex problem and find its best solution.
Experimentation is an approach naturally present in traditional troubleshooting strategies, since each experiment is performed to improve the prior knowledge of the faults and so provide a more precise reference for the expected payoff of each alternative.
The mathematical rule that establishes a relation between posterior and prior knowledge is Bayes' theorem (Nyberg, 2018). It is given by:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{1}$$

where A and B are events and P(B) ≠ 0. P(A | B) is the conditional probability of event A given that event B is true. P(B | A) is the conditional probability of event B given that event A is true. P(A) and P(B) are the probabilities of occurrence of events A and B, respectively.
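As a minimal illustration of how Bayes' theorem applies to fault isolation, the sketch below updates a set of fault probabilities after one experiment outcome is observed. The fault names, prior values and likelihood table are hypothetical and chosen only for this example; the likelihoods are 0 or 1 because an experiment without false positives or false negatives is assumed.

```python
def bayes_update(prior, likelihood):
    """Return P(fault | outcome) for every fault.

    prior:      dict fault -> P(fault)
    likelihood: dict fault -> P(outcome | fault)
    """
    # P(outcome), the denominator of Bayes' theorem
    evidence = sum(prior[f] * likelihood[f] for f in prior)
    if evidence == 0:
        raise ValueError("outcome has zero probability under the prior")
    return {f: prior[f] * likelihood[f] / evidence for f in prior}


# Hypothetical priors for three candidate faults
prior = {"F1": 0.5, "F2": 0.3, "F3": 0.2}
# Hypothetical experiment: positive only when F1 or F2 is present
likelihood_positive = {"F1": 1.0, "F2": 1.0, "F3": 0.0}

posterior = bayes_update(prior, likelihood_positive)
```

Under these assumed values a positive outcome rules out F3 and rescales the remaining probabilities so that they sum to one.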
In decision analysis, a powerful tool used to represent a strategy or sequence of actions that contains conditional control statements is the decision tree (Chen, Zeng, Lloyd, Jordan and Brewer, 2004). While decision trees are not always the most competitive classifiers in terms of prediction, they enjoy the crucial advantage of yielding human-interpretable results. Another important tool used in various fields of science is the truth table, which represents a system or process in terms of its inputs and outputs. A decision tree can be easily converted into a truth table that describes the possible existing paths.
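The conversion from a decision tree to a truth table can be sketched as a traversal that records the experiment outcomes along each root-to-leaf path. The tuple-based tree encoding and the names E1, E2, F1, F2 and F3 below are assumptions made for illustration, not the representation used in this work.

```python
# A node is either a fault label (a leaf, given as a string) or a tuple
# (experiment_name, subtree_if_positive, subtree_if_negative).
tree = ("E1",
        ("E2", "F1", "F2"),
        "F3")


def tree_to_truth_table(node, path=()):
    """Return one row per leaf: (fault, {experiment: outcome})."""
    if isinstance(node, str):
        # Leaf reached: the accumulated path is the row for this fault
        return [(node, dict(path))]
    experiment, if_positive, if_negative = node
    rows = tree_to_truth_table(if_positive, path + ((experiment, True),))
    rows += tree_to_truth_table(if_negative, path + ((experiment, False),))
    return rows


table = tree_to_truth_table(tree)
```

Each row lists only the experiments that lie on the path to that fault, so F3 in this hypothetical tree needs the result of E1 alone.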
In the domain of Fault Detection and Isolation (FDI), model-based techniques usually employ the analysis of residuals generated by the difference between measurements of the system outputs and the estimates of their physical models (state observers being a broadly known modeling method). On the other hand, concerning data-driven techniques, Manjurul Islam and Kim (2019) state that SVM is still a state-of-the-art FDI algorithm in many cases. Artificial Neural Networks (ANN) have also been applied in FDI for distinct purposes (Cirrincione et al., 2020). Liao and Köttig (2016) proposed a hybrid/fusion prognostics framework, both data-driven and model-based, that interfaces a classical Bayesian model-based prognostics approach, namely the particle filter, with two data-driven methods in order to improve the prediction accuracy. The first data-driven method establishes the measurement model to account for indirect measurements. The second data-driven method extrapolates the measurements beyond the range of available measurements and feeds them back to the model-based method, which further updates the particles.

PROBLEM AND HYPOTHESES
Various authors address the improvement of the fault detection and isolation process since it is, in many cases, considered not fully efficient. However, the optimization of the fault isolation process itself is not a commonly addressed subject. Even with several criticisms of the fault isolation framework and practices, the main aircraft manufacturers have not adopted a substantially different way of generating it.
Fault isolation and correction strategies have been an object of study for years. However, most authors have employed, in general, one of two methods. The first method is an entirely data-driven approach that implements techniques such as Artificial Neural Networks (ANNs) (Cirrincione et al., 2020), (Jung et al., 2018), Support Vector Machines (SVM) (Manjurul Islam & Kim, 2019) and many other machine learning algorithms. This strategy is limited and very dependent on the volume of data available. The second approach corresponds to the implementation of specialist systems based on the legacy knowledge of the company or developer. The conventional aeronautical Fault Isolation Manual (FIM) (Embraer, 2020) exemplifies this approach. The limitation of this approach is its reduced capacity for updating and evolving according to field experience.
The hypothesis of this study is that there is a gap between the fully data-driven approach and the legacy knowledge-based systems. The solution proposed to cover this gap consists in combining aspects of these two types of approach, adding their advantages and minimizing their vulnerabilities and limitations. To that end, two main techniques are employed in the present study: decision trees, which capture the legacy knowledge in their topology; and case-based reasoning, a technique that enables learning from field experience.

General Approach
As commented before, a troubleshooting strategy is basically a sequence of tests and experiments executed as a means of isolating one fault or a small group of faults, so that a corrective action can finally be performed to fix the fault. If the fault isolation process ends prematurely, some faults may not be isolated, requiring the maintainer to go through a "trial and error" approach to solve the problem. Although this situation may lead to increased aircraft downtime and workload, maintainers sometimes face situations in which a lack of resources and infrastructure limits the isolation process, and a "trial and error" approach may be the best remaining alternative to recover from a fault.
The complete troubleshooting (TS) process encompasses the fault isolation and the repairing process. Fault isolation refers to all tests and experiments performed before the corrective action that correctly fixes the fault. An experiment, also called a test, is a type of action that intends to acquire symptoms or gain information about the system in order to isolate a fault. Examples of experiments include checking continuity on system wires, checking the power source and testing equipment fuses. The repairing process comprises only the execution of an effective corrective action. A corrective action is an action that aims to fix the fault, such as replacing a component or repairing wiring damage. The relation between the troubleshooting cost and the fault isolation cost can be expressed as:

$$\text{Troubleshooting cost} = \text{Fault isolation cost} + \text{Cost of the effective corrective action} \tag{2}$$

In this work two probabilistic models are proposed: the heuristic model and the global optimization model. Both follow the similar structure of steps presented in Figure 1. The main difference between the two models lies in the calculation of the expected economy (step 2). The first step is to gather all data and structure the problem using decision trees and truth tables, which make it simple to manipulate the data and perform the necessary operations. After that, the most important metric is determined: the expected economy of each experiment (heuristic case) or of the whole TS (global optimization case). Using this parameter, it is possible to decide whether it is better to perform an experiment or a corrective action. If an experiment is selected, the current representation of the system is updated and the process of choosing the next best action restarts. If a corrective action is selected, the effectiveness of the solution is verified. If the corrective action correctly fixes the fault, the TS ends and the repaired fault is added to the fault history database of the algorithm. Otherwise, the current TS scope of possibilities is updated and the decision-making process starts again.

Heuristic Model
The heuristic model follows a greedy approach in which the next action is chosen based only on the possible outcomes for the current state of the troubleshooting process. In this way, each decision is a short-term goal that is not necessarily the best alternative to optimize the whole TS. It requires little processing capacity and is a fast solution compared with the global optimization model. This model uses a mix of the trial-and-error method and an experiment-oriented approach.
The EIC incorporates the chance of the corrective action being non-effective by adding the downtime cost (DTC) to its original cost. The DTC was estimated based on the methodology proposed by Saltoglu, Humaira and Enalhan (2009). The costs and probabilities of the corrective actions and faults are represented, respectively, by the functions Cost() and P(). Then, in order to maximize the economy of each experiment and reduce the risk of the model (favoring experiments over corrective actions), the greatest expected value is used as the reference EIC without experimentation, as shown in Eq. (4).
Assuming that an experiment will be performed before any corrective action, the EIC needs to be determined for both potential results of the experiment. Therefore, the EIC with experimentation is given by Eq. (5), whose terms are defined in Eqs. (6) and (7). Finally, the economy (Econ) of each experiment, which is the main decision-making metric, can be found using Eq. (8).
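The greedy comparison described above can be sketched in code. This is only one possible reading, not a reproduction of Eqs. (3) to (8): it assumes that the cost of a corrective action is its own cost plus the DTC weighted by its chance of being non-effective, that the greatest such value is the reference without experimentation, and that the economy is the reference minus the expected cost after the experiment. All function names and numeric values are hypothetical.

```python
def eic_reference(p, ca_cost, dtc):
    """Assumed EIC without experimentation: worst case over candidate actions."""
    return max(ca_cost[f] + (1.0 - p[f]) * dtc for f in p if p[f] > 0)


def condition(p, detects, outcome):
    """Bayes update for an experiment with no false positives or negatives.

    detects: set of faults for which the experiment result is positive.
    Returns (P(outcome), posterior fault probabilities).
    """
    kept = {f: q for f, q in p.items() if (f in detects) == outcome}
    total = sum(kept.values())
    posterior = {f: q / total for f, q in kept.items()} if total > 0 else {}
    return total, posterior


def economy(p, ca_cost, dtc, exp_cost, detects):
    """Assumed economy: reference EIC minus expected EIC with the experiment."""
    with_experiment = exp_cost
    for outcome in (True, False):
        prob, posterior = condition(p, detects, outcome)
        if prob > 0:
            with_experiment += prob * eic_reference(posterior, ca_cost, dtc)
    return eic_reference(p, ca_cost, dtc) - with_experiment


# Hypothetical data: three faults, each fixed by its own corrective action
p = {"F1": 0.5, "F2": 0.3, "F3": 0.2}
ca_cost = {"F1": 100.0, "F2": 200.0, "F3": 300.0}
econ = economy(p, ca_cost, dtc=1000.0, exp_cost=50.0, detects={"F1"})
```

Under these assumptions, a greedy policy would run the experiment with the largest positive economy and fall back to a corrective action when no experiment has a positive economy.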

Global Optimization Model
The global optimization model, on the other hand, follows an exhaustive search approach. It relies on determining the best sequence of actions (experiments and corrective actions) to minimize the cost of the entire troubleshooting process. In order to do so, it starts by simulating all possible sequences of corrective actions only.
The set of candidate strategies is

$$S = \{ s_1, s_2, \dots, s_{m!} \} \tag{9}$$

where each strategy s is an ordering of the m corrective actions. Each strategy s has m elements and m possible subsets. The subsets are the effective corrective actions performed in a given scenario. For example, a strategy given by $s = \{CA_1, CA_2, CA_3\}$ will have the following subsets: $sub_1 = \{CA_1\}$, $sub_2 = \{CA_1, CA_2\}$ and $sub_3 = \{CA_1, CA_2, CA_3\}$. The Expected Cost of Repair (ECR) of a subset of a strategy depends on j, the number of elements of the subset. The next step is to calculate the ECR of the complete troubleshooting strategy using the formula described in Eq. (14).
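The enumeration described here, in which m corrective actions give m! candidate orderings and each ordering has m nested subsets, can be sketched as follows. The action labels CA1 to CA3 and the function names are hypothetical; the ECR formulas themselves are not reproduced.

```python
from itertools import permutations


def strategies(corrective_actions):
    """All orderings of the corrective actions (m! strategies)."""
    return list(permutations(corrective_actions))


def subsets(strategy):
    """The m nested subsets of a strategy: its first j actions, j = 1..m."""
    return [strategy[:j] for j in range(1, len(strategy) + 1)]


all_strategies = strategies(["CA1", "CA2", "CA3"])
first_subsets = subsets(all_strategies[0])
```

The factorial growth of the number of strategies is what makes the exhaustive search expensive as the number of corrective actions increases.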
Therefore, the ECR used as reference is the one given by Eq. (15).
The ECR for the complete troubleshooting using experiments is calculated by determining the expected value of a troubleshooting process that only performs a corrective action when the system fault is known with 100% certainty. To meet this condition, it is necessary to consider all path costs, as shown in Eq. (16).
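$$ECR(E_1, E_2, \dots, E_n) = \sum_{i} P(F_i) \cdot PC(F_i) \tag{16}$$

where the sum runs over all candidate faults $F_i$. The path cost (PC) of each fault is the cost of all experiments necessary to isolate that single fault plus the cost of the corrective action that fixes the fault found.
Mathematically, the path cost can be expressed by:

$$PC(F_i) = \sum_{E \in \mathcal{E}_i} Cost(E) + Cost(CA_i) \tag{17}$$

where $\mathcal{E}_i$ includes all experiments that are essential to diagnose fault $F_i$ with complete certainty and $CA_i$ is the corrective action that fixes it.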
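The path cost and the probability-weighted sum over faults can be illustrated with a short sketch. The fault and experiment names, costs and probabilities are hypothetical, and the mapping from each fault to its essential experiments is assumed to come from the truth table.

```python
def path_cost(fault, essential, exp_cost, ca_cost):
    """Cost of the experiments needed to isolate `fault` plus its repair cost."""
    return sum(exp_cost[e] for e in essential[fault]) + ca_cost[fault]


def expected_cost_with_experiments(p, essential, exp_cost, ca_cost):
    """Probability-weighted sum of the path costs of all candidate faults."""
    return sum(p[f] * path_cost(f, essential, exp_cost, ca_cost) for f in p)


# Hypothetical inputs
p = {"F1": 0.5, "F2": 0.3, "F3": 0.2}
essential = {"F1": ["E1", "E2"], "F2": ["E1", "E2"], "F3": ["E1"]}
exp_cost = {"E1": 10.0, "E2": 20.0}
ca_cost = {"F1": 100.0, "F2": 200.0, "F3": 300.0}

ecr = expected_cost_with_experiments(p, essential, exp_cost, ca_cost)
```

With these values the path costs are 130, 230 and 310, giving an expected cost of 196.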
Finally, the economy for the global optimization model is obtained by comparing the reference ECR with the ECR of the complete troubleshooting using experiments.
After each decision (experiment or corrective action), the fault probabilities are updated depending on its outcome. The probabilities of the remaining faults are determined by Bayes' rule, as shown in Eq. (1).
After the fault isolation and repairing process is complete, the initial probabilities of each fault are revised for the next troubleshooting process. This revision is made by incorporating the last fault fixed into the database and calculating the proportion of each fault relative to the total fault history record.
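The revision of the initial probabilities from the fault history can be sketched as a simple frequency count. The list-based history and the function name are assumptions made for illustration; a real fault history database would replace the in-memory list.

```python
from collections import Counter


def revise_priors(history, fixed_fault):
    """Append the fault just repaired and return each fault's proportion."""
    history.append(fixed_fault)
    counts = Counter(history)
    total = len(history)
    return {fault: n / total for fault, n in counts.items()}


# Hypothetical fault history before the latest repair
history = ["F1", "F1", "F2"]
priors = revise_priors(history, "F3")
```

This is the step through which field experience enters the models: every completed troubleshooting shifts the priors used in the next one.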

CASE STUDY
The proposed method was applied to three datasets presented at the end of the paper.
It is assumed that only single failures occur in the system being studied. It is also assumed that experiments present no false positive or false negative indications.

Dataset 1 Simulation
To simulate the performance of the models created, a dataset with 1000 samples of faults was generated. Along with the faults, the dataset also contains the result of each experiment for its associated fault. The results of the experiments respect the relations between experiments and faults established by the decision tree. The fault probabilities in this dataset follow exactly the same probabilities input to the models. This dataset is identified as dataset 1 and can be seen in Table 1.
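A dataset of this kind can be generated as sketched below. The sketch assumes that each probability times the sample size is a whole number, so that the fault proportions match the input probabilities exactly, and it uses a hypothetical truth table; it is not the generator used to produce Table 1.

```python
import random


def generate_dataset(p, truth_table, n=1000, seed=0):
    """Build n samples whose fault proportions match p, in random order.

    truth_table: dict fault -> {experiment: outcome}, consistent with the tree.
    """
    rows = []
    for fault, prob in p.items():
        count = round(prob * n)
        # A fresh dict per sample so rows can be edited independently
        rows += [{"fault": fault, **truth_table[fault]} for _ in range(count)]
    random.Random(seed).shuffle(rows)
    return rows


# Hypothetical inputs
p = {"F1": 0.5, "F2": 0.3, "F3": 0.2}
truth_table = {
    "F1": {"E1": True, "E2": True},
    "F2": {"E1": True, "E2": False},
    "F3": {"E1": False, "E2": False},
}
dataset = generate_dataset(p, truth_table)
```

Datasets with mismatched probabilities can be produced with the same function by passing probabilities different from those given to the models.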
After the 1000 simulations of the fault isolation and repairing process for each of the troubleshooting strategies (heuristic model and global optimization model), the mean cost of troubleshooting is presented in Table 2.
The "TS mean cost" refers to the average cost of the fault isolation and repairing process. Thus, it includes all the actions (experiments and corrective actions) executed to diagnose and fix the fault. The global optimization model is the best strategy, presenting a cost reduction of 3.5% compared with the heuristic approach. Table 2 also presents a comparison of the "Fault isolation mean cost" between the two models. It is worth noting that the cost of the repairing action is common to all models, since each fault is fixed by a unique corrective action and the same dataset of faults was used in all model simulations. Thus, analyzing the fault isolation cost is valuable for observing the real gain in performance, since the repairing action cost is a fixed term. In addition, this metric tends to be a fairer measure of the performance gain: if the repairing action cost is much greater than the fault isolation cost, the TS cost reduction of an optimized strategy will always be irrelevant or barely noticeable. Using the fault isolation mean cost, it is possible to observe a cost reduction of more than 50% when comparing the global optimization model with the heuristic model.
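The dilution effect of the fixed repairing cost can be made explicit. The derivation below is a sketch that assumes the TS mean cost of each model is the sum of its fault isolation mean cost and a repairing cost R common to both models; the symbols $FI_h$, $FI_g$, $TS_h$, $TS_g$ and $R$ are introduced here only for illustration.

```latex
\begin{align*}
TS_h &= FI_h + R, \qquad TS_g = FI_g + R \\
\frac{TS_h - TS_g}{TS_h}
  &= \frac{FI_h - FI_g}{FI_h + R}
   = \underbrace{\frac{FI_h - FI_g}{FI_h}}_{\text{fault isolation reduction}}
     \cdot \frac{FI_h}{FI_h + R}
\end{align*}
```

With the reported TS reduction of 3.5% and a fault isolation reduction above 50%, the factor $FI_h / (FI_h + R)$ must be below roughly 0.07, which means that under this assumption fault isolation accounts for only a small share of the heuristic model's TS mean cost in dataset 1.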

Dataset 2 Simulation
The second simulation also aims to model the costs incurred by each of the strategies; however, it uses a different dataset from the previous simulation. This new dataset is called dataset 2. Although the structure of columns, number of samples (1000) and format of data are similar (Table 1), the quantity and probability of each fault are slightly different from dataset 1. The probability changes can be seen in Table 8 (appendix).
Even though the dataset has been modified, the probabilities inside the models remain the original ones. Thus, there is a contrast between what is modeled and the real data consumed. The intention of this simulation is to reproduce a real operating scenario where the probabilities considered when structuring the FIM do not match perfectly the probabilities of the faults in the field.
In this scenario, the global optimization model is the one that presents the lower troubleshooting mean cost. The difference between the model with the lower costs (global optimization) and the heuristic model in this simulation is approximately 1.3%, which shows that overall performance was very similar. The comparison of the fault isolation mean cost can be seen in Table 3. As expected, the difference between the models is more evident in this metric: the fault isolation cost reduction of the global optimization model compared with the heuristic model was approximately 23%. It can be seen that when the mismatch of probabilities between model and dataset is greater, the superiority of the global optimization model over the heuristic model is less appreciable.

Dataset 3 Simulation
The third simulation also aims to model the costs incurred by each of the strategies; however, it uses a different dataset from the previous simulations. This new dataset is called dataset 3. Although the structure of columns, number of samples (1000) and format of data are similar (Table 1), the quantity and probability of each fault are completely different from dataset 1. The probability changes can be seen in Table 8 (appendix).
Even though the dataset has been modified, the probabilities inside the models remain the original ones. Thus, there is a contrast between what is modeled and the real data consumed. The intention of this simulation is to reproduce a situation where the probabilities considered when structuring the FIM are disconnected from the real probabilities of the faults in the field. This is considered a possible scenario, since the FIM can be based on initial expectations of component MTBF that may not hold in some cases, or suppliers of the components may change over the years without the FIM being updated to incorporate the resulting probability changes.
In this scenario, the heuristic model is the one that presents the lower troubleshooting mean cost. The difference between the model with the lower costs (heuristic) and the global optimization model in this simulation is approximately 1.1%, which shows that the gain in performance was not very relevant. The comparison of the fault isolation mean cost can be seen in Table 4. As expected, the difference between the models is more evident in this metric. Even so, the fault isolation cost reduction of the heuristic model compared with the global optimization model was close to 5%. Again, there was no significant gain in performance.

Limited resources case
This simulation represents an infrequent but possible real-case scenario in an airline maintenance operation. In a limited-resources situation, the maintenance crew does not possess, or does not have access to, the tools and equipment needed to perform experiments and follow the FIM instructions. As a result, there are basically two alternatives for executing the troubleshooting: the maintenance team performs only a portion of the experiments prescribed by the FIM and executes corrective actions, assuming the risk that they are non-effective; or the company travels to another site to gather the necessary equipment and then flies back to the site where the aircraft is located. The second alternative generates significant additional costs in transportation and logistics. For this case, these charges were estimated at around US$8904, which represents the direct operating cost (DOC) of two 1 h flights of a medium narrow-body commercial aircraft according to the Form 41 financial data provided by the U.S. Department of Transportation in 2013. For this simulation, the resource limitation is exemplified by the unavailability of an oscilloscope, which generates the extra experiment cost of US$8904; this cost was added to an arbitrary experiment for all models. The experiment selected to incorporate this extra cost was E2. Dataset 1 was used to run the 1000 simulations. The probabilities input to both the heuristic and the global optimization models were consistent with the fault probabilities of the dataset.
The mean costs of TS using the FIM, the heuristic model and the global optimization model are shown in Table 5. The cost reduction presented by the global optimization strategy is extremely significant. The fault isolation mean cost of the heuristic model is more than 3 times the fault isolation cost of the other model. This shows that the heuristic model does not adapt efficiently in this case.

Processing Time
Due to the enormous quantity of mathematical operations, the algorithm developed for the global optimization approach determines the expected value of only a portion of all possible sequences. A limit of 1000 loops per search was established to allow the simulation to complete in a reasonable period of time. To run the 1000 troubleshooting simulations, the global optimization model took approximately 300 minutes. In comparison, the heuristic model took only 6 minutes.
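A capped exhaustive search of this kind can be sketched as follows. The sketch assumes that the cap simply keeps the first sequences in enumeration order, which may differ from how the sequences were selected in this work, and it uses a hypothetical cost function for trial-and-error sequences of corrective actions.

```python
from itertools import islice, permutations


def best_sequence(actions, evaluate, max_loops=1000):
    """Evaluate at most max_loops orderings and return the cheapest one seen."""
    best, best_value = None, float("inf")
    for seq in islice(permutations(actions), max_loops):
        value = evaluate(seq)
        if value < best_value:
            best, best_value = seq, value
    return best, best_value


# Hypothetical fault probabilities and corrective action costs
p = {"F1": 0.5, "F2": 0.3, "F3": 0.2}
cost = {"F1": 100.0, "F2": 200.0, "F3": 300.0}


def trial_and_error_cost(seq):
    """Expected cost when actions are tried in order until the fault is fixed."""
    total, spent = 0.0, 0.0
    for fault in seq:
        spent += cost[fault]        # cumulative cost up to this action
        total += p[fault] * spent   # paid when this action is the effective one
    return total


full = best_sequence(["F3", "F2", "F1"], trial_and_error_cost)
capped = best_sequence(["F3", "F2", "F1"], trial_and_error_cost, max_loops=1)
```

With the cap set to one loop the search returns the first ordering it sees, which illustrates that a truncated search trades optimality for run time.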

CONCLUSION
This work proposes a method to assist the troubleshooting process using a data-driven approach together with knowledge-based information. A heuristic approach using a greedy solution is proposed, as well as a global optimization approach using an exhaustive search solution. Both methods were evaluated on three different datasets and in two possible scenarios, the first with unlimited resources and the second with limited resources. Results showed that the global optimization model generally presented better results, although it required more computational effort. It is worth noting that the exhaustive search model performs better than the heuristic model in a high/medium certainty environment. The heuristic model can add value in scenarios of high uncertainty and restricted computational resources. Future work includes considering multiple failure conditions, adding combinatorial optimization techniques to narrow the exhaustive search, and evaluating heuristics different from the ones proposed in this work.