Effective Maintenance by Reducing Failure-Cause Misdiagnosis in Semiconductor Industry (SI)

Increasing demand diversity and volume in the semiconductor industry (SI) have resulted in shorter product life cycles. This competitive environment, with high-mix low-volume production, requires sustainable production capacities that can be achieved by reducing unscheduled equipment breakdowns. Fault detection and classification (FDC) is a well-known approach, used in the SI, to improve and stabilize production capacities. This approach models equipment as a single unit and uses sensor data to identify equipment failures against product and process drifts. Despite its successful deployment for years, the recent increase in unscheduled equipment breakdowns calls for an improved methodology to ensure sustainable capacities. The analysis of equipment utilization, using data collected from a world-reputed semiconductor manufacturer, shows that failure durations as well as the number of repair actions in each failure have significantly increased. This is evidence of misdiagnosis in the identification of failures and the prediction of their likely causes. In this paper, we propose two lines of defense against unstable and shrinking production capacities. First, equipment should be stopped only if it is suspected as a source of product and process drifts; the second line of defense focuses on more accurate identification of failures and detection of associated causes. The objective is to help maintenance engineers make more accurate decisions about failures and repair actions upon an equipment stoppage. In the proposed methodology, these two lines of defense are modeled as Bayesian networks (BN) with unsupervised learning of structure, using data collected from the variables (classified as symptoms) across production, process, equipment and maintenance databases. The proofs of concept demonstrate that contextual or statistical information other than FDC sensor signals, used as symptoms, provides reliable information (posterior probabilities) to find the source of product/process quality drifts, a.k.a. failure modes (FM), as well as potential failures and causes. The reliability and learning curves show that modeling equipment at the module level rather than at the equipment level offers 45% more accurate diagnosis. The approach contributes to reducing not only the failure durations but also the number of repair actions that have resulted in the recent increase in unstable production capacities and unscheduled equipment breakdowns.
Asma Abu-Samah et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


INTRODUCTION
The SI has revolutionized our daily lives with integrated circuit (IC) chips. On average we use more than 250 chips and 1 billion transistors per day per person. These are installed in almost all the equipment around us, ranging from dishwashers, microwave ovens and flat screens to office equipment. The sales revenues in the SI are characterized by cyclic demand patterns and a positive compound annual growth rate (CAGR) of 8.78% (Shahzad, Hubac, Siadat, & Tollenaere, 2011). The demand for ICs is mainly driven by end-user markets from the electronics industry (EI), e.g. data processing, communication, consumer electronics, the industrial sector and automotive. The SI forms a part of the complex interaction among these multiple industrial sectors (Yoon & Malerba, 2010; Kumar, 2008). Currently, wireless communication and consumer electronics are the leading market segments, whereas automotive is a potential emerging segment. Automotive accounts for 8% of the total SI market but is expected to dominate in the future (Shahzad, 2012). Therefore, demand is increasing not only in volume but also in diversity, which has led to the emergence of a high-mix low-volume production environment and shortening product life cycles in the SI (Shahzad et al., 2011). Success in the SI requires sustainable production capacities to cope with the associated challenges in this complex and highly competitive environment.
The SI production line comprises hundreds of production and metrology/inspection equipment. These are grouped into different workshops, based on operation types. IC chips are manufactured on silicon wafers of 200/300 mm diameter that undergo up to 1100+ elementary operations, depending on the technology. These are processed in lots of 25 wafers, where each wafer contains around 900 chips and costs 6K to 12K US dollars. In such a complex production environment, unscheduled equipment breakdown is the limiting factor for sustainable production capacities. The production capacities and the equipment usage are plotted against the evolution in product mix using data from the thermal treatment (TT) workshop at a world-reputed semiconductor manufacturer. This data is aggregated at the quarter level and spans the last six years (2008Q1 to 2014Q1). It has been manipulated for confidentiality purposes; however, the scale is kept constant to highlight the original trends.
Between 2008Q1 and 2012Q2, production capacities are significantly larger than both scheduled and unscheduled breakdowns (Figure 1(a)). In this period, a slight increase in the product mix can be observed that decreases production capacities. The data up to 2014Q1 show that, with the fluctuation of the product mix, the production capacities suffer instability and a significant decline. Figure 1(b) presents the impact of commonality and differentiation in product mix on equipment between two consecutive quarters. The difference in product mix is plotted on the secondary y-axis; it can be either positive or negative and ranges from -25% to +38%. Product commonality is plotted on the primary y-axis and ranges from 49% to 92%. It can be seen that production capacities increase with the rise in product commonality and are inversely proportional to unscheduled breakdowns. Therefore, production learning curves against demand diversity can be improved not only by reducing the unscheduled breakdowns but also by stabilizing them. This is because instability in the capacities results in quick changes in production planning and reshuffling of production lots. In addition, time-constrained lots result in scrap, which impacts not only the cost but also the cycle time. Figure 1 shows that during the last two years, increasing product mix with differentiation and short product life cycles have resulted in a 30% reduction of production capacities with high instability.
Further analysis using data from the TT workshop is presented in Figure 2 to identify the causes of the recent large increase in unscheduled equipment breakdowns. Besides the reduction in capacities, this increase results in additional costs due to the associated corrective maintenance actions. The analysis is extended to two failures (unscheduled breakdown events), as presented in Figure 2. The data plotted in Figure 2 are for two significant failures, elevator boat rotation and out of control (OC), and have been manipulated for confidentiality reasons. It can be seen that the failure count and the average number of repair actions in each occurrence are inversely proportional to product commonality, which reduces process variations and results in stable capacities and controlled unscheduled breakdowns. In addition, out of control is 30% higher than elevator boat rotation in its occurrence and failure duration. Moreover, the increasing number of repair actions in both failures provides evidence of misdiagnosis of failures and causes. Therefore, the increases in failure duration, occurrence and number of repair actions are the key root factors behind increasing unscheduled breakdowns. Besides the misdiagnosis of equipment failures and causes, misdiagnosis can also occur while identifying the sources of product quality drifts. In a highly complex production environment such as the SI, the likely sources can be the product itself (imperfections from previous process steps or poor design), the process (poor design of operation recipes) or maintenance (poor execution of maintenance actions). However, the equipment is blamed for all product quality drifts. Hence, this paper focuses on correctly identifying the source of product quality drifts, followed by accurate diagnosis of failures and causes. This ought to help reduce not only breakdown occurrences but also their associated durations. This paper is divided into 4 sections. Section 2 presents three axes of literature review: the first highlights the prospect of having sources of product quality drift other than equipment, the second relates to equipment failure-cause diagnosis in the SI, and the third focuses on the Bayesian network as a modeling tool. The proposed methodology and the case study are presented in Section 3, whereas the BN models, proofs of concept and analysis results are presented in Section 4. Finally, we conclude this paper with discussion and perspectives.

LITERATURE REVIEW
In the scope of this work, the SEMI standard definition of failure refers to an unplanned event that changes equipment (system) to a condition where it cannot perform its intended functions, whereas a cause or fault is the reason behind the occurrence of an equipment failure. These are different from the sources of product quality drifts, which are grouped, in our paper, into four categories and are referred to as failure modes (FM): product, process, maintenance and equipment. For example, due to the type of TT equipment (batch cluster), where multiple lots are processed together, a drift might occur in the product due to the influence of different product combinations processed in the equipment. In such a situation, the FM is product and not the respective equipment; therefore, the equipment must not be stopped for diagnosis, which would induce subsequent maintenance actions. Instead, further study must be directed to the combination of products allowed to be processed inside the equipment. In this regard, Section 2.1 presents an analysis of the product quality drift sources. Section 2.2 presents a survey of the existing equipment failure-cause diagnosis methods in the SI, and Section 2.3 presents the choice of BN as the target approach for modeling the FM identification and the equipment failure and cause diagnosis, in order to reduce the increasing unscheduled equipment breakdowns, failure durations and number of repair actions in each failure occurrence.

Source of Product Quality Drift
Analysis of the source of product quality drift can be related to Root Cause Analysis, a study to diagnose the sources of process issues for directing counteractive actions (Rooney & Heuvel, 2004). (Doty, 1996) and (Smith, 1998) use the classification of causes by (Ishikawa, 1990), dividing the root causes into six assignable categories (man, machine, method, material, measure and environment) that explain abnormal situations. It is a well-known qualitative method, frequently used in the diagnosis domain, but it requires long brainstorming sessions with experts and is performed on the occurrence of each new excursion. Therefore, it cannot be used in a complex production environment for all excursions. (Sarkar, 2004) has combined cluster analysis with engineering knowledge to classify big sets of equipment failure events into a small number of categories and uses experts' knowledge to find root causes for each cluster. These are specific to equipment-related failures and do not take into account other potential sources of drifts. (Weidl, Madsen, & Israelson, 2005) model an industrial process and product failure control system using a generic object-oriented BN that proposes corrective maintenance actions with an explanation of root causes. Their set of root causes contains all possible hypotheses on the failure sources or conditions coming from the equipment sensors, process conditions and basic failures in maintenance. This approach does not take into account product-related failure events as the cause of product quality drifts. Besides this, the BN structure is defined by an expert; however, experts' knowledge might need renewal due to the dynamically changing manufacturing environment.
The studies above are important because they open the possibility of finding the true sources of product quality drifts. Examples of potential candidates that can lead to product quality problems, grouped by FM, are listed below.
• Process: different types of production processes, engineering, R&D, process step combination.
• Equipment: gradual build-up on process chamber, machine aging, cleaning, sensor drift.
• Maintenance: preventive and corrective maintenance, ineffective repairs.
In the SI, besides these three categories, the Product (different combinations of wafers, different wafer states, etc.) is an important source to be considered while identifying the source of product quality drifts. Ishikawa's method can be used in our case to find potential symptoms under these categories through brainstorming sessions. As a result, a BN predictive model, if learned using the identified symptoms collected from the production line, can not only give us potential causal relationships between these symptoms and the target failure, without understanding the underlying structure of the process operations, but also provide us with conditional probabilities to define the priorities of the sources. This will act as the first line of defense against the exponential increase in unscheduled equipment breakdowns, failure durations and number of repairs in each failure. The details can be found in Sections 3.1 and 4.1.

Equipment Failure and Cause Diagnosis in the SI
IT revolutions have enabled the handling of huge data volumes with improved artificial intelligence (AI) techniques for failure diagnosis. The techniques commonly used to optimize production operations are advanced process control (APC) methods, which include run-to-run (R2R) loops, statistical process control (SPC) and fault detection and classification (FDC). (Yue & Tomoyasu, 2004; Lacaille & Zagrebnov, 2007; He & Wang, 2007) used the FDC approach to detect and classify equipment failures by calculating several statistics of the parameters collected from FDC sensor data over predefined time windows. These result in indicators which are then monitored through SPC control charts to detect sources of variation in the form of a shift or drift of equipment signals. A comparable approach has been proposed by (Chen & Blue, 2009), using an EWMA (exponentially weighted moving average) chart as a function of the variance and covariance of relevant parametric distributions to assess the quality of equipment. However, this approach differs in its objective from the approaches above, as it integrates all sensors to generate one single index that reflects the overall equipment health against the product quality and is argued to be more robust to recipe and operation changes. (Chang, Song, Kim, & Choi, 2012) proposed a fault detection and classification methodology for the SI using a sequential SVDD (support vector data description) classifier algorithm.
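To make the idea of monitoring an indicator on a control chart concrete, the following is a minimal sketch of a plain univariate EWMA chart. It is not the multivariate health index of (Chen & Blue, 2009); the function name `ewma_chart`, the smoothing weight `lam`, the limit width and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def ewma_chart(values, baseline, lam=0.2, width=3.0):
    """Return one (ewma, lower, upper, out_of_control) tuple per observation.

    `baseline` is an in-control reference sample used to estimate the
    centre line and the standard deviation (hypothetical data here).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    z = mu  # the EWMA statistic starts at the centre line
    results = []
    for t, x in enumerate(values, start=1):
        z = lam * x + (1 - lam) * z
        # Time-varying control limits of the EWMA statistic
        spread = width * sigma * ((lam / (2 - lam)) * (1 - (1 - lam) ** (2 * t))) ** 0.5
        lower, upper = mu - spread, mu + spread
        results.append((z, lower, upper, not (lower <= z <= upper)))
    return results


# Hypothetical indicator values: a stable reference and a slowly drifting run
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
drifting = [10.0, 10.1, 10.4, 10.6, 10.9, 11.2, 11.5]
flags = [flag for _, _, _, flag in ewma_chart(drifting, baseline)]
print(flags)
```

Because the statistic accumulates small deviations, a slow drift is flagged after a few observations even though no single reading is extreme, which is the behaviour such charts are used for.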
A careful analysis of the existing approaches, methods and techniques highlights that, to date, a significant number of equipment parameters (status variable identifications) are collected from sensors during wafer processing to model equipment behavior. With this data, the principal objective is to improve fault detection and failure diagnosis on the equipment. However, due to the frequent change of recipes and the diversity of operations in a high-mix low-volume production environment such as the SI, the overall equipment condition is very difficult to evaluate. Because of these changes, (Blue, Roussy, Thieullen, & Pinaton, 2012) highlighted that FDC indicators based on pattern modeling for specific recipes can have reliability issues, as they are not appropriate to represent the equipment conditions continuously. Hence, they proposed a generalized moving variance (GMV) technique in a hierarchical scheme for monitoring, aiming for a robust estimation of the overall equipment condition based on similar variations in FDC sensor data.
The approaches proposed in the literature focus on either predicting the overall equipment health through an index or diagnosing equipment failure, whereas in most cases the cause assignment is left to the judgment of the maintenance engineer. In all of these approaches, FDC sensor data is used for prediction, which can be misleading in a dynamic production environment. As the second line of defense, the prediction of failures and causes upon an unscheduled equipment breakdown must be modeled at the module level rather than at the equipment level, using contextual and statistical information. Equipment in the SI is composed of modules and sub-modules. Each one represents a system; in previous works, however, the equipment is considered as a single system. A comparison of diagnosis models at the equipment and module levels is crucial to identify the more reliable model as maintenance decision support for engineers. Consequently, the reduction in failure durations and number of repair actions should stabilize and improve capacities, with a reduction in unscheduled equipment breakdowns.

Bayesian Network (BN) as Modeling Tool
The methods used for failure and cause diagnosis range from univariate and multivariate statistical methods to artificial intelligence (AI) and machine learning (ML) methods. There also exist hybrid methods; however, in our context of stabilizing production capacities and reducing unscheduled breakdowns, the objective is not to accurately diagnose equipment failure but to provide potential high-level sources of product quality drift, failures and causes, so that engineers can make more accurate and, if possible, several decision plans on the repair actions. The most promising technique found in the literature that takes into account both uncertainties, as found in the SI, and experts' knowledge is the BN. (Kobbacy, Vadera, McNaught, & Chan, 2011) discuss the various uses of BNs in manufacturing, with emphasis on their applicability when uncertainty is the key characteristic. The BN is based on conditional probability theory and has a compact graphical presentation. (Correa, Bielza, & Pamies-Teixeira, 2009) compare the BN approach with the Artificial Neural Network (ANN) on the problem of product quality prediction, targeting the automotive and aeronautical industries. The BN approach was found to have higher classification accuracy on new sets of measured variables and better interpretability of the resulting network. ANNs have the disadvantage of taking the shape of a 'black box' model, in the sense that the non-linear relationships of cause and effect are not easily interpretable, making it difficult to explain the underlying causal relationship between the input and the output. Other advantages of using a Bayesian network are its inherent abilities for deductive and inter-causal reasoning (Kjaerulff & Madsen, 2006). Deductive (causal) reasoning takes into account the causal links between variables, from causes to effects, using dynamic detection evolution. Inter-causal reasoning is an interesting and powerful ability of the BN, where evidence on one possible cause discredits other possible causes. In addition to its ability to represent causal relationships, the BN can perform learning efficiently in uncertain environments, involving small amounts of related failure data and short temporal changes of states. It can also be used to represent compact joint probability distributions (Margaritis, 2003).
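The inter-causal reasoning described above can be illustrated with a three-node network in which two independent causes share one effect. The sketch below computes posteriors by brute-force enumeration; the variable names and all probability values are hypothetical and are not taken from the case study.

```python
from itertools import product

# Hypothetical priors for two independent causes (1 = present, 0 = absent)
P_C1 = {1: 0.1, 0: 0.9}
P_C2 = {1: 0.1, 0: 0.9}
# Hypothetical P(E = 1 | c1, c2) for the common effect
P_E1 = {(0, 0): 0.01, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}


def joint(c1, c2, e):
    """Joint probability from the factorization P(c1) P(c2) P(e | c1, c2)."""
    p_e1 = P_E1[(c1, c2)]
    return P_C1[c1] * P_C2[c2] * (p_e1 if e == 1 else 1 - p_e1)


def posterior_c1(evidence):
    """P(C1 = 1 | evidence), where evidence maps 'c2' and/or 'e' to 0 or 1."""
    num = den = 0.0
    for c1, c2, e in product((0, 1), repeat=3):
        world = {"c1": c1, "c2": c2, "e": e}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this assignment contradicts the evidence
        p = joint(c1, c2, e)
        den += p
        if c1 == 1:
            num += p
    return num / den


effect_only = posterior_c1({"e": 1})
effect_and_other_cause = posterior_c1({"e": 1, "c2": 1})
print(round(effect_only, 3), round(effect_and_other_cause, 3))
```

Observing the effect alone raises the belief in the first cause well above its prior, while additionally observing the second cause brings that belief back down close to the prior. Real BN tools use propagation algorithms rather than enumeration, which is used here only because the network is tiny.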
(Weber, Medina-Oliva, Simon, & Iung, 2012) present a detailed review of BN applications in the domains of reliability, risk analysis and maintenance. For probabilistic dependability evaluation, comparisons were made with Fault Tree (FT) and Markov chain (MC) models. The FT model, while offering all of the advantages highlighted above, is limited to assessing just one top event per model, as opposed to the BN, which supports multi-state modeling and can evaluate several outputs in the same model. This characteristic is well suited to the selection of alternative actions when a decision must be made on a problem with multiple failure modes and causes.
On the other hand, MC allows the representation of multi-state variables, but the system becomes complex with a large number of variables, which is the case in our context, where we try to integrate different variables coming from distinct data sources. With the BN, this constraint is avoided since the number of parameters within the conditional probability tables is considerably lower than for MC (de Souza e Silva & Ochoa, 1992).
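The difference in parameter counts can be illustrated with simple arithmetic. The sketch below assumes, for illustration only, a discrete-time Markov chain defined over the joint state space of all variables, and compares its free transition probabilities with the free parameters of the conditional probability tables of a BN; the five ternary variables and the chain-shaped structure are hypothetical.

```python
from math import prod


def bn_parameter_count(cardinality, parents):
    """Free parameters of all CPTs: (states - 1) per parent configuration."""
    return sum(
        (cardinality[node] - 1) * prod(cardinality[p] for p in parents[node])
        for node in cardinality
    )


def markov_chain_parameter_count(cardinality):
    """Free transition probabilities of a chain over the joint state space."""
    states = prod(cardinality.values())
    return states * (states - 1)


# Hypothetical example: five ternary variables linked as A -> B -> C -> D -> E
cardinality = {name: 3 for name in "ABCDE"}
parents = {"A": (), "B": ("A",), "C": ("B",), "D": ("C",), "E": ("D",)}

print(bn_parameter_count(cardinality, parents))
print(markov_chain_parameter_count(cardinality))
```

The BN count grows with the size of each node's parent set, whereas the joint state space of the chain grows exponentially with the number of variables.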
The Bayesian network approach has recently become a focus for dynamic maintenance management and failure diagnosis in the SI. (Yang & Lee, 2012; Bouaziz, Zamaï, & Duvivier, 2013) applied BN for diagnostics and prognostics in the SI with the objective of investigating causal relations among equipment conditions and their effects on product quality. Moreover, there exist published methods and algorithms to adapt the BN to specific case studies in the SI (Roeder et al., 2012). In the process industry, (Isham, 2013) proposed a BN to compute dynamic probabilities and update the Fault Semantic Network. Its focus is on real-time, risk-based accident forecasting in the oil and gas sector. Another important use of the BN is as a fault classifier and isolator (Verron, Li, & Tiplica, 2010).
A traditional BN consists of a set of nodes representing random variables (V), a set of edges (E) connecting these nodes to form a directed acyclic graph (DAG) G (Eq. (1)), and conditional probability distribution tables (T) that quantify the probabilistic relationships between nodes. The BN is a graphical representation of a joint probability distribution (Eq. (2)) that represents dependent and conditionally independent relationships.
This probabilistic representation of a system in graphical form allows the relationships among different variables to be monitored. The conditional probability distribution table (CPT) is constructed based on Bayes' rule (Eq. (3)). It states that, for two given events A and B, the probability of A given B is a function of the conditional dependence of B on A and the respective probabilities of the events A and B. It is an efficient feature for modeling causal relationships between a set of events.
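In standard notation, the three relations referred to as Eqs. (1) to (3) can be written as follows. This is a sketch reconstructed from the surrounding definitions of V, E, G, A and B rather than a quotation of the original formulas; the parent-set symbol \mathrm{Pa}(V_i) and the indexing of the variables as V_1, ..., V_n are assumptions.

```latex
% Eq. (1): the BN structure as a directed acyclic graph
G = (V, E) \tag{1}

% Eq. (2): joint distribution factorized over each node given its parents in G
P(V_1, \dots, V_n) = \prod_{i=1}^{n} P\bigl(V_i \mid \mathrm{Pa}(V_i)\bigr) \tag{2}

% Eq. (3): Bayes' rule for two events A and B with P(B) > 0
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{3}
```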
The distribution changes when the states of the nodes in G change following a new event (called evidence). A propagation algorithm is used to fuse and propagate the impact of new evidence and beliefs through the BN, so that each proposition is eventually assigned a certainty measure consistent with the axioms of probability theory (Pearl, 2014).
Therefore, the BN is a powerful method for probabilistic knowledge representation and inference under uncertainty. In this paper we focus on presenting a methodology using BN to stabilize production capacities and reduce failure durations, number of repairs in each failure and unscheduled breakdowns by:
• Identifying the failure modes (category of source for product quality drifts) as Product, Process, Equipment or Maintenance through an unsupervised BN model that is learned using symptoms collected from the production database (Section 4.1).
• Developing integrated failure-cause diagnosis BN models at the module and equipment levels (Sections 4.2 and 4.3). These models are built using symptoms (contextual and statistical information) collected across product, process, maintenance and equipment databases, with unsupervised learning.

PROPOSED METHODOLOGY
In this section, we present the proposed methodology to reduce unscheduled equipment breakdowns and stabilize the capacities by (i) identifying the FMs, so that equipment is stopped only when it is the true cause of a product drift, and (ii) diagnosing potential failures and causes with their respective probabilities. This is followed by the case study description, data pre-processing and a brief presentation of BN structure learning algorithms.

Proposed BN Based Methodology
The proposed methodology is presented in Figure 3.
In step-1, we start with the identification and classification of potential symptoms from the product, process, equipment and maintenance databases. The FDC sensor signals within the equipment database are not directly used as symptoms; however, data/information computed from these signals is used as potential symptoms. This is because emerging sensor reliability issues are linked with high-mix low-volume production and could result in unstable models. The FMs are modeled as a function of symptoms, and the resulting BN for FM identification serves as the first defense against unscheduled equipment breakdowns. It will help engineers make more accurate decisions on equipment stoppage if the product drift is identified as related to Product or Process.
Step-2 of the proposed methodology advocates modeling the equipment failures and causes as a function of symptoms at the module and equipment levels. The objective is to find the model that gives more accurate predictions. In this paper, the concept of prediction is used to represent the inference results of a target node. In the literature, failure diagnosis models are built at the equipment level; however, we strongly believe that these should be modeled at the module level. This is because, in an automated production line with an FDC system, equipment is composed of modules which are modeled in a parent-child relationship, and in the maintenance database modules are considered as equipment. The states of parent and child modules affect one another; therefore, they share common causes. This phenomenon is reflected in the data collected from the maintenance database.
In addition, certain equipment can have more than one processing module, known as a chamber, and if one of the chambers is stopped for maintenance, the respective equipment is still in production but with limited capacity. In equipment modeling, it is important to specify the level at which we work, because it has an implication on the maintenance levels, which define the criteria of intervention such as the personnel involved, the complexity of the actions to be performed, the necessary tools and the associated documents/checklists. The equipment-level BN is modeled and proposed to be updated upon new excursions; any structural change between two consecutive equipment-level BNs will be used as a signal to revise the module-level BNs, with the intervention of an equipment expert. This loopback step is not completed in this case study; however, diagnosis results from the module-level and equipment-level models are compared for their accuracies as the final step of this methodology. Note that in every sub-step of BN structure learning, validation by an expert is required before any prediction can be made.

Description of the Case Study for Thermal Treatment (TT) Workshop
As a case study, we consider TT workshop equipment that is used to grow oxide layers and deposit nitride layers, respectively, on the surface of silicon wafers as dielectrics. This equipment uses low pressure chemical vapor deposition (LPCVD) as the technique to deposit nitride layers. It is also used for annealing (heat treatment) after production steps to stabilize the crystalline structure of silicon wafers prior to the next steps. The equipment type in this production line is batch cluster with two process modules known as reactors. The general structure of this TT equipment is presented in Figure 4. Reactor1, Reactor2 and Mainframe are the three main modules of the equipment. The Mainframe can be further composed of many sub-modules. In this case study, we consider the three modules with the assumption that these constitute the whole equipment. The integrated failure-cause diagnosis BN models at module and equipment levels are therefore developed only for these equipment modules.

Data Processing
The dataset used in this case study spans six months (week 27 to week 52 of 2013) and was collected across the product, process, equipment and maintenance databases for TT equipment. The data are used as the symptoms, failures and causes. The symptoms are classified into four categories and are used to generate the BN that identifies the FM as a function of symptoms (Section 4.1), as well as to develop integrated failure-cause diagnosis BN models at the module and equipment levels (Sections 4.2 and 4.3).

Bayesian Network Learning
The structure of a BN can be obtained either through experts' knowledge or by learning from the data. The structures of the BN models are learned with BayesiaLab 5.3 using score-based unsupervised learning algorithms that use minimum description length (MDL) as the objective function (Lam & Bacchus, 1994). The task of finding a network structure that optimizes the score is a combinatorial optimization problem and is known to be NP-hard (Chickering, 1996), even if we restrict each node to having at most two parents. The standard methodology for addressing this problem is to perform a heuristic search over some space. Many methods have been proposed along these lines, varying both in the formulation of the search space and in the algorithm used to search the space. In this paper, instead of using a single method, the structures of our models are learned using three methods. The initial structure is learned using the Equivalence Class (EQ) algorithm, a heuristic that searches for the highest scoring network across the space of potential BN structures that have the same conditional independence relations (subsection 3.4.1). The learned structure is further optimized using the Tabu and Tabu Order algorithms, methods that complement EQ in terms of search space and exploration strategy (refer to section 3.4.2).
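As a rough illustration of a score-based objective, the sketch below computes a common textbook form of the MDL score for a discrete network: the negative log-likelihood of the data plus a penalty of (log2 N)/2 bits per free parameter. The exact score used by BayesiaLab is not reproduced here; the function name `mdl_score`, the data layout and the toy data are assumptions.

```python
from collections import Counter
from math import log2, prod


def mdl_score(data, parents):
    """MDL-style score of a structure on complete discrete data (lower is better).

    data    : list of dicts mapping variable name -> observed state
    parents : dict mapping variable name -> tuple of parent names
    """
    n = len(data)
    states = {var: {row[var] for row in data} for var in parents}
    total = 0.0
    for var, pa in parents.items():
        # Frequency counts of each family (node with its parent configuration)
        family = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        config = Counter(tuple(row[p] for p in pa) for row in data)
        # Cost of encoding the data given maximum-likelihood CPT estimates
        neg_loglik = -sum(c * log2(c / config[cfg]) for (cfg, _), c in family.items())
        # Cost of encoding the CPT parameters of this family
        free_params = (len(states[var]) - 1) * prod(len(states[p]) for p in pa)
        total += neg_loglik + 0.5 * log2(n) * free_params
    return total


# Hypothetical data: two binary variables that always take the same value
data = [{"A": 0, "B": 0}] * 8 + [{"A": 1, "B": 1}] * 8
independent = {"A": (), "B": ()}
linked = {"A": (), "B": ("A",)}
print(mdl_score(data, independent), mdl_score(data, linked))
```

On this toy data the structure with the arc from A to B obtains the lower score, since the extra parameters cost less than the bits saved in describing B.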
3.4.1. Equivalence Class Framework (Munteanu & Bendou, 2001; Chickering, 2002)
The simplest formulation of the search space is the set of all possible individual DAGs. The intuitive way to find the best network is a greedy search, which starts at an initial structure in the structure space, considers all nearest neighbors of the current structure and moves to the neighbor that has the highest score. The neighbors are all structures that can be generated from the current structure by adding, deleting or reversing a single arc, subject to the acyclicity constraint. If no neighbor has a higher score, a local maximum is reached and the algorithm stops. While the method is simple, it is penalized by the equivalence class property. Two DAGs G and G' are equivalent if, for every Bayesian network B with structure G, there exists a Bayesian network B' with structure G' such that B and B' define the same probability distribution, and thus the same score. This type of search can waste time rescoring the same equivalence class and, in many cases, in order for the algorithm to move from one equivalence class to another, it will have to make numerous moves within the same equivalence class. Furthermore, in a large network, wrong decisions made at an early stage can accumulate, so that the final network ends up very different from the ideal one. In order to overcome these difficulties, the search can be carried out in the space of equivalence classes.
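The greedy search over individual DAGs described above can be sketched as follows. This is an illustrative implementation of the plain DAG-space search, not of the EQ algorithm; the function names and the toy score (agreement with a hypothetical target structure, higher is better) are assumptions.

```python
def is_acyclic(nodes, edges):
    """Repeatedly strip nodes with no incoming edge; a leftover means a cycle."""
    remaining = set(nodes)
    while remaining:
        roots = {n for n in remaining
                 if not any(v == n and u in remaining for u, v in edges)}
        if not roots:
            return False
        remaining -= roots
    return True


def neighbours(nodes, edges):
    """All DAGs reachable by adding, deleting or reversing a single arc."""
    edges = frozenset(edges)
    result = set()
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            if (u, v) in edges:
                result.add(edges - {(u, v)})               # deletion
                flipped = (edges - {(u, v)}) | {(v, u)}    # reversal
                if is_acyclic(nodes, flipped):
                    result.add(flipped)
            elif (v, u) not in edges:
                added = edges | {(u, v)}                   # addition
                if is_acyclic(nodes, added):
                    result.add(added)
    return result


def hill_climb(nodes, edges, score):
    """Move to the best neighbour until no neighbour improves the score."""
    current = frozenset(edges)
    while True:
        best = max(neighbours(nodes, current), key=score, default=current)
        if score(best) <= score(current):
            return current  # local maximum reached
        current = best


nodes = ("A", "B", "C")
chain = frozenset({("A", "B"), ("B", "C")})
# Toy score: minus the number of arcs that differ from the hypothetical target
toy_score = lambda edges: -len(frozenset(edges) ^ chain)
print(sorted(hill_climb(nodes, frozenset(), toy_score)))
```

In practice the score would be a data-driven criterion such as MDL, and the search would start from the structure produced by a previous step rather than from the empty graph.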
Figure 5. Illustration of EQ search strategy.
Figure 5 illustrates the search strategy. This approach consists in allowing the addition of undirected edges, transforming a DAG into a PDAG (partially directed acyclic graph) when no direction is preferred by the score. Edge orientation is delayed until the interactions between edges make it possible to choose a direction on the basis of the score. As the partially directed graphs obtained may be interpreted as equivalence classes, this solution amounts to a modification of the search space: the search algorithm explores the space of equivalence classes of Bayesian networks instead of the space of Bayesian network DAGs. Once all equivalence classes have been explored, the network with the best score is chosen as the final structure.

Tabu and Tabu Order
Tabu search is an extended form of the greedy search algorithm that tries to escape from a local maximum (in the search space of all DAGs) by selecting the solution that minimally decreases the value of the scoring function. Immediate re-selection of the local maximum just visited is prevented by maintaining a list of forbidden solutions (of predefined size), a.k.a. the Tabu list (Glover, 1986; Acid & de Campos, 2003). Figure 6 illustrates this strategy, with the Tabu-listed networks shown in dotted boxes. The search operators involved in transforming one DAG into another are addition, suppression and reversal. When a sufficient number of changes have occurred without improving on the best score encountered during the search, the algorithm terminates and the overall best scoring structure is returned. This strategy typically requires random restarts to find the optimized solution, but starting from the final EQ network greatly reduces the number of restarts as well as the necessary size of the Tabu list. Complementing the EQ method, the Tabu approach explores solutions that might not be considered consistent in the PDAG → DAG transformation of EQ.
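The Tabu strategy itself is independent of what a "solution" is, so it can be sketched on a toy one-dimensional landscape instead of on DAGs. The function name `tabu_search`, the `patience` stopping rule and the landscape values below are assumptions made for illustration; higher scores are treated as better.

```python
from collections import deque


def tabu_search(start, neighbours, score, tabu_size=5, patience=20):
    """Best-neighbour search that accepts worsening moves and forbids
    recently visited solutions through a fixed-size Tabu list."""
    current = best = start
    tabu = deque([start], maxlen=tabu_size)
    stale = 0  # consecutive moves without improving the best score
    while stale < patience:
        candidates = [s for s in neighbours(current) if s not in tabu]
        if not candidates:
            break
        current = max(candidates, key=score)  # may be worse than before
        tabu.append(current)
        if score(current) > score(best):
            best, stale = current, 0
        else:
            stale += 1
    return best


# Hypothetical landscape: index 2 is a local maximum, index 5 the global one
landscape = [0, 2, 5, 3, 4, 9, 1]


def line_neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


def height(i):
    return landscape[i]


print(tabu_search(0, line_neighbours, height))
```

A plain greedy search started at index 0 would stop at index 2; the Tabu list prevents stepping back to the states just visited, which forces the search across the dip and on to the higher peak.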
Figure 6. Illustration of BN Tabu list search strategy.
To improve the results further, we optimize the acquired network using Tabu search coupled with an ordering search strategy. It is a learning method that uses Tabu search in the space of orderings of the Bayesian network nodes (Teyssier & Koller, 2012). The search space is restricted by a fixed bound k on the number of parents per node. The method allows an exhaustive search with accurate results, provided that additional time is spent computing, in advance, a large set of sufficient statistics: one for each variable and each possible parent set. The cost is particularly high if the number of data instances is large, but it can be reduced with the characterization of our target node.
Ordering search takes much larger steps in the search space, avoiding many local maxima, using two local search operators: (i) flipping, the permutation of a pair of adjacent nodes, which traverses the space of orderings, and (ii) addition, deletion and reversal, the same set of operators as in Tabu search. We define the score of an ordering as the score of the best network consistent with it. Local scores are computed from the statistics associated with individual families. Given the scoring function, the task of the strategy is to find the ordering that maximizes the score.
In the discrete variable case, these statistics are simply the frequency counts of the instantiations of each family (a node and its parents). The algorithm chooses the parents of a node among the nodes that appear before it in the considered order and computes the MDL score.
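The scoring of an ordering can be sketched as below. This is a minimal illustration under assumptions: the MDL cost uses a common textbook form (negative log-likelihood plus (log N)/2 per free parameter), which may differ from the exact score computed by the learning tool, and the function names and the toy data set are hypothetical. Lower MDL cost is better.

```python
import math
from collections import Counter
from itertools import combinations

def mdl_family(data, child, parents):
    """MDL cost of one family (a node and its parents), from frequency counts."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    log_lik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())

    def arity(var):
        return len({row[var] for row in data})

    n_params = (arity(child) - 1) * math.prod(arity(p) for p in parents)
    return -log_lik + 0.5 * math.log(n) * n_params

def best_parents(data, child, predecessors, k):
    """Best parent set of size <= k among the nodes preceding `child`."""
    options = [ps for size in range(k + 1) for ps in combinations(predecessors, size)]
    return min(options, key=lambda ps: mdl_family(data, child, ps))

def score_ordering(data, order, k=2):
    """Cost of an ordering = cost of the best network consistent with it."""
    families = {v: best_parents(data, v, order[:i], k) for i, v in enumerate(order)}
    total = sum(mdl_family(data, v, ps) for v, ps in families.items())
    return total, families

def flip(order, i):
    """Flipping operator: swap the adjacent nodes at positions i and i + 1."""
    order = list(order)
    order[i], order[i + 1] = order[i + 1], order[i]
    return tuple(order)

# Hypothetical toy data: B copies A, and C is independent of both.
data = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1)] * 5

total, families = score_ordering(data, ("A", "B", "C"), k=2)
flipped_total, flipped_families = score_ordering(data, flip(("A", "B", "C"), 0), k=2)
```

On this toy data the ordering (A, B, C) selects A as the only parent of B and leaves C without parents, while the flipped ordering (B, A, C) selects B as the parent of A at the same total cost, since both networks belong to the same equivalence class.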
This combination of search algorithms, involving three types of search space, produces a final structure with the lowest MDL score, which is accepted for diagnosis purposes and further analysis. All BN models are learned and tested using a 10-fold cross validation strategy. The evaluation of BN performance is presented in Section 4.

MODELING AND ANALYSIS RESULTS
In this section, we begin by presenting the BN models at step-1 and step-2, as proposed in Section 3.1, along with the proofs of concept. This is followed by an analysis of the results. The identification and classification of potential symptoms from the database is the most difficult and complex task. It requires multidisciplinary expertise from the product, process, equipment and maintenance domains; therefore, a task force with the required expertise was formed. Brainstorming sessions resulted in the formalization of the well-known Ishikawa (a.k.a. Fishbone) diagram (Ishikawa & Loftus, 1990) to find potential symptoms from the product, process, equipment and maintenance areas. The results are presented in Figure 7.
Symptoms are classified along four axes: Product, Process, Equipment and Maintenance. The TT equipment is of batch cluster type; hence, it processes multiple lots in a given step. Therefore, current/previous product combinations might influence the product quality. The number of reworks, the wait time before process and the defect distribution from previous steps are also identified as key product symptoms linked with product quality drift. The process capability (Cp) and the process capability index (Cpk) are the key process symptoms. It is also identified that not only the current recipe but also the previous recipe and their respective process step combinations could be strongly linked with product quality. The FDC sensor signals from the equipment database are not directly considered; however, decisional information based on these signals is a good candidate for potential symptoms. The key symptoms from the equipment database are the equipment capability (Cm) and the equipment capability index (Cmk); overall equipment efficiency (OEE) indicators and counters are also included as additional symptoms. The counters are the meters associated with equipment modules (process chambers and mainframe), used for triggering preventive maintenance actions. The last category of symptoms is maintenance, where reliability, availability and maintenance (RAM) indicators and failure indicators are identified as the key symptoms. The data is collected for these symptoms against product quality drifts.
The data for OEE, RAM, process and equipment capability, and failure indicators are aggregated on a weekly basis, whereas the rest of the data is instantaneous for a given product and process step.
The structure of the BN to identify potential failure modes (FM) is learned with BayesiaLab, using only the symptoms classified in Figure 7. The model is presented in Figure 8, where FMs are modeled as a function of the symptoms. The symptoms in this model are grouped in four categories, differentiated by color. The green, pink, yellow and light brown colors represent Product, Process, Equipment and Maintenance related symptoms, respectively, whereas the failure mode is the target node. The objective of showing this graph (Figure 8) is to present the complexity of the resulting network.
Figure 8. BN model for FM identification.
The FM identification model, presented in the previous section, is the first step towards reducing unscheduled equipment breakdowns. It is complemented by the diagnosis of failures and causes through BN models developed at module and equipment levels, using the data on failure(s) and cause(s) (LPCVD process equipment) collected from the reputed semiconductor manufacturer (Sections 3.2 and 3.3). For the proof of concept, we have used three modules: (i) Reactor1, (ii) Reactor2 and (iii) Mainframe.
The symptoms from the FM identification model (Section 4.1) plus the failures and causes from the module level BNs (Section 4.2) are used to develop these BN models. For each model, the target nodes Failure Code1 and Failure Code2 are modeled as a function of these symptoms; however, causes are also allowed to be directed from these symptoms.
The results from the three module level BN models (Reactor1, Reactor2 and Mainframe) are presented in Figure 11(a), (b) and (c), respectively. The color scheme for symptom classes is the same as presented in Section 4.1.1, whereas causes and failure codes are added as nodes with orange and blue colors, respectively. The unconnected nodes in these BN models were found to have zero influence on either failures or causes.
The proofs of concept for Reactor1 and Reactor2 are presented in Figure 12. Only a few chosen symptoms are shown for visualization purposes, while the diagnosis of module failures and causes made by the BN model is presented as a function of the symptoms (in green and dark-orange frames in the right column). In the BN model, we have presented the key symptoms having a direct influence on causes and failures. It is also observed from the proof of concept (Figure 13) that, for the given symptoms, all modules have a 33.33% probability of occurrence, which confirms the added confusion.
Figure 13. Proof of concept from equipment level BN.

FM Identification (Step-1)
A set of precision and reliability matrices of the BN model for FM identification (refer to Section 4.1.1) is presented in Table 1. Precision is the ratio of correctly predicted cases to the total number of actual cases of the corresponding FM (column), and reliability is the ratio of correctly predicted cases to the total number of predictions of that FM (row). These tables display the results from one of the FM predictions based on the 10-fold cross validation strategy, and the results are summarized in Figure 14.
Figure 15 shows a synthesis of the prediction accuracy for FM Product with a receiver operating characteristic (ROC) curve, a graph that plots the true positive rate (y-axis) against the false positive rate (x-axis). The ROC index is the surface under the ROC curve divided by the total surface; in this graph it represents a 99.66% average accuracy with 0.34% false positive predictions.
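The precision and reliability definitions used for Table 1 can be made concrete with a short sketch. The function name and the counts below are hypothetical and do not come from the study; the matrix layout (rows as predictions, columns as actual cases) follows the row/column convention stated in the text.

```python
def precision_reliability(confusion, labels):
    """confusion[i][j]: cases predicted as labels[i] whose actual class is labels[j]."""
    result = {}
    for i, label in enumerate(labels):
        correct = confusion[i][i]
        predicted_total = sum(confusion[i])              # row total
        actual_total = sum(row[i] for row in confusion)  # column total
        result[label] = {
            "reliability": correct / predicted_total if predicted_total else 0.0,
            "precision": correct / actual_total if actual_total else 0.0,
        }
    return result

# Hypothetical counts for two failure modes (illustration only).
labels = ["Product", "Process"]
confusion = [
    [40, 10],  # predicted Product: 40 actual Product, 10 actual Process
    [0, 50],   # predicted Process: 0 actual Product, 50 actual Process
]
metrics = precision_reliability(confusion, labels)
```

In this toy matrix, every actual Product case is found (precision 40/40), but only 40 of the 50 Product predictions are correct (reliability 40/50).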
Figure 15. Prediction accuracy with ROC curves for FM Product.
The capability of the FM Product identification model is presented with gain curves in Figure 16. The x-axis represents the rate of individual cases taken into account for prediction, whereas the y-axis represents the rate at which they are accurately predicted with the target failure mode. In the figure, the blue curve represents the gain curve of prediction using a random policy and the red one using the optimal policy. The figure illustrates that, with the optimal policy, choosing 26% of the individuals is enough to capture 100% of the individuals with the target variable. The data mining Gini index for cross validation represents the gain over a random model and is computed as the surface between the red curve and the blue curve divided by the surface above the blue curve. The relative Gini index is computed by dividing the area of the triangle formed by the crossing of the red, blue and black dotted lines by the area of the yellow line triangle. The yellow curve is the curve that enables us to determine the percentage of individuals giving identical values of the relative Gini index and the ROC index. The indices for all failure modes are presented in Table 2. It is observed that the FM identification capability is higher for product and process than for equipment and maintenance.
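A gain curve and the associated Gini indices can be sketched as follows. The formulas are commonly used definitions and are an assumption here: the Gini index is taken as the area between the model gain curve and the random diagonal divided by the area above the diagonal, and the relative Gini index as the same area divided by the corresponding area for the optimal policy. They may differ in detail from what the learning tool reports, and the labels and scores are hypothetical.

```python
def gain_curve(labels, scores):
    """Share of target cases captured as cases are taken by decreasing score."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    total, captured, curve = sum(ranked), 0, [0.0]
    for y in ranked:
        captured += y
        curve.append(captured / total)
    return curve

def area(curve):
    """Trapezoidal area under a curve sampled at evenly spaced x in [0, 1]."""
    step = 1.0 / (len(curve) - 1)
    return sum((a + b) / 2 * step for a, b in zip(curve, curve[1:]))

def gini_indices(labels, scores):
    model = area(gain_curve(labels, scores))
    optimal = area(gain_curve(labels, labels))  # perfect ranking: targets first
    random_area = 0.5                           # diagonal of the unit square
    gini = (model - random_area) / (1.0 - random_area)
    relative_gini = (model - random_area) / (optimal - random_area)
    return gini, relative_gini

# Hypothetical cases: 1 marks the target failure mode, scores are model outputs.
labels = [1, 0, 0, 1]
scores = [0.9, 0.8, 0.2, 0.7]
curve = gain_curve(labels, scores)
gini, relative_gini = gini_indices(labels, scores)
```

A model that ranks every target case first coincides with the optimal policy and has a relative Gini index of 1, while a random ranking gives a value close to 0.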

Module Level and Equipment Level Failures and Causes Diagnosis (Step-2)
An example of the precision and reliability matrices for Reactor2, with type of failure as target, is presented in Table 3, emphasizing the high precision and reliability of the module level prediction performances. The prediction capabilities with selected failures from each of the three module models are presented in Figure 17. The relative Gini index (GI) results, together with the ROC index, show that the learned models have high accuracy. The prediction capabilities are also very high in terms of Gini indices.

DISCUSSION AND PERSPECTIVES
The above results suggest that misdiagnosis is one of the key reasons for the increase in unscheduled equipment breakdowns. This is because existing failure diagnosis approaches model equipment as a single unit and use FDC sensor data that is subject to reliability issues. This sensor variability can easily trigger misdiagnosis. The approaches proposed so far for failure diagnosis assume that product quality drifts are only due to equipment failures; however, in actual practice, causes can also be traced to maintenance, product or process. Besides this, in the SI, equipment is composed of multiple modules that share symptoms, failures and causes.
In the proposed methodology, we first modeled the failure modes against product drifts as a function of symptoms. It is the first step towards reducing unscheduled breakdowns. Then failure and cause diagnosis is modeled at the module level. An equipment level BN model is also learned in the same way and is found to be less accurate than the module level BNs. This provides clear evidence that failure-cause diagnosis should be modeled at module level, which produces more accurate results when used with data other than FDC in high-mix low-volume production lines.
The next step is to use the developed BN models with FDC sensor data as complementary indicators when the BN model for FM identification gives equal probability to all failure modes (Product, Process, Equipment and Maintenance). The BN models developed in this paper as a proof of concept are static in nature; however, the real advantage lies in using these models as an inference tool for real-time failure prognosis. In recent efforts to adapt the re-  takes into account failure probability evolution to make maintenance decision.

Figure 9 highlights the nodes that have direct causalities with the target. The probability of each failure mode differs according to the values taken by these nodes. The proofs of concept are presented in Figures 10(a) and (b). The failure mode (pink background) is the result of inference given the observations of the highlighted symptoms (white background, with distinctly colored frames highlighting the different categories). It can be seen in Figure 10(a) that the BN identifies Product (64%) and Maintenance (36%) as the related failure modes. Hence, in this situation, maintenance personnel should not stop the equipment. Similarly, Figure 10(b) shows that maintenance is found to be the only reason for the given symptoms; hence, the BN model suggests stopping the equipment for further investigation of failures and causes.

Figure 9. Representative nodes for the proof of concept.

Figure 11. Module and equipment level BNs for failures and causes diagnosis.

Figure 14. Failure mode standard deviation and the box plot graph of true positive prediction performances.

Figure 17. Gain curves for module level BNs.

Figure 18. Gain and ROC curves for equipment level BN.

Table 1. Precision and reliability matrices of FM BN.

Table 2. Summary of indices for all failure modes.
Figure 16. Gain curves for FM Product.

Table 3. Precision and reliability matrices of Reactor2 BN.
However, the prediction accuracy of the equipment level BN model is quite low; it is presented in Figure 18 with gain and ROC curves. These results show a declining gain and an increasing false positive rate, which significantly reduce the diagnosis capability of the equipment level BN model. A box plot summary of precision and reliability, based on 10-fold cross validation for each type of failure, is presented in Figure 19.
4.2.3. Comparison of Accuracy for Module vs. Equipment Level Failure-Cause Diagnosis BN Models (Step-3)
The diagnosis accuracies of the module and equipment level BNs are presented in Figure 20. The accuracy is computed as the average of reliability and precision for each BN model. It shows that the module level BN has an overall prediction accuracy of almost 99.7%, in comparison to 54% for the equipment level model. The gain obtained in diagnosis with module level BNs is 45.7%, which is significant for reducing unscheduled equipment breakdowns. The likely reason for misdiagnosis by the equipment level BN model is the commonality of failures between different modules, which adds confusion. Hence, the BN models learned at module level offer more accuracy than equipment level BNs for failure-cause diagnosis.
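As a worked restatement of the comparison above, the accuracy measure and the reported gain can be written as follows. The symbols are introduced here only for illustration, and the gain is read as a difference in percentage points between the two reported accuracies.

```latex
\text{accuracy} = \frac{\text{reliability} + \text{precision}}{2},
\qquad
\text{gain} = \text{accuracy}_{\text{module}} - \text{accuracy}_{\text{equipment}}
            \approx 99.7\% - 54\% = 45.7 \text{ percentage points}
```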
Asma' Abu-Samah
Mrs. Asma' Abu-Samah is a PhD student at Grenoble GSCOP Laboratory. She completed her degree in Control System of Electrical Energy and then her master's in Industrial Process Automation at Grenoble University, in 2008 and 2010 respectively. Her research interests include equipment usage optimization and multi-criteria decision making.

Muhammad Kashif Shahzad
Dr. Shahzad completed his PhD at the University of Grenoble and STMicroelectronics in 2012. At present, he is working on the European research project INTEGRATE in collaboration with STMicroelectronics and the G-SCOP research lab. His research focuses on equipment usage optimization, production scheduling and engineering data management. He holds Bachelor's degrees in Mechanical Engineering and Computer Science from Pakistan and a Master's degree in industrial engineering from Grenoble INP with distinction. He has six years of professional experience in industrial information systems and databases, with special interests in data/information extraction and statistical modeling.

Eric Zamaï
Dr. Eric Zamaï was born in France in 1971. He received a Ph.D. with distinction in electrical engineering from the University of Toulouse, France, in 1997. He is currently an associate professor at the Grenoble Institute of Technology (Grenoble INP) and does his research at the Laboratory of Grenoble for Sciences of Conception, Optimisation and Production (G-SCOP). His research interests include supervision, diagnostic, prognostic, and management and control of production systems. He teaches in the Grenoble INP engineering school of ENSE3. His subjects are design of real-time systems, control and management, and Programmable Logic Controller programming. He is also the Director of a Technological Platform in Advanced Manufacturing: AIP Primeca DS (http://www.aip-primeca-ds.net/).

Stéphane Hubac
Stéphane Hubac is an Expert on yield enhancement and Fab productivity projects at STMicroelectronics. Since 1981, he has worked in many disciplines within the semiconductor industry, including manufacturing, memory device design, process and equipment engineering in lithography, dry etching and dielectric deposition, process control, Quality methods implementation and R&D. He joined the CR2 Alliance (Freescale, NxP, ST) in the initial phase of the project as a project manager responsible for the selection of 300mm plasma etching and dry stripping equipment, then for manufacturing and R&D ramp-up as an Area Manager (Etch, Strip, APC programs) and ISOTS audit supervisor for Fab qualification. His special interests include R&D on DFM methods, yield enhancement, productivity and process control.