ACE – Automating Causal Extraction: Leveraging Large Language Models for Bowtie Diagram Generation in Failure Analysis

Published Jul 3, 2026
Priyank Venkatesh, Jules Oudmans, Florian Zurfluh

Abstract

This paper investigates whether open-source, instruction-tuned large language models (LLMs) can automate the generation of Bowtie diagrams from Failure Mode and Effects Analysis (FMEA) documentation. Three pipelines are developed: Retrieval-Augmented Generation (RAG), Optical Character Recognition (OCR) based extraction, and a vision-enabled dual-LLM approach. Each is designed to handle both structured FMEA tables and unstructured narrative text. Three models (Mistral, Qwen-2.5, and LLaMA-3) are evaluated using Sobol sensitivity analysis, stochasticity experiments, and expert Likert scoring on narrative outputs. With strict schema-constrained prompts, models frequently achieve Node and Edge F1 scores above 0.8 on tabular data. Outputs were identical across repeated runs under fixed settings. Sobol analysis shows that prompt strictness and prompt type are the dominant drivers of Bowtie quality, whereas decoding parameters have a negligible effect. On unstructured narrative text, all models struggled, producing hallucinated nodes, incorrect role assignments, and diagrams that deviated from expert references. The results establish a working approach for automating Bowtie generation from FMEA tables and identify the specific obstacles to extending this to narrative sources.
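The Node and Edge F1 scores mentioned above can be illustrated with a minimal sketch. The abstract does not specify the matching scheme, so this example assumes exact set matching: a node counts as correct only if both its label and its Bowtie role agree with the reference, and an edge only if its source and target agree. The function name `f1_score`, the role names, and the example nodes and edges are hypothetical, chosen for illustration only.

```python
from typing import Hashable, Iterable


def f1_score(predicted: Iterable[Hashable], reference: Iterable[Hashable]) -> float:
    """Set-based F1 between predicted and reference items (exact match assumed)."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # both empty: treated as perfect agreement (a convention, not from the paper)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical expert reference: nodes are (label, role), edges are (source, target).
reference_nodes = {
    ("seal wear", "cause"),
    ("loss of containment", "top_event"),
    ("fluid leak", "consequence"),
}
reference_edges = {
    ("seal wear", "loss of containment"),
    ("loss of containment", "fluid leak"),
}

# Hypothetical model output with one wrong role and one hallucinated node.
predicted_nodes = {
    ("seal wear", "cause"),
    ("loss of containment", "top_event"),
    ("fluid leak", "cause"),       # incorrect role assignment
    ("operator error", "cause"),   # hallucinated node
}
predicted_edges = {
    ("seal wear", "loss of containment"),
    ("operator error", "loss of containment"),  # edge from the hallucinated node
}

node_f1 = f1_score(predicted_nodes, reference_nodes)
edge_f1 = f1_score(predicted_edges, reference_edges)
print(f"Node F1: {node_f1:.3f}, Edge F1: {edge_f1:.3f}")
```

Under this assumed scheme, a hallucinated node lowers precision and a wrong role assignment lowers both precision and recall, which is how the failure modes reported for narrative text would show up in the scores.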

How to Cite

Venkatesh, P., Oudmans, J., & Zurfluh, F. (2026). ACE – Automating Causal Extraction: Leveraging Large Language Models for Bowtie Diagram Generation in Failure Analysis. PHM Society European Conference, 9(1), 1–10. https://doi.org/10.36001/phme.2026.v9i1.4953

Keywords

Bowtie Diagrams, Large Language Models, FMEA, Prompt Engineering, RAG, Failure Analysis, Bowties

Section
Technical Papers