Diagnostics-LLaVA: A Visual Language Model for Domain-Specific Diagnostics of Equipment
Abstract
The recent advancements in large language models (LLMs) have opened horizons for conversational, assistant-based intelligent models capable of interpreting images and providing textual responses, also known as visual language models (VLMs). These models can assist equipment operators and maintenance technicians in complex Prognostics and Health Management (PHM) tasks such as fault diagnostics, root cause analysis, and repair recommendation. Significant open-source contributions have been made in the area of VLMs; however, models trained on general data perform poorly on complex tasks in specialized domains such as diagnostics and repair of industrial equipment. In this paper we therefore present our work on the development of Diagnostics-LLaVA, a VLM suited to interpreting images of specific industrial equipment that provides better responses than existing open-source models in PHM tasks such as fault diagnostics and repair recommendation. Diagnostics-LLaVA is based on the architecture of LLaVA, and we create one instance of it for the automotive repair domain, referred to as Automotive-LLaVA. We demonstrate that the proposed Automotive-LLaVA model outperforms state-of-the-art open-source visual language models such as mPLUG-Owl and LLaVA in both qualitative and quantitative experiments.
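To make the setup concrete, the short Python sketch below shows how a LLaVA-style VLM of the kind described above is typically queried with an equipment image and a diagnostic question, using the Hugging Face transformers implementation of LLaVA. This is an illustrative sketch only: the public llava-hf/llava-1.5-7b-hf checkpoint, the brake_rotor.jpg image path, and the prompt wording are assumptions for demonstration, not the Automotive-LLaVA model or data presented in this paper.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Public base checkpoint used as a stand-in for a domain-tuned Diagnostics-LLaVA model.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Pair an equipment image (hypothetical path) with a diagnostic question in the
# LLaVA-1.5 conversation format, where <image> marks the visual input.
image = Image.open("brake_rotor.jpg")
prompt = (
    "USER: <image>\n"
    "What fault is visible on this brake rotor, and what repair would you recommend?\n"
    "ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))

In this setup, swapping the base checkpoint for a domain-instruction-tuned one, as the paper does for automotive repair, changes only the model identifier; the inference interface stays the same.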
Keywords
Visual language model, Automotive, Large language model, Diagnostics, Prognostics, Repair, Troubleshooting
References
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., . . . others (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Dosovitskiy, A. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
freeasestudyguides. (2024). Hub bearing ring. Retrieved from https://www.freeasestudyguides.com/a3-manual-transmission-test.html
Guedes, G. B., & da Silva, A. E. A. (2021). Supervised learning approach for section title detection in PDF scientific articles. In Advances in Computational Intelligence: 20th Mexican International Conference on Artificial Intelligence, MICAI 2021, Mexico City, Mexico, October 25–30, 2021, Proceedings, Part I 20 (pp. 44–54).
He, J., Wang, Y., Wang, L., Lu, H., He, J.-Y., Lan, J.-P., . . . Xie, X. (2024). Multi-modal instruction tuned LLMs with fine-grained visual perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13980–13990).
Kolo, E. (2006). Does automotive service excellence (ASE) certification enhance job performance of automotive service technicians? (Unpublished doctoral dissertation). Virginia Polytechnic Institute and State University.
Kumar, A., & Starly, B. (2022). “FabNER”: Information extraction from manufacturing process science domain literature using named entity recognition. Journal of Intelligent Manufacturing, 33(8), 2393–2407.
Lai, Z., Bai, H., Zhang, H., Du, X., Shan, J., Yang, Y., . . . Cao, M. (2024). Empowering unsupervised domain adaptation with large-scale pre-trained vision-language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2691–2701).
Lee, J., Cha, S., Lee, Y., & Yang, C. (2024). Visual question answering instruction: Unlocking multimodal large language model to domain-specific visual multitasks. arXiv preprint arXiv:2402.08360.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., . . . others (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2024). Visual instruction tuning. Advances in Neural Information Processing Systems, 36.
Medeiros, T., Medeiros, M., Azevedo, M., Silva, M., Silva, I., & Costa, D. G. (2023). Analysis of language-model-powered chatbots for query resolution in PDF-based automotive manuals. Vehicles, 5(4), 1384–1399.
MIT. (2024). Understanding the visual knowledge of language models. https://news.mit.edu/2024/understanding-visual-knowledge-language-models-0617/. ([Online; accessed 19-June-2024])
motortrend. (2020). Engine cylinder block. Retrieved from https://www.motortrend.com/uploads/sites/21/2020/03/002-Difference-between-long-short-block.jpg
Park, S.-M., & Kim, Y.-G. (2023). Visual language integration: A survey and open challenges. Computer Science Review, 48, 100548.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., . . . others (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).
Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1–48.
thetimchannel. (2024). Brake worn. Retrieved from https://openverse.org/image/958dcf66-f298-4413-85a7-957cf8474742
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., . . . others (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Vidyaratne, L., Lee, X. Y., Kumar, A., Watanabe, T., Farahat, A., & Gupta, C. (2024). Generating troubleshooting trees for industrial equipment using large language models (LLM). In 2024 IEEE International Conference on Prognostics and Health Management (ICPHM) (pp. 116–125).
Wang, J., Liu, Z., Zhao, L., Wu, Z., Ma, C., Yu, S., . . . others (2023). Review of large vision models and visual prompt engineering. Meta-Radiology, 100047.
Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., . . . others (2023). mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.
Ye, Q., Xu, H., Ye, J., Yan, M., Hu, A., Liu, H., . . . Huang, F. (2024). mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13040–13051).
Yemaneab, T. (1997). Employers’ perceptions of automotive service excellence (ASE) certification benefits. University of Minnesota.
Zhang, J., Huang, J., Jin, S., & Lu, S. (2024). Vision language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Zhang, Y., Pan, J., Zhou, Y., Pan, R., & Chai, J. (2023). Grounding visual illusions in language: Do vision language models perceive illusions like humans? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 5718–5728).
Zhang, Y., Zhang, R., Gu, J., Zhou, Y., Lipka, N., Yang, D., & Sun, T. (2023). LLaVAR: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107.
Zhao, R., Chen, H., Wang, W., Jiao, F., Do, X. L., Qin, C., . . . others (2023). Retrieving multimodal information for augmented generation: A survey. arXiv preprint arXiv:2303.10868.
Zhao, X., Li, X., Duan, H., Huang, H., Li, Y., Chen, K., & Yang, H. (2024). MG-LLaVA: Towards multi-granularity visual instruction tuning.