Diagnostics-LLaVA: A Visual Language Model for Domain-Specific Diagnostics of Equipment


Published Nov 5, 2024
Aman Kumar, Mahbubul Alam, Ahmed Farahat, Maheshjabu Somineni, Chetan Gupta

Abstract

Recent advances in large language models (LLMs) have opened new horizons for conversational assistants capable of interpreting images and providing textual responses, known as visual language models (VLMs). These models can assist equipment operators and maintenance technicians with complex Prognostics and Health Management (PHM) tasks such as fault diagnostics, root cause analysis, and repair recommendation. Although significant open-source contributions have been made in the area of VLMs, models trained on general data perform poorly on complex tasks in specialized domains such as the diagnostics and repair of industrial equipment. In this paper, we therefore present our work on Diagnostics-LLaVA, a VLM suited to interpreting images of specific industrial equipment that provides better responses than existing open-source models on PHM tasks such as fault diagnostics and repair recommendation. Diagnostics-LLaVA is based on the LLaVA architecture, and we create one instance of it for the automotive repair domain, referred to as Automotive-LLaVA. We demonstrate that the proposed Automotive-LLaVA outperforms state-of-the-art open-source visual language models such as mPLUG-Owl and LLaVA in both qualitative and quantitative experiments.
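To make the setup concrete, the following is a minimal, illustrative sketch of how a LLaVA-style checkpoint could be queried with an equipment image and a diagnostic question. It assumes the Hugging Face Transformers LLaVA interface; the checkpoint name, image URL, and prompt below are hypothetical placeholders, not artifacts released with this paper.

    # Illustrative sketch only: "org/automotive-llava" is a hypothetical
    # checkpoint name; substitute any LLaVA-style model from the Hugging Face Hub.
    import requests
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "org/automotive-llava"  # hypothetical fine-tuned checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id)

    # Placeholder image of the component under inspection.
    image = Image.open(
        requests.get("https://example.com/worn_brake_pad.jpg", stream=True).raw
    )

    # LLaVA-1.5-style prompt; the <image> token marks where the visual
    # features from the image encoder are injected into the text sequence.
    prompt = (
        "USER: <image>\nDiagnose the fault visible on this brake assembly "
        "and recommend a repair. ASSISTANT:"
    )

    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])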

How to Cite

Kumar, A., Alam, M., Farahat, A., Somineni, M., & Gupta, C. (2024). Diagnostics-LLaVA: A Visual Language Model for Domain-Specific Diagnostics of Equipment. Annual Conference of the PHM Society, 16(1). https://doi.org/10.36001/phmconf.2024.v16i1.4147


Keywords

Visual language model, Automotive, Large language model, Diagnostics, Prognostics, Repair, Troubleshooting

References
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., . . . others (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., . . . others (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

Dosovitskiy, A. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

freeasestudyguides. (2024). Hub bearing ring. Retrieved from https://www.freeasestudyguides.com/a3-manual-transmission-test.html

Guedes, G. B., & da Silva, A. E. A. (2021). Supervised learning approach for section title detection in PDF scientific articles. In Advances in Computational Intelligence: 20th Mexican International Conference on Artificial Intelligence, MICAI 2021, Mexico City, Mexico, October 25–30, 2021, Proceedings, Part I (pp. 44–54).

He, J., Wang, Y., Wang, L., Lu, H., He, J.-Y., Lan, J.-P., . . . Xie, X. (2024). Multi-modal instruction tuned LLMs with fine-grained visual perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13980–13990).

Kolo, E. (2006). Does Automotive Service Excellence (ASE) certification enhance job performance of automotive service technicians? (Unpublished doctoral dissertation). Virginia Polytechnic Institute and State University.

Kumar, A., & Starly, B. (2022). “FabNER”: Information extraction from manufacturing process science domain literature using named entity recognition. Journal of Intelligent Manufacturing, 33(8), 2393–2407.

Lai, Z., Bai, H., Zhang, H., Du, X., Shan, J., Yang, Y., . . . Cao, M. (2024). Empowering unsupervised domain adaptation with large-scale pre-trained vision-language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2691–2701).

Lee, J., Cha, S., Lee, Y., & Yang, C. (2024). Visual question answering instruction: Unlocking multimodal large language model to domain-specific visual multitasks. arXiv preprint arXiv:2402.08360.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., . . . others (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2024). Visual instruction tuning. Advances in Neural Information Processing Systems, 36.

Medeiros, T., Medeiros, M., Azevedo, M., Silva, M., Silva, I., & Costa, D. G. (2023). Analysis of language-model-powered chatbots for query resolution in PDF-based automotive manuals. Vehicles, 5(4), 1384–1399.

MIT. (2024). Understanding the visual knowledge of language models. https://news.mit.edu/2024/understanding-visual-knowledge-language-models-0617/. ([Online; accessed 19-June-2024])

motortrend. (2020). Engine cylinder block. Retrieved from https://www.motortrend.com/uploads/sites/21/2020/03/002-Difference-between-long-short-block.jpg

Park, S.-M., & Kim, Y.-G. (2023). Visual language integration: A survey and open challenges. Computer Science Review, 48, 100548.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., . . . others (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).

Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1–48.

thetimchannel. (2024). Brake worn. Retrieved from https://openverse.org/image/958dcf66-f298-4413-85a7-957cf8474742

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., . . . others (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Vidyaratne, L., Lee, X. Y., Kumar, A., Watanabe, T., Farahat, A., & Gupta, C. (2024). Generating troubleshooting trees for industrial equipment using large language models (LLM). In 2024 IEEE International Conference on Prognostics and Health Management (ICPHM) (pp. 116–125).

Wang, J., Liu, Z., Zhao, L., Wu, Z., Ma, C., Yu, S., . . . others (2023). Review of large vision models and visual prompt engineering. Meta-Radiology, 100047.

Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., . . . others (2023). mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.

Ye, Q., Xu, H., Ye, J., Yan, M., Hu, A., Liu, H., . . . Huang, F. (2024). mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13040–13051).

Yemaneab, T. (1997). Employers’ perceptions of automotive service excellence (ase) certification benefits. University of Minnesota.

Zhang, J., Huang, J., Jin, S., & Lu, S. (2024). Vision language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zhang, Y., Pan, J., Zhou, Y., Pan, R., & Chai, J. (2023). Grounding visual illusions in language: Do vision language models perceive illusions like humans? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 5718–5728).

Zhang, Y., Zhang, R., Gu, J., Zhou, Y., Lipka, N., Yang, D., & Sun, T. (2023). LLaVAR: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107.

Zhao, R., Chen, H., Wang, W., Jiao, F., Do, X. L., Qin, C., . . . others (2023). Retrieving multimodal information for augmented generation: A survey. arXiv preprint arXiv:2303.10868.

Zhao, X., Li, X., Duan, H., Huang, H., Li, Y., Chen, K., & Yang, H. (2024). MG-LLaVA: Towards multi-granularity visual instruction tuning.
Section
Industry Experience Papers
