From Engineering Drawings to Assembly Instructions: A Vision and Language Model Approach


Published Jan 13, 2026
Shokhikha Amalana Murdivien, Minji Kim, Kyung Wan Choi, Jumyung Um

Abstract

Engineering drawings such as CAD draft sheets are widely used in manufacturing to document product structure, part geometry, and dimensional specifications. While these documents contain valuable information, they are not typically organized to support step-by-step assembly tasks, which can present challenges for non-expert technicians during installation, maintenance, or repair. This paper presents a system that automatically generates structured, human-readable assembly instructions from CAD drafts by combining a vision model, an OCR model, and a language model. The vision model, trained on a synthetic dataset constructed to address the lack of publicly available CAD annotations, detected mechanical components on real CAD sheets with an average precision of 95.2%, while the OCR model successfully extracted dimensional information. These outputs, together with existing description text, were processed by a language model to produce clear and interpretable assembly steps. The results demonstrate that the proposed system improves the interpretability and usability of engineering documentation in assembly-related tasks.
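The three-stage pipeline described above can be sketched schematically as follows. This is an illustrative outline only, not the authors' implementation: the model calls are stubbed, and every function name, class label, and dimension string is hypothetical. In practice, `detect_components` would wrap a trained object detector, `read_dimensions` an OCR engine, and `compose_instructions` a language model rather than a template.

```python
# Schematic sketch of the abstract's pipeline: vision model -> OCR -> language
# model. All model calls are stubbed with fixed illustrative values.

from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # detected component class, e.g. "bolt"
    box: tuple    # bounding box (x1, y1, x2, y2) in sheet pixels


def detect_components(sheet) -> list[Detection]:
    """Stub for the vision model trained on synthetic CAD sheets."""
    return [Detection("bracket", (10, 10, 120, 90)),
            Detection("bolt", (40, 95, 60, 130))]


def read_dimensions(sheet) -> dict[str, str]:
    """Stub for the OCR stage that extracts dimensional callouts."""
    return {"bolt": "M6 x 20 mm"}


def compose_instructions(dets: list[Detection],
                         dims: dict[str, str]) -> list[str]:
    """Stub for the language-model stage; here a simple template
    stands in for LLM-generated assembly steps."""
    steps = []
    for i, d in enumerate(dets, start=1):
        dim = dims.get(d.label, "no dimension found")
        steps.append(f"Step {i}: Install the {d.label} ({dim}).")
    return steps


sheet = None  # placeholder for a rasterized CAD draft sheet
steps = compose_instructions(detect_components(sheet), read_dimensions(sheet))
for s in steps:
    print(s)
```

The key design point the sketch illustrates is that detections and dimension strings are joined on the component label before the text-generation stage, so each assembly step can reference both what a part is and how it is specified.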



Keywords

Disassembly Guidance, Automated Maintenance Documentation, Maintenance Support Systems, Smart Manufacturing, Generative AI

Section
Regular Session Papers