Domain Adaptation of Automatic Speech Recognition Models for Diagnostic Applications

Published Jul 3, 2026
Aman Kumar, Ahmed Farahat, Huimin Zhuge, Chetan Gupta

Abstract

Automatic speech recognition (ASR), or speech-to-text (STT), is becoming an important interface for AI systems in diagnostic workflows, but general-purpose ASR models often degrade in specialized technical domains. In diagnostic applications such as fault identification, root cause analysis, and repair recommendation, general-purpose ASR systems struggle with domain-specific terminology, abbreviations, part identifiers, and measurement expressions, leading to elevated transcription errors. This work presents a domain adaptation pipeline that unifies three components: a synthetic benchmarking framework in which domain-specific technical text is converted to speech via text-to-speech (TTS) synthesis and transcribed by open-source ASR models to establish baseline performance; Low-Rank Adaptation (LoRA)-based fine-tuning of Whisper Large-v3 using those synthetic audio-text pairs; and transfer validation on curated real-world automotive YouTube recordings to assess generalization beyond synthetic conditions. Using automotive technical language as a representative diagnostic domain, a data-scaling study employing progressively larger subsets of in-domain training data evaluates performance on a held-out test set via word error rate (WER), character error rate (CER), normalized error metrics, alphanumeric error rate, semantic similarity, and Bidirectional Encoder Representations from Transformers Score (BERTScore). Results show consistent gains from lightweight domain adaptation on both held-out synthetic data and real-world recordings, confirming that synthetic data generation combined with LoRA-based fine-tuning is an effective and computationally practical strategy for improving ASR accuracy in specialized technical domains where labeled speech is scarce.
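
As an illustration of the two primary metrics named above, the following is a minimal sketch of WER and CER computed as Levenshtein edit distance divided by reference length. It is not the authors' evaluation code: the function names (`edit_distance`, `wer`, `cer`), the whitespace tokenization, and the example sentences are assumptions made for this sketch, and the paper's normalized and alphanumeric variants are not reproduced here because their definitions are not given in the abstract.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: simple whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example of a part identifier being spelled out by the ASR model.
    ref = "replace the o2 sensor on bank 1"
    hyp = "replace the oh two sensor on bank one"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

In the hypothetical example, spelling out "o2" and "1" costs two substitutions and one insertion against seven reference words, which shows how short alphanumeric tokens can dominate WER in technical text even when the rest of the sentence is transcribed correctly.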

How to Cite

Kumar, A., Farahat, A., Zhuge, H., & Gupta, C. (2026). Domain Adaptation of Automatic Speech Recognition Models for Diagnostic Applications. PHM Society European Conference, 9(1), 1–12. https://doi.org/10.36001/phme.2026.v9i1.5036

Keywords

domain-specific models, speech recognition model, automatic speech recognition, LLM, fine-tuning, diagnostics

Section
Technical Papers