Domain Adaptation of Automatic Speech Recognition Models for Diagnostic Applications
Abstract
Automatic speech recognition (ASR), also called speech-to-text (STT), is becoming an important interface for AI systems in diagnostic workflows, but general-purpose ASR models often degrade in specialized technical domains. In diagnostic applications such as fault identification, root cause analysis, and repair recommendation, these models struggle with domain-specific terminology, abbreviations, part identifiers, and measurement expressions, which raises transcription error rates. This work presents a domain adaptation pipeline that combines three components: (1) a synthetic benchmarking framework in which domain-specific technical text is converted to speech via text-to-speech (TTS) synthesis and transcribed by open-source ASR models to establish baseline performance; (2) Low-Rank Adaptation (LoRA)-based fine-tuning of Whisper Large-v3 on those synthetic audio-text pairs; and (3) transfer validation on curated real-world automotive YouTube recordings to assess generalization beyond synthetic conditions. Using automotive technical language as a representative diagnostic domain, a data-scaling study uses progressively larger subsets of in-domain training data and evaluates performance on a held-out test set using word error rate (WER), character error rate (CER), normalized error metrics, alphanumeric error rate, semantic similarity, and BERTScore. Results show consistent gains from lightweight domain adaptation on both held-out synthetic data and real-world recordings, confirming that synthetic data generation combined with LoRA-based fine-tuning is an effective and computationally practical strategy for improving ASR accuracy in specialized technical domains where labeled speech is scarce.
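To make the two headline metrics concrete, the following is a minimal sketch of how WER and CER are conventionally computed from the Levenshtein edit distance between a reference transcript and an ASR hypothesis. It is not the authors' evaluation code: the function names and the example sentence are illustrative assumptions, and no text normalization is applied, so it does not reproduce the normalized or alphanumeric variants named in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`. Works on lists of
    words (for WER) and on strings of characters (for CER).
    """
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: a part identifier and a bank number are
# transcribed as spelled-out words, a typical domain-specific failure.
reference = "replace the o2 sensor on bank 1"
hypothesis = "replace the oh two sensor on bank one"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this example only two short tokens are wrong, yet three of the seven reference words count as errors, which illustrates why alphanumeric identifiers can dominate WER in technical transcripts.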
Keywords: domain-specific models, speech recognition model, automatic speech recognition, LLM, fine-tuning, diagnostics

This work is licensed under a Creative Commons Attribution 3.0 Unported License.
The Prognostics and Health Management Society advocates open access to scientific data and uses a Creative Commons license for publishing and distributing any papers. A Creative Commons license does not relinquish the author's copyright; rather, it allows authors to share some of their rights with any member of the public under certain conditions while retaining full legal protection. By submitting an article to the International Conference of the Prognostics and Health Management Society, the authors agree to be bound by the associated terms and conditions, including the following:
As the author, you retain the copyright to your Work. By submitting your Work, you are granting anybody the right to copy, distribute and transmit your Work and to adapt your Work with proper attribution under the terms of the Creative Commons Attribution 3.0 United States license. You assign rights to the Prognostics and Health Management Society to publish and disseminate your Work through electronic and print media if it is accepted for publication. A license note citing the Creative Commons Attribution 3.0 United States License as shown below needs to be placed in the footnote on the first page of the article.
First Author et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.