Submodular data selection in ASR language modeling
Given the vast amount of textual data available today, an efficient methodology for filtering and selecting the important and relevant portions of this data is highly beneficial for improving natural language and speech processing systems. Although very large language models have been the industry norm in production automatic speech recognition (ASR) systems, the focus is now shifting towards efficient ways to generate and use personalized and adapted language models, which have been shown to improve the end-user experience. Submodular methods have achieved great success in several domains, including acoustic modeling, text summarization, and machine translation. They provide a natural way to select high-quality, relevant data from an out-of-domain source for use in domain adaptation and personalization. In this work, we model data selection for language modeling as a submodular function optimization problem. Our results show that submodular data selection allowed us to train better language models with less data. We were also able to reduce the end-to-end word error rate of the ASR system by 7% by selecting data from a completely different domain.
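To make the idea of data selection as submodular optimization concrete, the following is a minimal sketch of greedy selection under a sentence budget. It is not the method used in this work: the objective (a feature-based function that sums weighted square roots of unigram counts), the weighting by in-domain word frequency, and the names `unigram_features` and `greedy_select` are all assumptions chosen for illustration. The square root gives diminishing returns for repeated words, which is what makes this objective submodular.

```python
import math
from collections import Counter


def unigram_features(sentence):
    """Hypothetical feature extractor: lowercase whitespace-token counts."""
    return Counter(sentence.lower().split())


def greedy_select(candidates, in_domain, budget):
    """Greedily pick up to `budget` candidate sentences.

    Assumed objective: f(S) = sum_w weight[w] * sqrt(count of w in S),
    where weight[w] is the relative frequency of w in the in-domain sample.
    """
    # Weight each word by its relative frequency in the in-domain text.
    target = Counter()
    for sentence in in_domain:
        target.update(unigram_features(sentence))
    total = sum(target.values())
    weights = {w: c / total for w, c in target.items()}

    feats = [unigram_features(s) for s in candidates]
    totals = Counter()  # accumulated word counts of the selected set
    selected = []
    remaining = set(range(len(candidates)))

    while remaining and len(selected) < budget:
        best_i, best_gain = None, 0.0
        for i in sorted(remaining):
            # Marginal gain of adding sentence i to the current selection.
            gain = sum(
                weights.get(w, 0.0)
                * (math.sqrt(totals[w] + c) - math.sqrt(totals[w]))
                for w, c in feats[i].items()
            )
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no candidate adds any value
            break
        selected.append(best_i)
        totals.update(feats[best_i])
        remaining.remove(best_i)

    return [candidates[i] for i in selected]
```

In this sketch, a sentence that shares no vocabulary with the in-domain sample has zero marginal gain and is never selected, while a sentence that repeats already-covered words contributes progressively less. A practical system would likely use richer features (for example, higher-order n-grams) and a lazy greedy variant for speed; both are left out here to keep the example short.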
- Linguistics