Speaker dynamics as a source of pronunciation variability for continuous speech recognition models
A significant source of variation in spontaneous speech is intra-speaker pronunciation change. Previous work has identified several factors related to pronunciation variability, such as phonetic context and speaking rate, that are useful to model in automatic speech recognition. This work examines new higher-level information sources (syntax, discourse structure, and prosody), specifically the relationship between these factors and pronunciation variation as seen in reduction and hyper-articulation. The key contributions of this work are (1) an analysis of high-level factors that provides new cues for predicting pronunciation variation, (2) a framework for including dynamic pronunciation models in automatic speech recognition (ASR) systems, and (3) an analysis of feature-based pronunciation models with suggestions for their incorporation into ASR systems.

Key findings from the analysis of high-level factors are the attributes most useful for predicting variability: part-of-speech (POS) of the target word and neighboring words, location of the word in the utterance, the number of F0 slope changes within the word, word duration, and average word energy. Pronunciation prediction experiments show a 2.3% relative reduction in phone error rate, with similar reductions in perplexity, over a baseline model that uses only phonetic context.

Incorporating higher-level information (such as hypothesis-dependent word context or word-level F0 values) into ASR systems requires a rescoring approach. A framework for this is presented, with recognition results for various types of pronunciation models on the Switchboard task.
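The rescoring idea can be illustrated with a minimal sketch: first-pass hypotheses are re-scored with a pronunciation model whose score may depend on the whole hypothesized word sequence. All names, the stub scorer, and the interpolation weight below are assumptions for illustration, not the system described in this work.

```python
def rescore(nbest, pron_model, weight=0.5):
    """Pick the hypothesis maximizing first-pass score plus weighted
    pronunciation-model score. `nbest` is a list of
    (word_sequence, first_pass_log_score) pairs."""
    rescored = []
    for words, first_pass_score in nbest:
        # A dynamic pronunciation model can condition on hypothesis-level
        # context (e.g. POS of neighboring words, word-level F0), which is
        # why it must be applied in a second rescoring pass.
        pron_score = pron_model(words)
        rescored.append((words, first_pass_score + weight * pron_score))
    return max(rescored, key=lambda h: h[1])

# Toy usage with a stand-in pronunciation scorer (purely illustrative).
nbest = [(["i", "am"], -10.0), (["i", "m"], -9.5)]
stub_model = lambda words: -1.0 * len(words)
best = rescore(nbest, stub_model)
print(best[0])  # ['i', 'm']
```

The point of the sketch is only the two-pass structure: the first pass fixes the hypothesis space, and the pronunciation model re-ranks it using information unavailable during the first pass.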
We obtain a small but statistically significant improvement in recognition performance with a baseline static model using phonetic context, but no significant gains from extending this model to incorporate POS-dependent pronunciations.

We also present a phonetic-feature-based prediction model in which phones are represented by a vector of 21 symbolic features, each of which can be on, off, unspecified, or unused. Feature changes are predicted rather than phone changes, allowing for varying productions of phones, e.g., nasalized vowels. We studied feature interaction by examining different groupings of dependent features and showed that a hierarchical grouping with conditional dependencies leads to lower perplexity. We find that feature-based models are more efficient than phone-based models in the sense of requiring fewer parameters to predict variation, while giving a smaller distance to the hand-labeled form and similar perplexity values.
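The feature-vector representation can be sketched as follows. The feature names and the subset shown are assumptions for illustration; the actual 21-feature inventory is not reproduced here. The four symbolic values (on, off, unspecified, unused) are taken from the description above.

```python
from enum import Enum

class FeatVal(Enum):
    ON = "+"
    OFF = "-"
    UNSPECIFIED = "?"
    UNUSED = "0"

# Illustrative subset of symbolic features (hypothetical names, not the
# thesis inventory, which contains 21 features).
FEATURES = ["voiced", "nasal", "vocalic", "round", "high"]

def phone_vector(**values):
    """Build a feature vector, defaulting unset features to UNUSED."""
    return {f: values.get(f, FeatVal.UNUSED) for f in FEATURES}

# A canonical oral vowel vs. a nasalized surface realization: predicting
# one feature change (nasal OFF -> ON) captures a variant for which a
# phone-level model would need a separate phone symbol.
canonical = phone_vector(voiced=FeatVal.ON, nasal=FeatVal.OFF, vocalic=FeatVal.ON)
nasalized = dict(canonical, nasal=FeatVal.ON)

changed = [f for f in FEATURES if canonical[f] != nasalized[f]]
print(changed)  # ['nasal']
```

This illustrates why feature-based prediction needs fewer parameters: many surface variants differ from the canonical form in only one or two feature values, so the model predicts a small set of feature changes rather than a substitution over the full phone inventory.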
- Electrical engineering