Separating segmental and prosodic contributions to intelligibility
Abstract
It is well known that the intelligibility of speech can vary both across individuals within styles or tasks, and within individuals across styles or tasks. Various properties of the speech signal have been shown to correlate with such differences in intelligibility, including speech rate [5,7,8], segmental reduction or deletion [1], vowel space size [1,2,4,6], pitch range [2], and pitch accent deletion [3]. However, these dimensions are rarely (if ever) manipulated independently in natural speech. This poses a challenge to understanding the sources of individual differences in intelligibility (both across individuals and across styles), and makes it difficult to know whether any particular dimension measured causes speech to be more or less intelligible, or merely indexes some other aspect of speech that is responsible for intelligibility differences.
As an alternative to measuring fine-grained dimensions of the speech signal, this research makes a broad distinction between prosodic dimensions (pitch, intensity, and duration) on the one hand, and segmental content on the other. Through careful resynthesis, a corpus of parallel sentences is created that effectively holds constant either prosody or segmental content across resynthesized “talkers”. High-quality stimuli are achieved by hand-correction of glottal pulse epochs and semi-automated hand segmentation of syllable durations, followed by automated dynamic time warping of durations and swapping of pitch and intensity contours.
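The duration-mapping step can be illustrated with a minimal sketch. It assumes that each sentence has already been segmented into syllables and that two talkers' versions of a sentence have the same number of syllables; it then maps time points from one talker's timeline onto the other's by linear interpolation within each syllable. This is a simplified stand-in for the dynamic time warping described above, not the procedure used in the study, and the names `warp_time` and `retime_contour` are hypothetical.

```python
from bisect import bisect_right


def warp_time(t, src_bounds, dst_bounds):
    """Map time t (seconds) from the source timeline to the destination
    timeline, interpolating linearly within each syllable.

    src_bounds and dst_bounds are syllable boundary times for the same
    sentence spoken by two talkers (assumed: same number of syllables).
    """
    if len(src_bounds) != len(dst_bounds) or len(src_bounds) < 2:
        raise ValueError("need matching boundary lists with at least 2 entries")
    if any(b <= a for a, b in zip(src_bounds, src_bounds[1:])):
        raise ValueError("source boundaries must be strictly increasing")
    # Clamp t to the span covered by the source boundaries.
    t = min(max(t, src_bounds[0]), src_bounds[-1])
    # Index of the syllable containing t (the final boundary belongs
    # to the last syllable).
    i = min(bisect_right(src_bounds, t) - 1, len(src_bounds) - 2)
    frac = (t - src_bounds[i]) / (src_bounds[i + 1] - src_bounds[i])
    return dst_bounds[i] + frac * (dst_bounds[i + 1] - dst_bounds[i])


def retime_contour(contour, src_bounds, dst_bounds):
    """Retime a contour given as (time, value) pairs, e.g. pitch in Hz,
    so that its syllable boundaries line up with dst_bounds."""
    return [(warp_time(t, src_bounds, dst_bounds), v) for t, v in contour]


# Illustrative values only: two syllables, the second talker twice as slow.
src = [0.0, 0.25, 0.5]
dst = [0.0, 0.5, 1.0]
pitch = [(0.125, 200.0), (0.375, 180.0)]
retimed = retime_contour(pitch, src, dst)
```

Once the two timelines are aligned in this way, a pitch or intensity contour from one talker can be laid over the other talker's segments syllable by syllable.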
Results from a speech-in-noise task with both unmodified and resynthesized stimuli show that talkers with low intrinsic intelligibility may have relatively “good” prosody, evidenced by improvements in intelligibility when their prosody is mapped onto other talkers’ waveforms. In contrast, talkers with high intrinsic intelligibility may have relatively “bad” prosody, evidenced by reduced intelligibility when their prosody is mapped onto other talkers’ waveforms. A linear mixed-effects regression model (controlling for signal processing distortion and variation in sentence difficulty) supports this view: the coefficients for “prosodic donor” and “segmental donor” rank talkers differently than the overall intelligibility scores for unmodified talkers do. Comparison of these patterns with post-hoc acoustic analyses of the stimuli allows acoustic predictors to be classified according to how well they correlate with the “prosodic donor” or “segmental donor” coefficient patterns.
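One way such a model might be written is sketched below. The abstract does not give the model specification, so every symbol here is an assumption made for illustration: $y_i$ is the intelligibility score for stimulus $i$, $p(i)$ and $s(i)$ index the talkers who supplied its prosody and its segmental content (the same talker for unmodified stimuli), $r_i$ is 1 for resynthesized stimuli and 0 otherwise, and $u_{\mathrm{sent}(i)}$ is a random intercept for the sentence.

```latex
% Illustrative form only; the actual model terms are not stated in the abstract.
y_i = \beta_0
    + \beta^{\mathrm{pros}}_{p(i)}   % prosodic donor effect
    + \beta^{\mathrm{seg}}_{s(i)}    % segmental donor effect
    + \beta^{\mathrm{resyn}} r_i     % signal processing distortion
    + u_{\mathrm{sent}(i)}           % sentence difficulty
    + \varepsilon_i,
\qquad
u_{\mathrm{sent}(i)} \sim \mathcal{N}(0, \sigma_u^2),
\quad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Under this form, the talker rankings implied by the $\beta^{\mathrm{pros}}$ and $\beta^{\mathrm{seg}}$ coefficients can be compared separately with the ranking of unmodified talkers by overall intelligibility.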
References
[1] Bond, Z. S., & Moore, T. J. (1994). A note on the acoustic-phonetic characteristics of inadvertently clear speech. Speech Communication, 14(4), 325–337. doi: 10.1016/0167-6393(94)90026-4.
[2] Bradlow, A. R., Torretta, G. M., & Pisoni, D. B. (1996). Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20(3-4), 255–272. doi: 10.1016/S0167-6393(96)00063-5.
[3] Clopper, C. G., & Smiljanić, R. (2011). Effects of gender and regional dialect on prosodic patterns in American English. Journal of Phonetics, 39(2), 237–245. doi: 10.1016/j.wocn.2011.02.006.
[4] Hazan, V., & Markham, D. (2004). Acoustic-phonetic correlates of talker intelligibility for adults and children. The Journal of the Acoustical Society of America, 116(5), 3108–3118. doi: 10.1121/1.1806826.
[5] Mayo, C., Aubanel, V., & Cooke, M. (2012). Effect of prosodic changes on speech intelligibility. Paper presented at the 13th Annual Conference of the International Speech Communication Association (INTERSPEECH 2012). url: http://interspeech2012.org/accepted-abstract.html?id=661
[6] Neel, A. T. (2008). Vowel space characteristics and vowel identification accuracy. Journal of Speech, Language, and Hearing Research, 51(3), 574–585. doi: 10.1044/1092-4388(2008/041).
[7] Sommers, M. S., Nygaard, L. C., & Pisoni, D. B. (1994). Stimulus variability and spoken word recognition I: Effects of variability in speaking rate and overall amplitude. The Journal of the Acoustical Society of America, 96(3), 1314–1324. doi: 10.1121/1.411453.
[8] Tolhurst, G. C. (1957). Effects of duration and articulation changes on intelligibility, word reception and listener preference. Journal of Speech and Hearing Disorders, 22(3), 328–334.