Modeling linguistic segment and turn boundaries for N-best rescoring of spontaneous speech.

Author: Andreas Stolcke

Abstract: Language modeling, especially for spontaneous speech, often suffers from a mismatch of utterance segmentations between training and test conditions. In particular, training uses linguistically-based segments, whereas testing occurs on acoustically determined segments, resulting in degraded performance. We present an N-best rescoring algorithm that removes the effect of segmentation mismatch. Furthermore, we show that explicit language modeling of hidden linguistic segment boundaries is improved by including turn-boundary events in the model.

1. THE SEGMENTATION PROBLEM IN LANGUAGE MODELING

One of the problems encountered in speech recognition is that of continuous, long waveforms. Because current recognizers prefer short waveform segments for best performance and to limit computational resources, conversation-length waveforms are typically pre-segmented using simple acoustic criteria, such as the locations of pauses and turn switches. This creates several problems for language modeling.

The segmentation used (including its parameters) influences the statistics embodied in the language model (LM), creating a potential mismatch with the test set. Strictly speaking, one would have to resegment the training data, recreate the word-level transcriptions, and retrain the language model every time the segmentation process is modified.

Acoustic segmentation yields units that are not linguistically coherent, and hence sub-optimal for language modeling. Research [10] shows that N-gram LMs based on complete linguistic segments give lower perplexity than those based only on acoustic segmentations. Work reported in [12] showed that the word error rate can be reduced simply by resegmenting [...].

Explicit modeling of phenomena such as disfluencies also requires linguistic (as opposed to acoustic) segmentation [15]. Similarly, sophisticated language models based on syntactic structure assume sentences as their input [12].

The following excerpt from the Switchboard corpus [2] illustrates the discrepancies between linguistic and acoustic segmentations. Linguistic segment boundaries are marked [...], and acoustic segment boundaries are indicated by //. A subset of these corresponds to turn boundaries, [...].

B: Worried they're going to get enough attention?
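The idea of scoring hypotheses with hidden linguistic segment boundaries can be illustrated with a minimal sketch. It is not the algorithm of the paper, whose details are not given in this record. It assumes a toy bigram table, a hypothetical boundary token `<s>`, an unnormalized probability floor for unseen bigrams, and hypothetical function names (`hidden_boundary_logprob`, `rescore_nbest`). With a bigram model, the sum over all placements of hidden boundaries factorizes position by position, so each word contributes the probability of following its predecessor directly plus the probability of following it across a boundary.

```python
import math

BOUNDARY = "<s>"  # hypothetical token for a linguistic segment boundary


def hidden_boundary_logprob(words, bigram, floor=1e-6):
    """Log probability of `words`, summed over all hidden boundary placements.

    `bigram[(prev, nxt)]` is an assumed toy table of P(nxt | prev).
    Missing entries fall back to `floor` (a crude stand-in for smoothing,
    so the toy model is not properly normalized).
    """
    def p(prev, nxt):
        return bigram.get((prev, nxt), floor)

    logprob = 0.0
    prev = BOUNDARY  # assumption: every hypothesis starts a new segment
    for word in words:
        if prev == BOUNDARY:
            step = p(BOUNDARY, word)
        else:
            # Either no boundary precedes `word`, or a hidden boundary does.
            step = p(prev, word) + p(prev, BOUNDARY) * p(BOUNDARY, word)
        logprob += math.log(step)
        prev = word
    return logprob


def rescore_nbest(nbest, bigram, lm_weight=1.0):
    """Sort (words, acoustic_logprob) hypotheses by combined score, best first."""
    scored = [
        (acoustic + lm_weight * hidden_boundary_logprob(words, bigram), words)
        for words, acoustic in nbest
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Toy usage with made-up probabilities and acoustic scores.
toy_bigram = {
    ("<s>", "yeah"): 0.5,
    ("yeah", "<s>"): 0.4,
    ("<s>", "i"): 0.3,
    ("yeah", "i"): 0.1,
    ("i", "know"): 0.5,
}
toy_nbest = [
    (["yeah", "i", "no"], -9.5),
    (["yeah", "i", "know"], -10.0),
]
best_score, best_words = rescore_nbest(toy_nbest, toy_bigram)[0]
```

In this toy example the hypothesis with the slightly worse acoustic score wins after rescoring, because the language model can account for "yeah" ending one segment and "i know" starting another. Higher-order N-grams would require tracking the hidden state explicitly, for instance with a forward computation as in hidden Markov models.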

References (11)
Marie Meteer, Rukmini Iyer. Modeling Conversational Speech for Speech Recognition. Empirical Methods in Natural Language Processing (1996).
A. Stolcke, E. Shriberg. Automatic linguistic segmentation of conversational speech. International Conference on Spoken Language Processing, vol. 2, pp. 1005-1008 (1996). DOI: 10.1109/ICSLP.1996.607773
Andreas Stolcke, Mitchel Weintraub, Yochai Konig. Explicit word error minimization in N-best list rescoring. Conference of the International Speech Communication Association (1997).
M. Ostendorf, A. Kannan, S. Austin, O. Kimball, R. Schwartz, J. R. Rohlicek. Integration of diverse recognition methodologies through reevaluation of N-best sentence hypotheses. Human Language Technology, pp. 83-87 (1991). DOI: 10.3115/112405.112416
L. Rabiner, B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, vol. 3, pp. 4-16 (1986). DOI: 10.1109/MASSP.1986.1165342
A. Stolcke, E. Shriberg. Statistical language modeling for speech disfluencies. International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 405-408 (1996). DOI: 10.1109/ICASSP.1996.541118
Andreas Stolcke, Elizabeth Shriberg, Rebecca A. Bates. A prosody only decision-tree model for disfluency detection. Conference of the International Speech Communication Association (1997).
S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, pp. 400-401 (1987). DOI: 10.1109/TASSP.1987.1165125
H. Murveit, J. Butzberger, V. Digalakis, M. Weintraub. Large-vocabulary dictation using SRI's DECIPHER speech recognition system: progressive search techniques. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 319-322 (1993). DOI: 10.1109/ICASSP.1993.319301
T. Zeppenfeld, M. Finke, K. Ries, M. Westphal, A. Waibel. Recognition of conversational telephone speech using the JANUS speech engine. International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1815-1818 (1997). DOI: 10.1109/ICASSP.1997.598889