Authors: Yuxuan Lan, Richard W. Harvey, Barry-John Theobald, Jacob L. Newman, Stephen J. Cox
DOI:
Keywords:
Abstract: In speech recognition, the problem of speaker variability has been well studied. Common approaches to dealing with it include normalising for a speaker's vocal tract length and learning a linear transform that moves the speaker-independent models closer to a new speaker. In pure lip-reading (no audio) the problem has been less well studied. Results are often presented based on speaker-dependent (single speaker) or multi-speaker (speakers in the test set are also in the training set) data, situations that are of limited use in real applications. This paper shows the danger of not using different speakers in the training and test sets. Firstly, we present classification results on the single-word database AVletters 2, which is a high-definition version of the well-known AVletters database. By careful choice of features, we show that it is possible for the performance of visual-only lip-reading to be very close to that of audio-only recognition in the single-speaker and multi-speaker configurations. However, in the speaker-independent configuration, the performance of the visual-only channel degrades dramatically. By applying multidimensional scaling (MDS) to both the audio features and the visual features, we demonstrate that, when compared with MFCCs, the visual features commonly used have inherently small variation within a single speaker across all classes spoken. However, visual features are highly sensitive to the identity of the speaker, whereas audio features are relatively invariant.
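
The three evaluation configurations named in the abstract differ only in how recordings are divided between training and test sets. The Python sketch below is a minimal illustration of that difference, not the authors' protocol. The `Sample` record, the function names, and the toy corpus (three speakers, three words, two takes) are all hypothetical and do not reflect the actual contents of AVletters 2.

```python
from collections import namedtuple

# Hypothetical record for one utterance: who spoke, which word, which repetition.
Sample = namedtuple("Sample", ["speaker", "word", "take"])


def single_speaker_split(samples, speaker, test_take):
    """Speaker-dependent: train and test on one speaker, holding out one take."""
    own = [s for s in samples if s.speaker == speaker]
    train = [s for s in own if s.take != test_take]
    test = [s for s in own if s.take == test_take]
    return train, test


def multi_speaker_split(samples, test_take):
    """Multi-speaker: every speaker appears in both sets; one take is held out."""
    train = [s for s in samples if s.take != test_take]
    test = [s for s in samples if s.take == test_take]
    return train, test


def speaker_independent_splits(samples):
    """Speaker-independent: leave one speaker out, so train and test speakers are disjoint."""
    for held_out in sorted({s.speaker for s in samples}):
        train = [s for s in samples if s.speaker != held_out]
        test = [s for s in samples if s.speaker == held_out]
        yield held_out, train, test


# Toy corpus (assumed sizes, for illustration only).
samples = [
    Sample(speaker, word, take)
    for speaker in ("spk1", "spk2", "spk3")
    for word in ("A", "B", "C")
    for take in range(2)
]

multi_train, multi_test = multi_speaker_split(samples, test_take=1)
independent_folds = list(speaker_independent_splits(samples))
```

Only the last split measures how a recogniser copes with a speaker it has never seen, which is the case the abstract reports as degrading sharply for the visual-only channel.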