Authors: Hiroshi G. Okuno, Hiroaki Kitano, Yukiko Nakagawa
DOI:
Keywords:
Abstract: We present a method of improving sound source separation using vision. Sound source separation is an essential function for accomplishing auditory scene understanding, because it separates the streams of sounds generated from multiple sound sources. By separating sounds into streams, a recognition process such as speech recognition can simply work on a single stream rather than on the mixed sounds of several speakers. Separation performance is known to be improved by a stereo/binaural microphone or a microphone array, which provides spatial information for separation. However, these methods still suffer from more than 20 degrees of positional ambiguity. In this paper, we further add visual information to provide specific, accurate position information. As a result, the separation capability was drastically improved. In addition, we found that the use of an approximate sound source direction improves the object tracking accuracy of a simple vision system, which in turn improves the separation system. We claim that the integration of visual and auditory inputs improves the tasks of each perception, of tracking, and of bootstrapping.
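The abstract describes two mechanisms: a coarse azimuth estimate from a binaural microphone pair, and refinement of that estimate with a more precise direction from a vision system. The sketch below illustrates both ideas under stated assumptions; it is not the authors' implementation, and the function names, microphone spacing, sampling rate, and fusion weights are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of audio-visual direction fusion. Not the paper's method;
# parameter values below are illustrative assumptions.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s at ~20 degrees C (assumed)
MIC_SPACING_M = 0.2      # spacing of the binaural microphone pair (assumed)
SAMPLE_RATE = 16000      # Hz (assumed)

def itd_direction(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate source azimuth (degrees) from the interaural time difference.

    Takes the lag of the cross-correlation peak between the two channels,
    then converts delay to angle with the far-field relation
    sin(theta) = delay * c / d.
    """
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)      # peak lag in samples
    delay = lag / SAMPLE_RATE                     # lag in seconds
    # Clamp to the physically attainable range before taking arcsin.
    s = np.clip(delay * SPEED_OF_SOUND / MIC_SPACING_M, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

def fuse_directions(audio_deg: float, vision_deg: float,
                    audio_std: float = 20.0, vision_std: float = 2.0) -> float:
    """Inverse-variance weighted fusion of the two azimuth estimates.

    The audio_std default mirrors the ~20 degrees of ambiguity the abstract
    reports for microphone-only localization; vision is assumed to be much
    more precise, so it dominates whenever a visual match is available.
    """
    wa, wv = 1.0 / audio_std ** 2, 1.0 / vision_std ** 2
    return (wa * audio_deg + wv * vision_deg) / (wa + wv)
```

A symmetric use of the coarse audio estimate, as the abstract suggests, would be to restrict the vision system's search to a window around `itd_direction`'s output, improving tracking when the visual target is ambiguous; that feedback loop is the "bootstrapping" the authors claim.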