Using vision to improve sound source separation

Authors: Hiroshi G. Okuno, Hiroaki Kitano, Yukiko Nakagawa

DOI:

Keywords:

Abstract: We present a method of improving sound source separation using vision. Sound source separation is an essential function for auditory scene understanding: it separates the streams of sound generated by multiple sources. With separated sounds, a recognition process such as speech recognition can work on a single stream rather than on a mixture of several speakers. Separation performance is known to improve with stereo/binaural microphones and microphone arrays, which provide spatial information for separation. However, these methods still leave positional ambiguities of more than 20 degrees. In this paper, we further add visual information to provide specific and accurate position information. As a result, the separation capability is drastically improved. In addition, we found that the use of approximate sound direction improves the object tracking accuracy of a simple vision system, which in turn improves the separation system. We claim that integrating auditory and visual inputs improves the tasks of each perception, with separation and tracking bootstrapping each other.
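The fusion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function name and the selection rule are assumptions. It models the stated situation: auditory localization gives a coarse azimuth with roughly ±20 degrees of ambiguity, while a vision system tracks objects at precise bearings, so the visually tracked object that falls inside the auditory ambiguity window can disambiguate the sound direction.

```python
def fuse_direction(audio_azimuth_deg, visual_azimuths_deg, ambiguity_deg=20.0):
    """Refine a coarse auditory direction estimate with visual bearings.

    audio_azimuth_deg:   azimuth from auditory localization (coarse, +/-20 deg)
    visual_azimuths_deg: precise bearings of visually tracked objects
    Returns the matching visual bearing, or the auditory estimate when no
    tracked object lies within the ambiguity window.
    """
    # keep only tracked objects consistent with the auditory estimate
    candidates = [v for v in visual_azimuths_deg
                  if abs(v - audio_azimuth_deg) <= ambiguity_deg]
    if not candidates:
        return audio_azimuth_deg  # vision offers no disambiguation here
    # choose the tracked object closest to the auditory estimate
    return min(candidates, key=lambda v: abs(v - audio_azimuth_deg))

# Two speakers tracked visually at 5 and 42 degrees; audio says "about 38".
print(fuse_direction(38.0, [5.0, 42.0]))  # -> 42.0
```

The same coupling can run in the other direction, as the abstract notes: the coarse auditory azimuth can narrow the image region a simple tracker must search, which is the mutual bootstrapping the paper claims.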

References (12)
Tomohiro Nakatani, Hiroshi G. Okuno, Takeshi Kawabata, Understanding three simultaneous speakers (1997)
Shigeru Ando, An Autonomous Three-Dimensional Vision Sensor with Ears, IEICE Transactions on Information and Systems, vol. 78, pp. 1621-1629 (1995)
Hiroshi G. Okuno, David F. Rosenthal, Computational Auditory Scene Analysis, L. Erlbaum Associates Inc. (1998)
Rodney A. Brooks, Cynthia Breazeal, Robert Irie, Charles C. Kemp, Matthew Marjanovic, Brian Scassellati, Matthew M. Williamson, Alternative essences of intelligence, National Conference on Artificial Intelligence, pp. 961-968 (1998), doi:10.21236/ADA457180
S. Boll, A spectral subtraction algorithm for suppression of acoustic noise in speech, ICASSP '79: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 200-203 (1979), doi:10.1109/ICASSP.1979.1170696
M. Rucci, R. Bajcsy, Learning visuo-tactile coordination in robotic systems, International Conference on Robotics and Automation, vol. 3, pp. 2678-2683 (1995), doi:10.1109/ROBOT.1995.525661
Bodden, Modeling human sound-source localization and the cocktail-party-effect, Acta Acustica, vol. 1, pp. 43-55 (1993)
Tomohiro Nakatani, Hiroshi G. Okuno, Harmonic sound stream segregation using localization and its application to speech stream segregation, Speech Communication, vol. 27, pp. 209-222 (1999), doi:10.1016/S0167-6393(98)00079-X
G. J. Wolff, Sensory fusion: integrating visual and auditory information for recognizing speech, IEEE International Conference on Neural Networks, pp. 672-677 (1993), doi:10.1109/ICNN.1993.298635
Tomohiro Nakatani, Hiroshi G. Okuno, Takeshi Kawabata, Auditory stream segregation in auditory scene analysis with a multi-agent system, National Conference on Artificial Intelligence, pp. 100-107 (1994)