Abstract: This thesis describes research on speaker recognition in real conditions such as meeting rooms, telephone-quality speech, and radio and TV broadcast news. The main objective is the automatic detection and classification of speakers in a smart-room scenario. Acoustic speaker recognition is the application of a machine to identify an individual from a spoken sentence. It aims at processing acoustic signals to convert them into symbolic descriptions corresponding to the identity of the speakers. For the last several years, speaker recognition in real situations has been attracting substantial research attention, becoming one of the spoken language technologies that adds improvement, or enrichment, to recording transcriptions. In particular, human activity that takes place in meeting rooms or classrooms, compared to other domains, exhibits increased complexity and poses a challenging problem due to the spontaneity of speech, reverberation effects, the presence of overlapped speech, room setup and channel variability, and a rich assortment of acoustic events, either produced by humans or by objects handled by them. Therefore, determining both the identity of the speakers and their position in time may help to detect and describe that human activity and to provide machine context awareness. We first seek to improve traditional modeling approaches for speaker identification and verification, which are based on Gaussian Mixture Models, through multi-decision and multi-channel processing strategies in the smart-room scenario. We put emphasis on studying speaker and channel variability techniques such as Maximum a Posteriori Adaptation, Nuisance Attribute Projection, Joint Factor Analysis, and score normalization, aiming to find strategies to deal with that drawback. Moreover, we describe a novel speaker verification algorithm that makes use of adapted features from automatic speech recognition. The second line of research relates to speaker detection in a continuous audio stream, where the optimum number of speakers and their identities are unknown a priori. We developed and adapted some of the previous speaker recognition techniques to a baseline speaker diarization system based upon Hidden Markov Models and Agglomerative Hierarchical Clustering (AHC). We evaluate the use of TDOA feature dynamics in order to improve the clustering initialization of the AHC and the handling of speaker overlaps; we assess the impact and synergies of Speech Activity Detection and Acoustic Event Detection integrated in the system; and we propose and compare new methods such as spectral clustering.
Adaptation of the diarization system to the broadcast news domain and to the speaker tracking task is also addressed. Finally, fusion and combination with video and image modalities is highlighted across this work, in both the speaker identification and the tracking approaches. Techniques such as Matching Weighting and the Particle Filter are proposed to combine scores and likelihoods from different modalities. The results provided demonstrate that these information sources can play an important role in the automatic person recognition task, adding complementary knowledge to traditional spectrum-based recognition systems and thus improving recognition accuracy. This work was performed in the framework of international and national projects, among them the CHIL EU project and the Catalan-funded project Tecnoparla, and through participation in technology evaluations such as CLEAR, NIST Rich Transcription (RT), NIST Speaker Recognition Evaluation (SRE), and the Spanish evaluation Albayzin.