Authors: Shakeel Muhammad, Sudo Yui, Peng Yifan, Watanabe Shinji
Abstract: End-to-end (E2E) automatic speech recognition (ASR) models have two desirable operating modes: online and offline. The online ASR mode, which operates under strict latency constraints, processes speech frames in real time to provide transcriptions. Conversely, the offline ASR mode waits for the complete utterance before generating a transcription. Recently, online and offline ASR have been integrated for recurrent neural network transducers (RNN-T) through the joint training of online and offline encoders with a shared decoder. However, this integration comes at the cost of performance degradation in the offline ASR mode, since the shared decoder must handle features of varying contexts. Within this E2E framework that integrates online and offline encoders, we explore two approaches to enhance the performance of both ASR modes. First, we introduce separate RNN-T decoders for each ASR mode while keeping the encoders shared, thereby effectively handling features of different contexts. Second, we explore multiple auxiliary loss criteria that introduce additional regularization, enhancing the overall stability and performance of the framework. Overall, evaluation results show 1.8%-2.5% relative character error rate reductions (CERRs) on the Corpus of Spontaneous Japanese (CSJ) for online ASR, and 4.4%-6.3% relative CERRs for offline ASR, within a single model compared to separate online and offline models.
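The abstract's core recipe, one shared encoding path, mode-specific RNN-T decoders, and auxiliary losses summed into a single training objective, can be sketched in PyTorch. The following is a minimal illustration under simplifying assumptions, not the authors' implementation: a single shared Transformer encoder run with and without a causal attention mask stands in for the online/offline encoder setup, two prediction/joint networks play the role of the separate RNN-T decoders, and a CTC head gives one example of an auxiliary loss criterion. All module names, sizes, and loss weights are hypothetical.

```python
import torch
import torch.nn as nn
import torchaudio


class DualModeRNNT(nn.Module):
    """Shared encoding path, separate per-mode RNN-T decoders, auxiliary CTC."""

    def __init__(self, idim=80, hdim=256, vocab=100, blank=0):
        super().__init__()
        self.proj = nn.Linear(idim, hdim)
        layer = nn.TransformerEncoderLayer(hdim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Mode-specific prediction and joint networks: the separate "decoders".
        self.pred = nn.ModuleDict(
            {m: nn.Embedding(vocab, hdim) for m in ("online", "offline")})
        self.joint = nn.ModuleDict(
            {m: nn.Linear(hdim, vocab) for m in ("online", "offline")})
        self.ctc_head = nn.Linear(hdim, vocab)  # one example auxiliary head
        self.rnnt_loss = torchaudio.transforms.RNNTLoss(blank=blank)
        self.ctc_loss = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.blank = blank

    def _encode(self, feats, causal):
        # Online mode: a causal mask hides future frames (strict latency).
        # Offline mode: no mask, so every frame sees the full utterance.
        x = self.proj(feats)
        mask = None
        if causal:
            t = x.size(1)
            mask = torch.triu(
                torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        return self.encoder(x, mask=mask)

    def _rnnt(self, enc, mode, targets, feat_lens, target_lens):
        # Prepend blank so the prediction net consumes <blank>, y1, ..., yU.
        y_in = nn.functional.pad(targets, (1, 0), value=self.blank)
        pred = self.pred[mode](y_in)                           # (B, U+1, H)
        logits = self.joint[mode](
            torch.tanh(enc.unsqueeze(2) + pred.unsqueeze(1)))  # (B, T, U+1, V)
        return self.rnnt_loss(
            logits, targets.int(), feat_lens.int(), target_lens.int())

    def forward(self, feats, feat_lens, targets, target_lens):
        # targets: (B, U) label IDs in 1..vocab-1; 0 is reserved for blank.
        enc_on = self._encode(feats, causal=True)
        enc_off = self._encode(feats, causal=False)
        l_on = self._rnnt(enc_on, "online", targets, feat_lens, target_lens)
        l_off = self._rnnt(enc_off, "offline", targets, feat_lens, target_lens)
        # Auxiliary CTC loss on the offline encoding adds regularization.
        logp = self.ctc_head(enc_off).log_softmax(-1).transpose(0, 1)
        l_ctc = self.ctc_loss(logp, targets, feat_lens, target_lens)
        return l_on + l_off + 0.3 * l_ctc  # hypothetical loss weight
```

At inference time, one would encode with the mask matching the deployment mode and decode with the corresponding RNN-T decoder, so a single trained model serves both the latency-constrained and full-context use cases.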