Authors: Felipe Cruz Salinas, Kenichi Kumatani, Robert Gmyr, Linquan Liu, Yu Shi
DOI:
Keywords:
Abstract: The sparsely-gated mixture of experts (MoE) architecture can scale out large Transformer models to orders of magnitude which are not achievable by dense models with the current …