Authors: Lidia S. Chao, Derek F. Wong, Qiang Wang, Tong Xiao, Bei Li
DOI:
Keywords:
Abstract: Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for system development, while the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep model can surpass its Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT'16 English-German, NIST OpenMT'12 Chinese-English and larger WMT'18 tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As another bonus, the deep model is 1.6X smaller in size and 3X faster in training than Transformer-Big.
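The two ingredients named in the abstract (placing layer normalization before each sub-layer, and feeding each layer a learned combination of all previous layers' outputs) can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the module names, the softmax-normalized combination weights, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PreNormEncoderLayer(nn.Module):
    """Pre-norm Transformer encoder layer: LayerNorm is applied before each
    sub-layer, which is commonly reported to ease training of deep stacks."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)
        x = x + self.dropout(h)          # residual around self-attention
        h = self.ffn(self.norm2(x))
        return x + self.dropout(h)       # residual around feed-forward


class DeepEncoder(nn.Module):
    """Deep stack in which layer l+1 reads a learned combination of the
    outputs of all previous layers (embedding output included). The softmax
    parameterization of the weights is an assumption for this sketch."""

    def __init__(self, n_layers=30, d_model=512):
        super().__init__()
        self.layers = nn.ModuleList(
            [PreNormEncoderLayer(d_model) for _ in range(n_layers)])
        # weights[l][k]: contribution of output k to the input of layer l+1
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.zeros(l + 1)) for l in range(n_layers)])
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        outputs = [x]                    # y_0 = embedding output
        for l, layer in enumerate(self.layers):
            w = torch.softmax(self.weights[l], dim=0)
            combined = sum(w_k * y_k for w_k, y_k in zip(w, outputs))
            outputs.append(layer(combined))
        return self.final_norm(outputs[-1])


if __name__ == "__main__":
    enc = DeepEncoder(n_layers=6, d_model=64)
    src = torch.randn(2, 10, 64)         # (batch, seq_len, d_model)
    print(enc(src).shape)                # torch.Size([2, 10, 64])
```

The sketch only illustrates the encoder side; masking, positional encodings, and the decoder are omitted, and the exact layer-combination formula in the paper may differ from the softmax-weighted sum used here.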