Learning Deep Transformer Models for Machine Translation

Authors: Lidia S. Chao, Derek F. Wong, Qiang Wang, Tong Xiao, Bei Li

DOI:

Keywords:

Abstract: Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising for improving models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for the development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT'16 English-German, NIST OpenMT'12 Chinese-English and larger WMT'18 Chinese-English tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As another bonus, the deep model is 1.6X smaller in size and 3X faster in training than Transformer-Big.
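
The abstract names two ingredients without giving details: proper use of layer normalization, and passing a combination of previous layers to the next layer. The sketch below is one possible reading, not the authors' implementation. It assumes that "proper use" means normalizing the sublayer input (pre-norm) rather than the residual sum (post-norm), and that the "combination" is a weighted sum of all earlier layer outputs. The function names, the uniform weights, and the toy sublayer are all hypothetical; in a real model the weights would be learned.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_step(x, sublayer):
    """Post-norm residual block: normalize after adding the sublayer output."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_step(x, sublayer):
    """Pre-norm residual block (assumed reading): normalize only the sublayer
    input, so the identity path carries x through unchanged."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def combine_layers(outputs, weights):
    """Weighted sum of all previous layer outputs (weights are hypothetical)."""
    dim = len(outputs[0])
    return [sum(w * y[i] for w, y in zip(weights, outputs)) for i in range(dim)]


def deep_encoder(x, sublayers):
    """Stack pre-norm blocks; each block reads a combination of earlier outputs."""
    outputs = [x]
    for sublayer in sublayers:
        # Assumption: uniform averaging stands in for learned combination weights.
        weights = [1.0 / len(outputs)] * len(outputs)
        layer_input = combine_layers(outputs, weights)
        outputs.append(pre_norm_step(layer_input, sublayer))
    return outputs[-1]


def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [0.5 * v for v in x]


result = deep_encoder([1.0, 2.0, 3.0], [toy_sublayer] * 4)
```

In the pre-norm variant the residual path is a plain identity, which is the property usually credited with easing gradient flow through many stacked layers; the post-norm variant is included only for contrast.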

References (36)
Diederik P. Kingma, Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv: Learning (2014).
Yoshua Bengio, Tomas Mikolov, Razvan Pascanu. On the difficulty of training recurrent neural networks. International Conference on Machine Learning, pp. 1310-1318 (2013).
Thang Luong, Hieu Pham, Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. Empirical Methods in Natural Language Processing, pp. 1412-1421 (2015). 10.18653/V1/D15-1166
Qiang Li, Hao Zhang, Tong Xiao, Jingbo Zhu. NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation. Meeting of the Association for Computational Linguistics, pp. 19-24 (2012).
Hang Li, Zhengdong Lu. A Deep Architecture for Matching Short Texts. Neural Information Processing Systems, vol. 26, pp. 1367-1375 (2013).
Ilya Sutskever, Quoc V. Le, Oriol Vinyals. Sequence to Sequence Learning with Neural Networks. Neural Information Processing Systems, vol. 27, pp. 3104-3112 (2014).
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. Computer Vision and Pattern Recognition, pp. 770-778 (2016). 10.1109/CVPR.2016.90
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Identity Mappings in Deep Residual Networks. Computer Vision – ECCV 2016, pp. 630-645 (2016). 10.1007/978-3-319-46493-0_38