Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

DOI:

Keywords:

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
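The abstract describes the Transformer as built entirely on attention mechanisms. For orientation, below is a minimal NumPy sketch of scaled dot-product attention, the core operation the paper defines as softmax(QK^T / sqrt(d_k)) V; the shapes, variable names, and toy inputs are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (len_q, d_k) queries, K: (len_k, d_k) keys, V: (len_k, d_v) values.
    mask: optional boolean (len_q, len_k) array; True marks positions to hide.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (len_q, len_k) similarity scores
    if mask is not None:
        scores = np.where(mask, -1e9, scores)    # block attention to masked positions
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # (len_q, d_v) weighted sum of values

# Toy usage with random queries, keys, and values (hypothetical sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 16)
```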
