Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
DOI:
Keywords:
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
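To make the abstract's central claim concrete, the sketch below shows scaled dot-product attention, the operation the Transformer uses in place of recurrence and convolution. The formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V is from the paper itself; the NumPy implementation, function name, and example shapes are illustrative assumptions rather than the authors' code.

```python
# Minimal sketch of scaled dot-product attention (paper notation: Q, K, V, d_k).
# NumPy is used purely for illustration; this is not the authors' implementation.
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    # Similarity of each query to each key, scaled by sqrt(d_k) to keep
    # the softmax in a well-conditioned range.
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted sum of the value vectors.
    return weights @ v

# Hypothetical example: 4 query positions attending over 6 key/value
# positions, with d_k = 8.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(q, k, v)  # shape (4, 8)
```

Because every query attends to every key in one matrix product, all positions are processed in parallel, which is the source of the parallelizability the abstract contrasts with recurrent models.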