2017-NIPS-Attention is All You Need #255

thangk commented Sep 6, 2024

Link: https://dl.acm.org/doi/10.5555/3295222.3295349

Main problem

Existing state-of-the-art sequence models such as ConvS2S need a number of operations that grows linearly with the distance between two positions in order to relate them (i.e., to track context across the sequence), which makes them expensive to scale for tasks such as language translation. The authors' proposed method relates any two positions in a constant number of operations, a much more efficient approach that scales far better and is therefore well suited to extremely high-parameter models such as LLMs.
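
To make the scaling argument concrete, the sketch below paraphrases Section 4 and Table 1 of the paper: it compares how the cost of connecting two positions that are n tokens apart grows for each family of models (k is the convolution kernel width).

```latex
% Paraphrase of the paper's complexity comparison (Table 1 / Section 4):
% cost of connecting two positions that are n tokens apart.
\begin{align*}
\text{Recurrent (LSTM/GRU):}         &\quad O(n)   \ \text{sequential steps} \\
\text{Convolutional (ConvS2S):}      &\quad O(n/k) \ \text{stacked layers} \\
\text{Self-attention (Transformer):} &\quad O(1)   \ \text{attention step}
\end{align*}
```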

Proposed method

The authors propose a new architecture called the Transformer. It is simpler than other sequence-to-sequence models such as ConvS2S (a CNN-based model) or RNN-with-attention models built on LSTMs or GRUs. It is also encoder-decoder based, and one of its key features is the self-attention mechanism, which appears in both the encoder and decoder layers. Multi-head self-attention lets the model attend to several kinds of context (different representation subspaces) simultaneously. It is called self-attention because every token in the sequence attends to every other token in the same sequence to weigh how relevant they are to it, with positional information injected by adding positional encodings to the token embeddings. Because the tokens in a sequence are processed together rather than one after another as in an RNN, the architecture is highly parallelizable.
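
To make the mechanism concrete, below is a minimal NumPy sketch of the scaled dot-product attention at the core of each head, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; the toy shapes and random weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise token-to-token relevance
    weights = softmax(scores, axis=-1)   # attention distribution for each token
    return weights @ V                   # each output is a weighted sum of values

# Toy self-attention example: 4 tokens, model dimension 8 (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # token embeddings (plus positional encodings)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                         # (4, 8): one context-aware vector per token
```

In the full model, several such heads run in parallel on lower-dimensional projections and their outputs are concatenated (multi-head attention), which is what lets each position attend to different kinds of context at once.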

My Summary

The newly proposed architecture is a strong approach for high-parameter tasks such as language translation, chatbots, or other knowledge-intensive tasks where the model must absorb a great deal of information to serve users. The Transformer outperformed the existing state-of-the-art models by roughly 1–8% in BLEU score while costing far less to train.

[Image: comparison of BLEU scores and training costs for the Transformer versus previous state-of-the-art models]

As shown above, the Transformer (big) model (yellow) outperforms the best state-of-the-art model (purple) by 7.7% on EN-FR and 1.2% on EN-DE, while its training cost is about 3.3 times lower on EN-DE and about 52 times lower on EN-FR (green vs. orange). This suggests that as the scale of the task and the number of parameters grow, the efficiency gain from using the Transformer increases.

Datasets

- WMT 2014 English-German: 4.5M sentence pairs, ~37K vocabulary tokens
- WMT 2014 English-French: 36M sentence pairs, 32K vocabulary tokens