2017-NIPS-Attention is All You Need #255

thangk commented Sep 6, 2024

Link: https://dl.acm.org/doi/10.5555/3295222.3295349

Main problem

Existing state-of-the-art sequence models such as ConvS2S need a number of operations that grows linearly with the distance between two positions in order to relate them (i.e., to track context across the sequence), which makes them expensive to scale for tasks such as language translation. The authors' proposed method relates any two positions in a constant number of operations, a much more efficient approach that scales far better and is therefore well suited to extremely high-parameter models such as LLMs.
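
To make the scaling argument concrete, the sketch below paraphrases Section 4 and Table 1 of the paper: it compares how the cost of connecting two positions that are n tokens apart grows for each family of models (k is the convolution kernel width).

```latex
% Paraphrase of the paper's complexity comparison (Table 1 / Section 4):
% cost of connecting two positions that are n tokens apart.
\begin{align*}
\text{Recurrent (LSTM/GRU):}         &\quad O(n)   \ \text{sequential steps} \\
\text{Convolutional (ConvS2S):}      &\quad O(n/k) \ \text{stacked layers} \\
\text{Self-attention (Transformer):} &\quad O(1)   \ \text{attention step}
\end{align*}
```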

Proposed method

The authors propose a new architecture called the Transformer. It is simpler than other sequence-to-sequence models such as ConvS2S (a CNN-based model) or RNN-with-attention models built on LSTMs or GRUs. It is also encoder-decoder based, and one of its key features is the self-attention mechanism, which appears in both the encoder and decoder layers. Multi-head self-attention lets the model attend to several kinds of context (different representation subspaces) simultaneously. It is called self-attention because every token in the sequence attends to every other token in the same sequence to weigh how relevant they are to it, with positional information injected by adding positional encodings to the token embeddings. Because the tokens in a sequence are processed together rather than one after another as in an RNN, the architecture is highly parallelizable.
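
To make the mechanism concrete, below is a minimal NumPy sketch of the scaled dot-product attention at the core of each head, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; the toy shapes and random weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise token-to-token relevance
    weights = softmax(scores, axis=-1)   # attention distribution for each token
    return weights @ V                   # each output is a weighted sum of values

# Toy self-attention example: 4 tokens, model dimension 8 (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # token embeddings (plus positional encodings)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                         # (4, 8): one context-aware vector per token
```

In the full model, several such heads run in parallel on lower-dimensional projections and their outputs are concatenated (multi-head attention), which is what lets each position attend to different kinds of context at once.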

My Summary

The newly proposed architecture is a strong approach for high-parameter tasks such as language translation, chatbots, or other knowledge-intensive tasks where the model must absorb a great deal of information to serve users. The Transformer outperformed the existing state-of-the-art models by roughly 1–8% in BLEU score while costing far less to train.

[Image: comparison of BLEU scores and training costs for the Transformer versus previous state-of-the-art models]

As shown above, the Transformer (big) model (yellow) outperforms the best state-of-the-art model (purple) by 7.7% on EN-FR and 1.2% on EN-DE, while its training cost is about 3.3 times lower on EN-DE and about 52 times lower on EN-FR (green vs. orange). This suggests that as the scale of the task and the number of parameters grow, the efficiency gain from using the Transformer increases.

Datasets

- WMT 2014 English-German: 4.5M sentence pairs, ~37K vocabulary tokens
- WMT 2014 English-French: 36M sentence pairs, 32K vocabulary tokens