[Question] Merging with Trans-XL? #2
@gaceladri I've been thinking about it! Actually, I think this linear attention can be used anywhere you desire full attention but cannot pay the price. I'm currently working on the best implementation of Transformer-XL I can at https://github.com/lucidrains/compressive-transformer-pytorch You can make your request there for where you want linear attention applied!
@gaceladri If your question is whether one can introduce recurrence to this like Transformer-XL, that would be harder to do: these sequences are naturally so long that holding all the previous activations in memory would become infeasible at some point.
@gaceladri But hmm, now you've got me thinking. It may be possible to compress the hidden activations into something smaller and introduce recurrence on top of these linear-attention transformers. Haha, if you can propose something that makes sense, I'll consider adding it!
You're a fucking machine! Haha, I am working on an implementation of Transformer-XL with adaptive softmax and dynamic evaluation, and I found this awesome repo when I was looking for your Linformer. Just thinking and questioning... At this moment I don't know which one would be better... But I read the DeepMind paper and I don't think compression would be the best way, since you need to add more computation to your model, and I am working toward the SustaiNLP workshop at EMNLP 2020. I am sure there is a better way than compression. But I am not an expert like you or any DeepMinder... :)
I only skimmed the paper, but my intuition says they got better results just by adding more computation in one way or another... I haven't read the full paper yet.
@lucidrains One thing I would do is add a bottleneck after the embedding, like in the MobileBERT paper, and then add the linear attention to the Transformer-XL. Would that make sense?
@gaceladri Ohh, I believe that linear layer is basically the embedding factorization from ALBERT? It's available as a feature in most of my repos.
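(A minimal sketch of what that factorized embedding looks like; the class and argument names here are illustrative, not the repo's actual keyword arguments.)

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """ALBERT-style embedding factorization: embed tokens into a small
    space, then project up to the model dimension with a linear layer."""
    def __init__(self, num_tokens, emb_dim, model_dim):
        super().__init__()
        self.token_emb = nn.Embedding(num_tokens, emb_dim)  # small embedding table
        self.project = nn.Linear(emb_dim, model_dim)        # bottleneck -> model dim

    def forward(self, token_ids):
        return self.project(self.token_emb(token_ids))

# usage: a 30k vocab, 128-dim embeddings expanded to a 512-dim model
emb = FactorizedEmbedding(30000, 128, 512)
x = emb(torch.randint(0, 30000, (1, 16)))  # -> (1, 16, 512)
```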
@gaceladri So, linear attention actually did not yield very good results for me beyond a certain length. This repository was mainly to explore whether combining it with a local inductive prior (local attention) would improve its performance. At a sequence length of 4096, on enwik8, causal linear attention alone fails to converge. However, I have had good results using linear attention in places where I cannot use full attention: https://github.com/lucidrains/stylegan2-pytorch#attention So my decision tree is now basically to use linear attention only where I cannot use full attention, and perhaps combine it with local attention for some cheap, weaker global attention.
@gaceladri Also, I am far from an expert; I just find it easier to learn concepts if I build them :)
@gaceladri I find the Compressive Transformer interesting, because it successfully combined a memory-write mechanism with a recurrent attention network. After learning Nvidia used an LSTM with Neural Turing Machine-like memory writes/reads to learn the dynamics of Pac-Man from pixels, I've become a lot more interested in memory in general.
Well, thanks for your answer! I am not looking for longer sequences, just efficiency and some performance. Do you think it can work with a good performance/efficiency trade-off at sequence lengths around 512 or so? Just to clarify: linear attention works by simply swapping the attention function in place of the normal attention computation?
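(For reference, a minimal sketch of the reordering linear attention does, using the feature map from the "Transformers are RNNs" paper; this repo's actual normalization differs slightly.)

```python
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    # standard softmax attention: materializes an (n x n) score matrix
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # apply a positive feature map to q and k, then use associativity
    # q' (k'^T v), so nothing of size (n x n) is ever built
    q, k = F.elu(q) + 1, F.elu(k) + 1
    context = k.transpose(-2, -1) @ v                      # (d_k x d_v) summary
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (q @ context) / normalizer

# shapes: (batch, heads, seq_len, dim_head)
q = k = v = torch.randn(1, 8, 512, 64)
out = linear_attention(q, k, v)                            # -> (1, 8, 512, 64)
```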
Sure, interesting! But maybe not so much when you are looking for low-computation models?
@gaceladri Yes, actually linear attention worked very well for me at lengths below 2048!
@gaceladri Ohh, I understand now, you are primarily interested in efficiency. Got it. Have you read the Performer paper from DeepMind yet? https://arxiv.org/pdf/2006.03555.pdf
I have also been looking at Product Key Memory layers, but I am working in TF and the official implementation is in PyTorch. The EmbeddingBag equivalent in TensorFlow is not optimised, so it is veeeery slow compared to PyTorch.
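(For reference, a tiny sketch of how PyTorch's nn.EmbeddingBag aggregates the top-k memory values in a PKM-style lookup; the sizes and the weighting are illustrative, not the official implementation.)

```python
import torch
import torch.nn as nn

num_memories, value_dim, topk = 512 * 512, 256, 32

# one row of `values` per memory slot; mode='sum' lets us weight and sum the top-k hits
values = nn.EmbeddingBag(num_memories, value_dim, mode='sum')

batch = 4
indices = torch.randint(0, num_memories, (batch, topk))   # top-k selected memory slots
scores = torch.softmax(torch.randn(batch, topk), dim=-1)  # their normalized scores

out = values(indices, per_sample_weights=scores)          # -> (batch, value_dim)
```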
Sounds good! I am going to read it! Thanks!
@gaceladri PKM works great for me! My researcher friend @AranKomat is actually studying conditional computation and cued me in on that.
@gaceladri So what you should know is that the auto-regressive flavor of linear attention actually incurs a much greater memory cost, but EPFL wrote a CUDA kernel that alleviates that issue: https://github.com/idiap/fast-transformers I also have a non-CUDA solution, but it requires pairing it with local (QK)V attention. Otherwise, for the non-autoregressive case, it is very efficient, and multiple people (including me) have found it worked for their problems. But it isn't as good as full attention. The only paper that claims it is as good is the Linformer paper, and they only benchmarked it against RoBERTa at a length of 4096.
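(To illustrate where that memory cost comes from, a rough sketch of naive causal linear attention; this is my own illustration, not the repo's or the fast-transformers implementation.)

```python
import torch
import torch.nn.functional as F

def naive_causal_linear_attention(q, k, v, eps=1e-6):
    # shapes: (batch, heads, seq_len, dim_head)
    q, k = F.elu(q) + 1, F.elu(k) + 1                      # positive feature map
    kv = torch.einsum('bhnd,bhne->bhnde', k, v)            # per-position outer products
    kv_cum = kv.cumsum(dim=2)                              # running key-value summaries
    k_cum = k.cumsum(dim=2)                                # running key sums
    num = torch.einsum('bhnd,bhnde->bhne', q, kv_cum)      # each query attends to its prefix
    den = torch.einsum('bhnd,bhnd->bhn', q, k_cum).unsqueeze(-1) + eps
    # kv_cum is (batch, heads, seq, d, d): keeping a summary for every position
    # is the memory cost a fused kernel (e.g. fast-transformers) avoids
    return num / den

q = k = v = torch.randn(1, 8, 1024, 64)
out = naive_causal_linear_attention(q, k, v)               # -> (1, 8, 1024, 64)
```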
@gaceladri I also offer the Linformer linear attention in this repository, so feel free to try it and let me know what you discover :)
I follow Aran on Twitter :P. The thing is, I would like to avoid ad-hoc CUDA kernels, since ultimately I would like to deploy the model on a mobile device. "Linear attention actually incurs a much greater memory cost": could it be because it is doing a softmax over the hidden space? Maybe an adaptive softmax could alleviate that? https://arxiv.org/abs/1809.10853
Only the auto-regressive (GPT-like) linear attention incurs the greater cost. If you are building BERT-like models, there will be no extra cost.
@gaceladri Knowing you care most about efficiency, the DeepMind paper is most relevant, because they claim you can take models pre-trained with full attention and fine-tune them into linear-attention models with little loss in accuracy.
"They claim you can take pre-trained models on full-attention, and fine-tune them into linear-attention" Awesome! But I don't think that I could reproduce their result in a short time that I have at this moment... Great discussion by the way! I am going to sleep on it and I will let you know if I find the linear attention useful on my Trans-xl or not! :=) Have a good night! 👍 |
@gaceladri Ok! Let me know if you need any modifications to this repository so you can use any of the functions that are not exposed! Good night!
Hi,
I have been reading the linear transformer paper and watching the Yannic video about it now. I have a question; maybe I am missing something, but have you normalized the linear attention? Could that be why it is not converging on enwik8?
@gaceladri Yup! Aran actually sent me the collaborative-heads paper! The key line seems to be https://github.com/epfml/collaborative-attention/blob/53ca19deebf62581b412b557da3455974afc7549/src/collaborative_attention/collaborative_attention.py#L96 Yup, multiple papers seem to approach the normalization in different ways. I went with what was proposed in the original Efficient Attention paper: https://github.com/lucidrains/linear-attention-transformer/blob/master/linear_attention_transformer/linear_attention_transformer.py#L228-L229 and https://github.com/lucidrains/linear-attention-transformer/blob/master/linear_attention_transformer/linear_attention_transformer.py#L263 (causal)
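(For context, a rough sketch of the Efficient Attention style of normalization referenced above: softmax over the query feature dimension and over the key sequence dimension. Illustrative only, not copied from the repo.)

```python
import torch

def efficient_attention(q, k, v):
    # shapes: (batch, heads, seq_len, dim_head)
    q = q.softmax(dim=-1)      # normalize each query over its feature dimension
    k = k.softmax(dim=-2)      # normalize keys over the sequence dimension
    context = torch.einsum('bhnd,bhne->bhde', k, v)   # global (d_k x d_v) context
    return torch.einsum('bhnd,bhde->bhne', q, context)

q = k = v = torch.randn(1, 8, 512, 64)
out = efficient_attention(q, k, v)   # -> (1, 8, 512, 64)
```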
I'll give the "Transformers are RNNs" paper's approach a try too, but I think they are all roughly the same. The speech synthesis results were not very encouraging. I don't think this approach can beat full attention, but it can probably serve as a weaker global attention that integrates local attention results in later layers (which is what this repo tries to do).
@gaceladri I think collaborative-head attention is actually two ideas in one. There are already papers showing you can get away with one set of key/value heads (here they use one set of queries). But the 'collab' aspect of mixing the heads seems reminiscent of the Talking-Heads Attention paper from Shazeer: https://arxiv.org/abs/2003.02436
@lucidrains I am a bit confused about the dimensions, since I am working with TF and I am aware that PyTorch treats the dimensions differently. At this line, what are the dimensions of q, k, v respectively? q = [batch, from, heads, size_per_head] or q = [batch, heads, from, size_per_head]?
@gaceladri The latter!
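(A small illustration of that layout for anyone porting to TF; the einsum below is illustrative, not the exact line being referenced.)

```python
import torch

batch, heads, seq_len, dim_head = 2, 8, 512, 64

# in this layout the head dimension comes before the sequence dimension
q = torch.randn(batch, heads, seq_len, dim_head)
k = torch.randn(batch, heads, seq_len, dim_head)
v = torch.randn(batch, heads, seq_len, dim_head)

# same subscripts work with tf.einsum: scores are (batch, heads, from, to)
scores = torch.einsum('bhnd,bhmd->bhnm', q, k)
print(scores.shape)  # torch.Size([2, 8, 512, 512])
```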
@gaceladri What are you building? lol
@lucidrains @gaceladri I'd like to note that collab head was not evaluated on other important datasets such as WebText (which is both more diverse and longer than the ones tested), so I'm not sure whether it works on WebText (or word-level WikiText-103) or not. As stated in this tweet of mine, a model with a smaller self-attention budget (h=2) performs on par with the baseline on some datasets, while it does not on WebText. I think it is always important to test any model on either WikiText-103 or WebText, as other datasets may not use the self-attention module as much.
@AranKomat Thanks! What I am looking for is some trade-off between accuracy (ppl) and efficiency, since what I care about most is scalability. Thanks for the point. My idea was to train my model with h=4, just on intuition, nothing theoretical. @lucidrains I think I must have some bug: at sequence length 128, the memory consumption of the einsum attention coming from MobileBERT is the same as the linear_trans that I implemented... I think at length 128 the difference should be negligible, but the first try looks suspicious. Tomorrow I will check what is happening... I am also thinking about throwing everything away and starting from scratch in PyTorch... Tomorrow is going to be an awesome day... 👯 😄 Edit: Aran, if it is on your roadmap to check this, please let me know. Thanks!
By "this," do you mean the scalability of h=4 in the context of trade-off between ppl and computes? From my experience, on the datasets like Webtext, fixing h wasn't worth the saved computes, since the small h becomes more and more of a bottleneck to the performance if you increase d_model, d_ff or any other hyperparameter budget. |
@AranKomat I expressed myself badly. I meant that I am not going to scale to big compute... What I want is the best trade-off on a GTX 1070 Ti. So I am not going to scale up much, but I want the best performance, maybe sacrificing some ppl but gaining "scalability" (ridiculous, I know) at my budget size. Does that make sense? So maybe I don't need to go bigger than h=4 with my parameter count, I think.
Makes sense now :)
@lucidrains I am struggling to understand the differences between your masking implementation and the masking mechanism of the normal attention in the transformers library. The confusing part is that when I try to use your masking mechanism with the tuple, I get a slicing error here. And when I try to take the mask coming from the native transformers library, I get an error with the ~ inverse telling me that in PyTorch it only works with booleans and integers, so I assume I am doing something wrong. Could you enlighten me a little or give me some insight into what I might be doing wrong? The other point that confuses me is that you apply the mask just to the "K", while they apply it to the whole score; I assume this difference comes from the particular way the attention is computed. Again, could you enlighten me a little on how to integrate your linear attention into "transformers"? Thanks a lot!
@gaceladri Could you show me your code where you get the slicing error? The inverse error is due to the fact that in the transformers library they use floating-point 0s and 1s to denote masking, while I use booleans, so you would need to convert their mask first. Yup, because linear attention does the dot product between keys and values along the sequence dimension, we don't need to do any masking for the queries.
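(A minimal sketch of that conversion, assuming a Hugging-Face-style attention_mask of 1s and 0s; the variable names are illustrative.)

```python
import torch

# Hugging-Face-style mask: 1.0 for real tokens, 0.0 for padding
attention_mask = torch.tensor([[1., 1., 1., 0., 0.]])

# boolean mask: True for tokens to keep
bool_mask = attention_mask.bool()

# the ~ operator now works, e.g. to zero out padded keys before attention
k = torch.randn(1, 5, 64)
k = k.masked_fill(~bool_mask.unsqueeze(-1), 0.)
```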
@lucidrains Thanks a lot for your answer! Your time is very valuable. I have fixed it but not tested it yet, hence my delay in answering. Do you know the logic behind doing the masking that way, with -1000.0 instead of a simple 0 and 1 like "always"? For the moment my code is just the MobileBERT transformer with your linear attention and the RNN-former from the authors. I have added a linear dense layer with einsum to optimize performance, which worked well in TF, but I have not tested it in PyTorch. NOT TESTED! There is surely some bug.
I will continue implementing the adaptive softmax (does it make sense to you to add this to the architecture, or not, because it collides with something?). I will also keep adding the dynamic evaluation if it does not conflict with the current architecture. Since I am working on top of MobileBERT, which does distillation, I think I have to be cautious with the optimizer. Would you like to prepare something for the SustaiNLP 2020 workshop?
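(In case it helps: PyTorch ships an adaptive softmax as torch.nn.AdaptiveLogSoftmaxWithLoss. A minimal sketch with illustrative cutoffs and sizes, not tied to the MobileBERT code.)

```python
import torch
import torch.nn as nn

hidden_dim, vocab_size = 512, 30000

# frequency-ordered vocab: the head handles the most frequent tokens at full
# capacity, the tail clusters use progressively smaller projections
adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_dim,
    n_classes=vocab_size,
    cutoffs=[2000, 10000],
    div_value=4.0,
)

hidden = torch.randn(8 * 128, hidden_dim)           # flattened (batch * seq, hidden)
targets = torch.randint(0, vocab_size, (8 * 128,))  # target token ids
output = adaptive_softmax(hidden, targets)
loss = output.loss                                  # mean negative log-likelihood
```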
Do you think any of those implementations would be compatible with Transformer-XL? Thanks!