AttentionRollout ReImplementation

Other attention mechanisms in ViT:

Notation: d_model = embed_dim is the token embedding dimension, and head_dim = d_model / num_heads.
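
For example, a minimal sketch of how the embedding dimension is split across heads (the shapes below are illustrative ViT-Base-like assumptions, not taken from this repo):

```python
import torch

embed_dim, num_heads, tokens, batch = 768, 12, 197, 1   # assumed ViT-Base-like shapes
head_dim = embed_dim // num_heads                        # 64

x = torch.randn(batch, tokens, embed_dim)
# split the embedding dimension across heads:
# (batch, tokens, embed_dim) -> (batch, num_heads, tokens, head_dim)
x = x.view(batch, tokens, num_heads, head_dim).transpose(1, 2)
print(x.shape)  # torch.Size([1, 12, 197, 64])
```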

  • Hydra Attention argues for num_heads = embed_dim (so head_dim = 1) to get linear complexity in the number of tokens. Using 2 Hydra Attention encoder blocks at the back of the ViT improved accuracy while reducing FLOPs and runtime. Reimplemented by robflynnyh. Unfortunately, visualizing Hydra Attention requires different math, so we rely on their paper (Figure 3 and appendix) to discuss the different pretrained models. A minimal sketch of the operation follows this list.
  • Dilated Self-Attention, used in LongNet: also linear complexity. Reimplemented at https://github.com/alexisrozhkov/dilated-self-attention
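
For reference, a minimal sketch of the Hydra Attention operation described above (the function name and shapes are assumptions, not taken from robflynnyh's implementation):

```python
import torch
import torch.nn.functional as F

def hydra_attention(q, k, v):
    """Hydra Attention with num_heads == embed_dim (head_dim == 1).

    q, k, v: (batch, tokens, embed_dim). Keys and values are mixed into a single
    global vector first, so the cost is O(tokens * embed_dim) rather than
    O(tokens^2 * embed_dim).
    """
    q = F.normalize(q, dim=-1)                 # cosine-similarity kernel
    k = F.normalize(k, dim=-1)
    kv = (k * v).sum(dim=1, keepdim=True)      # (batch, 1, embed_dim): global mix over tokens
    return q * kv                              # (batch, tokens, embed_dim)
```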

Attention Rollout and variants

  • Attention Rollout (a minimal sketch follows this list)
  • Gradient-based Attention Rollout
  • ????
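
A minimal sketch of the first two methods, assuming the per-layer attention maps (and, for the gradient-based variant, their gradients) are captured with forward/backward hooks on the ViT; `attention_rollout` and its arguments are hypothetical names, not this repo's API:

```python
import torch

def attention_rollout(attentions, grads=None):
    """Attention Rollout sketch (Abnar & Zuidema, 2020).

    attentions: list of per-layer attention maps, each (batch, heads, tokens, tokens).
    grads: optional list of matching gradients; if given, each layer's attention is
    weighted by its gradient before fusing heads (a common way to build the
    gradient-based variant; treat this as an assumption, not this repo's method).
    """
    rollout = None
    for i, attn in enumerate(attentions):
        if grads is not None:
            attn = (attn * grads[i]).clamp(min=0)   # keep positively contributing attention
        attn = attn.mean(dim=1)                     # fuse heads -> (batch, T, T)
        eye = torch.eye(attn.size(-1), device=attn.device)
        attn = (attn + eye) / 2                     # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)
        rollout = attn if rollout is None else attn @ rollout
    # CLS-token row: how much each patch token contributes to the CLS token
    return rollout[:, 0, 1:]
```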