Potential optimizations #4

Open · 3 of 8 tasks

justheuristic opened this issue Mar 15, 2022 · 0 comments
justheuristic commented Mar 15, 2022

  • In reversible mode, one can further save memory by computing the backward pass in chunks (see the first sketch after this list):
    • a few tokens at a time for feedforward layers, since grad(concat(mlp(x1), mlp(x2))) = concat(grad(mlp(x1)), grad(mlp(x2)))
    • a few queries at a time for self-attention, since grad(head1 + head2) = grad(head1) + grad(head2), where head1 and head2 are attention outputs after linear projection
  • improved checkpointing (see the checkpoint_sequential sketch after this list):
    • allow the user to specify the number of checkpoints, as in checkpoint_sequential
    • do not rematerialize the last layer, as in checkpoint_sequential
    • optionally cast checkpoints to a lower precision, as in revlib
  • compacted params
    • compacted layernorms, biases
    • compacted adapters
  • Attention could be computed in O(sqrt(n)) memory (Rabe et al., 2021), but this may be overkill; a chunked-attention sketch follows the list.
  • sparse or linear attention: these are great for very long sequences. However, for large models, attention is not a bottleneck in typical NLP and vision tasks (tested GPT-3 up to sequence length 4096).
  • Per-block grad scaling as described in Ramesh et al. (2021) - we currently rely on Sandwich Norm to maintain stability up to 96 layers (we did not test deeper). Per-block scaling would avoid the need for an extra LayerNorm; a sketch follows the list.
  • Something else that we missed - please find us on Discord.
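
As a rough illustration of the chunked backward idea for feedforward layers: since the MLP is applied token-wise, its backward pass can rematerialize activations and accumulate gradients a few tokens at a time. A minimal PyTorch sketch; the `chunked_mlp_backward` helper and `chunk_size` are hypothetical, not part of this repo:

```python
import torch

def chunked_mlp_backward(mlp: torch.nn.Module, x: torch.Tensor,
                         grad_output: torch.Tensor, chunk_size: int = 1024):
    """Backward through a token-wise MLP a few tokens at a time.

    Valid because grad(concat(mlp(x1), mlp(x2))) = concat(grad(mlp(x1)), grad(mlp(x2))).
    Parameter gradients accumulate into mlp's .grad buffers as usual.
    """
    flat_x = x.reshape(-1, x.shape[-1])
    flat_grad = grad_output.reshape(-1, grad_output.shape[-1])
    input_grads = []
    for x_chunk, g_chunk in zip(flat_x.split(chunk_size), flat_grad.split(chunk_size)):
        x_chunk = x_chunk.detach().requires_grad_(True)
        out_chunk = mlp(x_chunk)      # rematerialize activations for this chunk only
        out_chunk.backward(g_chunk)   # this chunk's activations are freed right away
        input_grads.append(x_chunk.grad)
    return torch.cat(input_grads).reshape_as(x)
```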
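The first two checkpointing items above already exist in torch.utils.checkpoint.checkpoint_sequential: the user picks the number of segments, and the final segment runs without checkpointing, so the last layers are never rematerialized. A usage sketch (the model and its dimensions are made up for illustration):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy stack of transformer layers; sizes here are arbitrary.
model = nn.Sequential(*[nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(12)])
x = torch.randn(128, 16, 512, requires_grad=True)  # (seq, batch, dim)

# Keep activations only at 4 segment boundaries; everything in between is
# recomputed during backward. The last segment is run without checkpointing,
# so the final layers are never rematerialized.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```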
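For the O(sqrt(n))-memory attention of Rabe et al. (2021), the core trick is processing keys and values in chunks while keeping running softmax statistics, so the full n-by-n attention matrix is never materialized (the full algorithm also chunks queries and checkpoints each chunk). A minimal single-head sketch, with no masking or dropout:

```python
import torch

def chunked_attention(q, k, v, key_chunk_size: int = 1024):
    """Softmax attention over key/value chunks with running softmax statistics.
    Shapes: q is (..., n_q, d); k and v are (..., n_k, d)."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((*q.shape[:-1], 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros_like(row_max)
    for k_chunk, v_chunk in zip(k.split(key_chunk_size, dim=-2), v.split(key_chunk_size, dim=-2)):
        scores = (q @ k_chunk.transpose(-2, -1)) * scale   # (..., n_q, chunk)
        new_max = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))
        correction = torch.exp(row_max - new_max)          # rescale old statistics
        probs = torch.exp(scores - new_max)
        row_sum = row_sum * correction + probs.sum(dim=-1, keepdim=True)
        out = out * correction + probs @ v_chunk
        row_max = new_max
    return out / row_sum
```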
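For per-block grad scaling, Ramesh et al. (2021) give each resblock its own gradient scale so that fp16 gradients stay in range. A rough sketch of the activation-side plumbing (`GradScale` and `per_block_grad_scaling` are hypothetical names; parameter gradients inside the block still come out multiplied by `scale` and must be divided out before the optimizer step, which this sketch does not show):

```python
import torch

class GradScale(torch.autograd.Function):
    """Identity in forward; multiplies the incoming gradient by `scale` in backward."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.scale, None

def per_block_grad_scaling(block, x, scale: float):
    # Gradients flowing backward are multiplied by `scale` just before they
    # enter the block and divided back out just after they leave it, so each
    # block can use its own scale without affecting the rest of the network.
    x = GradScale.apply(x, 1.0 / scale)   # in backward, runs second: divides grads back out
    out = block(x)
    return GradScale.apply(out, scale)    # in backward, runs first: scales grads entering the block
```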