Model Updates

Note: Please refer to the front-page README for the latest verified release for each model.

September 9, 2024

Note: This feature is available as of release v0.52.0-rc1

  • Added support for any user prompt size up to a maximum of 32k tokens

August 26, 2024

  • Added data parallel demo for a single Galaxy (32 chips)
  • Refactored all modules and tests to use ttnn multi-device tensors

Note: These features are available as of release v0.51.0-rc33

  • Added multi-batching support to the demo for running multiple batches of users consecutively
  • Improved end-to-end performance through optimizations to the attention mask in flash decoding

August 12, 2024

  • Added support for flash decoding
  • Updated the demo to support multiple batches of users
  • Updated the demo to process the prompt with the full prefill graph instead of running it through decode one token at a time
  • Added support for decode with 32K context length using flash decoding
  • Fused mixture of experts into a single operation using ttnn.moe
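
For readers unfamiliar with flash decoding, the following is a minimal pure-Python sketch of the general idea, not the ttnn implementation. The key/value cache is split into chunks, each chunk produces a partial softmax result, and the partials are merged using their running maxima. The function names (`attend_chunk`, `flash_decode`, `naive_attention`) and the `chunk_size` parameter are hypothetical and exist only for illustration. On hardware the chunks would be processed in parallel, whereas this sketch walks them sequentially.

```python
import math


def attend_chunk(q, keys, values, scale):
    """Attend one query to one KV chunk.

    Returns (max_score, sum_of_exps, unnormalized_output) so that
    partial results from several chunks can be merged later.
    """
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # stabilized by the chunk max
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), out


def flash_decode(q, keys, values, chunk_size):
    """Single-token attention computed chunk by chunk over the KV cache."""
    scale = 1.0 / math.sqrt(len(q))
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size], scale)
        for i in range(0, len(keys), chunk_size)
    ]
    # Rescale every partial result to a shared maximum before summing.
    global_max = max(m for m, _, _ in partials)
    dim = len(values[0])
    total = 0.0
    out = [0.0] * dim
    for m, denom, part in partials:
        factor = math.exp(m - global_max)
        total += factor * denom
        for d in range(dim):
            out[d] += factor * part[d]
    return [o / total for o in out]


def naive_attention(q, keys, values):
    """Reference softmax attention over the whole cache at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]
```

Because the merge step only rescales and sums the per-chunk results, the output matches plain softmax attention up to floating-point error for any chunk size. The decode step can therefore be spread across a long context without changing the result.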

July 29, 2024

  • Added support for Llama 3.1 8B
  • Runs fast prefill for sequence lengths of up to 512 tokens
  • Supports a maximum context length of 8K tokens
  • Added support for Llama 3.1 70B, including its new scaled rotary position embeddings
  • Prefill and decode now support 8K context length with batch size 16
  • Added prefill support for 4K context length, using scaled dot product attention
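
As background on the scaled rotary position embeddings mentioned above, here is a small sketch of the frequency-scaling rule as it is commonly described for Llama 3.1. It is not code from this repository. The constants (`SCALE_FACTOR`, `LOW_FREQ_FACTOR`, `HIGH_FREQ_FACTOR`, `ORIGINAL_CONTEXT_LEN`) and the default `theta` are assumptions based on the publicly released reference settings and should be checked against the model configuration in use. The function names are hypothetical.

```python
import math

# Assumed values; verify against the actual model configuration.
SCALE_FACTOR = 8.0
LOW_FREQ_FACTOR = 1.0
HIGH_FREQ_FACTOR = 4.0
ORIGINAL_CONTEXT_LEN = 8192


def base_frequencies(head_dim, theta=500000.0):
    """Standard rotary frequencies, one per pair of head dimensions."""
    return [1.0 / (theta ** (i / head_dim)) for i in range(0, head_dim, 2)]


def scale_frequencies(freqs):
    """Leave short wavelengths alone, shrink long ones, blend in between."""
    low_freq_wavelen = ORIGINAL_CONTEXT_LEN / LOW_FREQ_FACTOR
    high_freq_wavelen = ORIGINAL_CONTEXT_LEN / HIGH_FREQ_FACTOR
    scaled = []
    for freq in freqs:
        wavelen = 2.0 * math.pi / freq
        if wavelen < high_freq_wavelen:
            # High-frequency components are kept unchanged.
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency components are divided by the scale factor.
            scaled.append(freq / SCALE_FACTOR)
        else:
            # Interpolate smoothly between the two regimes.
            smooth = (ORIGINAL_CONTEXT_LEN / wavelen - LOW_FREQ_FACTOR) / (
                HIGH_FREQ_FACTOR - LOW_FREQ_FACTOR
            )
            scaled.append((1.0 - smooth) * freq / SCALE_FACTOR + smooth * freq)
    return scaled
```

For example, `scale_frequencies(base_frequencies(128))` returns one scaled frequency per rotary pair. The fastest-rotating pairs keep their original frequency, while the slowest are stretched by the scale factor.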