Model | Batch | End-to-end throughput [1] | Device throughput [2] | Target
---|---|---|---|---
ResNet-50 (fps) | 20 | 2,850 | 7,200 | 10,000
BERT-Large (sen/s) | 12 | 362 | 406 | 410
Falcon7B-decode (t/s) | 32 | 135 | 135 | 140
ViT (fps) | 8 | 480 | 1,570 | 2,000
T5 small (sen/s) | | 140 | |
Bloom (sen/s) | | 70 | |
U-Net | coming soon | | |
[1] - Observed from the host. Includes dispatch overhead and kernel execution time.
[2] - Ignoring host overhead. Kernel execution time only.
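The two throughput columns differ only in what the timer includes. A minimal sketch of the host-side measurement (the `model_fn` callable and `batch` input are hypothetical placeholders, not a TT-NN API; device throughput instead comes from on-device profiling of kernel execution alone):

```python
import time

def end_to_end_throughput(model_fn, batch, iterations=100):
    """Samples/sec as observed from the host: wall-clock time that
    includes dispatch overhead as well as kernel execution."""
    start = time.perf_counter()
    for _ in range(iterations):
        model_fn(batch)  # one full forward pass per iteration
    elapsed = time.perf_counter() - start
    return iterations * len(batch) / elapsed
```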
> **Note:** All model demos in this table function on both N150 and N300 Wormhole cards, unless otherwise stated.
Model | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target
---|---|---|---|---|---
Falcon7B-decode | 129th | 32 | 11.6 t/s/u - 371 t/s | 15.4 t/s/u - 493 t/s | 21 t/s/u
Mistral-7B-decode | 33rd | 32 | 10.9 t/s/u - 349 t/s | 13.3 t/s/u - 426 t/s | 21 t/s/u
Mamba-2.8B-decode | any | 32 | 9.2 t/s/u - 295 t/s | 13.1 t/s/u - 419 t/s | 22 t/s/u
BERT-Large (sen/s) [4] | any | 8 | 270 | 340 | 400
Stable Diffusion 1.4 512x512 (sec/img) | | 1 | 8s | 5s |
[1] - Observed from the host. Includes dispatch overhead and kernel execution time.
[2] - Ignoring host overhead. Kernel execution time only.
[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.
[4] - This model demo does not work on N150. It does work on N300.
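Decode throughput is reported both per user (t/s/u) and aggregated over all concurrent users (t/s); the two differ by exactly the batch size. Checking the Falcon7B-decode row above:

```python
batch = 32     # concurrent users
t_s_u = 11.6   # end-to-end tokens/sec/user from the table
print(f"{t_s_u * batch:.0f} t/s")  # -> 371 t/s, the aggregate figure
```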
Model | Technique | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target
---|---|---|---|---|---|---
Falcon7B-decode | Data Parallel | 129th | 256 | 4.4 t/s/u - 1,114 t/s | coming soon | 21 t/s/u
LLaMA-2-70B-decode | Tensor Parallel | 129th | 32 | 8.5 t/s/u - 272 t/s | 13.9 t/s/u - 445 t/s | 20 t/s/u
LLaMA-3-70B-decode | Tensor Parallel | 129th | 32 | 8.1 t/s/u - 257 t/s | 13.9 t/s/u - 445 t/s | 20 t/s/u
Falcon40B-decode | Tensor Parallel | 129th | 32 | 1.5 t/s/u - 48 t/s | 14.0 t/s/u - 448 t/s | 30 t/s/u
Mixtral7Bx8-decode | Tensor Parallel | 129th | 32 | 7.0 t/s/u - 225 t/s | 27.0 t/s/u - 864 t/s | 28 t/s/u
ResNet50 | Data Parallel | coming soon | | | |
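Data parallel replicates the full model on every chip and splits the users across them, while tensor parallel shards each weight matrix so all chips cooperate on every token. A toy sketch of the underlying math in plain PyTorch, with made-up shapes (on real hardware the `torch.cat` calls are collective ops across chips):

```python
import torch

batch, hidden, n_devices = 32, 128, 8   # illustrative sizes only
x = torch.randn(batch, hidden)          # activations for 32 users
w = torch.randn(hidden, hidden)         # one weight matrix

# Tensor parallel: shard the weights column-wise; every device works
# on every user, each producing a slice of the output features.
partials = [x @ shard for shard in w.chunk(n_devices, dim=1)]
y_tensor_parallel = torch.cat(partials, dim=1)

# Data parallel: replicate the weights; each device handles a slice
# of the users with the full matrix.
y_data_parallel = torch.cat([xs @ w for xs in x.chunk(n_devices, dim=0)], dim=0)

reference = x @ w
assert torch.allclose(y_tensor_parallel, reference, atol=1e-5)
assert torch.allclose(y_data_parallel, reference, atol=1e-5)
```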
TT-NN is our Python & C++ neural network op library. A minimal example, adding two tensors on a Tenstorrent device through the `ttnn` Python API:

```python
import ttnn
import torch

with ttnn.manage_device(device_id=0) as device:
    # Create two host tensors; b will broadcast across a's rows
    a = torch.ones((5, 7))
    b = torch.ones((1, 7))

    # Move both to the device as bfloat16 tensors in tile layout
    a = ttnn.from_torch(a, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    b = ttnn.from_torch(b, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)

    # The addition runs on the Tenstorrent device
    output = a + b

    # Copy the result back to a host torch tensor
    output = ttnn.to_torch(output)

print(output)
```
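Since `b` has shape (1, 7), it broadcasts across the five rows of `a`, so the printed result is a 5x7 tensor of 2s, computed in bfloat16 on the device.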
TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.
Get started with simple kernels.