Buy hardware | Install | Discord

TT-NN is a Python and C++ neural network op library.

API Reference | Model Demos

Grayskull (GS) Models

Model	Batch	End-to-end throughput [1]	Device throughput [2]	Target
ResNet-50 (fps)	20	2,850	7,200	10,000
BERT-Large (sen/s)	12	362	406	410
Falcon7B-decode (t/s)	32	135	135	140
ViT (fps)	8	480	1,570	2,000
T5 small (sen/s)		140
Bloom (sen/s)		70
U-Net	coming soon

[1] - Observed from the host. Includes dispatch overhead and kernel execution time.

[2] - Ignoring host overhead. Kernel execution time only.
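The two throughput columns can be combined to estimate how much of the end-to-end time is spent outside the kernels. The sketch below assumes that dispatch and kernel execution do not overlap, so the per-item end-to-end time is the sum of host overhead and kernel time. That serial model is an assumption made here for illustration and is not stated by the table, and the function name `host_overhead_fraction` is hypothetical.

```python
def host_overhead_fraction(end_to_end: float, device: float) -> float:
    """Estimate the share of end-to-end time spent on host/dispatch overhead.

    Both arguments are throughputs in the same unit (for example fps).
    Assumes host overhead and kernel execution are strictly serial.
    """
    if end_to_end <= 0 or device <= 0:
        raise ValueError("throughputs must be positive")
    if end_to_end > device:
        raise ValueError("end-to-end throughput cannot exceed device throughput")
    time_total = 1.0 / end_to_end   # seconds per item, observed from the host
    time_kernel = 1.0 / device      # seconds per item, kernel execution only
    return (time_total - time_kernel) / time_total


# ResNet-50 on Grayskull, using the figures from the table above.
print(f"{host_overhead_fraction(2850, 7200):.0%}")
```

Under this model, roughly 60% of the ResNet-50 end-to-end time would be host overhead, while Falcon7B-decode (135 in both columns) would show none.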

Wormhole (WH) Models

Note

All model demos in this table function on both N150 and N300 Wormhole cards, unless otherwise stated.

Model	Gen. Token [3]	Batch	End-to-end throughput [1]	Device throughput [2]	Target
Falcon7B-decode	129th	32	11.6 t/s/u - 371 t/s	15.4 t/s/u - 493 t/s	21 t/s/u
Mistral-7B-decode	33rd	32	10.9 t/s/u - 349 t/s	13.3 t/s/u - 426 t/s	21 t/s/u
Mamba-2.8B-decode	any	32	9.2 t/s/u - 295 t/s	13.1 t/s/u - 419 t/s	22 t/s/u
BERT-Large (sen/s) [4]	any	8	270	340	400
Stable Diffusion 1.4 512x512 (sec/img)		1	8s	5s

[1] - Observed from the host. Includes dispatch overhead and kernel execution time.

[2] - Ignoring host overhead. Kernel execution time only.

[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.

[4] - This model demo does not work on N150. It does work on N300.
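The decode rows report throughput both per user (t/s/u) and in aggregate (t/s). The aggregate figure appears to be the per-user figure multiplied by the batch size, with small differences that are presumably due to rounding of the published per-user value. A minimal sketch of that conversion, using a hypothetical helper name:

```python
def aggregate_tokens_per_second(per_user: float, batch: int) -> float:
    """Convert tokens/s/user to aggregate tokens/s for a given batch size."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    return per_user * batch


# Falcon7B-decode on Wormhole: 11.6 t/s/u at batch 32.
print(round(aggregate_tokens_per_second(11.6, 32)))  # 371, matching the table
```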

T3000 (2x4 mesh of WHs) Models

Model	Technique	Gen. Token [3]	Batch	End-to-end throughput [1]	Device throughput [2]	Target
Falcon7B-decode	Data Parallel	129th	256	4.4 t/s/u - 1114 t/s	coming soon	21 t/s/u
LLaMA-2-70B-decode	Tensor Parallel	129th	32	8.5 t/s/u - 272 t/s	13.9 t/s/u - 445 t/s	20 t/s/u
LLaMA-3-70B-decode	Tensor Parallel	129th	32	8.1 t/s/u - 257 t/s	13.9 t/s/u - 445 t/s	20 t/s/u
Falcon40B-decode	Tensor Parallel	129th	32	1.5 t/s/u - 48 t/s	14.0 t/s/u - 448 t/s	30 t/s/u
Mixtral7Bx8-decode	Tensor Parallel	129th	32	7.0 t/s/u - 225 t/s	27.0 t/s/u - 864 t/s	28 t/s/u
ResNet50	Data Parallel	coming soon
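In the data-parallel Falcon7B row, the batch of 256 is consistent with 32 users on each of the 8 devices in the 2x4 mesh, the same per-device batch used on a single Wormhole card. The sketch below illustrates that kind of even batch split in plain Python. The helper `split_batch` is hypothetical and is not part of TT-NN, and the even split is an assumption inferred from the numbers in the table.

```python
def split_batch(items: list, num_devices: int) -> list[list]:
    """Split a batch into equal contiguous shards, one per device."""
    if num_devices < 1 or len(items) % num_devices != 0:
        raise ValueError("batch size must be a positive multiple of num_devices")
    shard = len(items) // num_devices
    return [items[i * shard:(i + 1) * shard] for i in range(num_devices)]


users = list(range(256))            # 256 user sequences
shards = split_batch(users, 8)      # 2x4 mesh -> 8 devices
print(len(shards), len(shards[0]))  # 8 32
```

Tensor-parallel rows differ in that each device would hold a slice of the model's weights rather than a slice of the batch, which is why their batch stays at 32.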

Using TT-NN ops and tensors

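import ttnn
import torch

# Open device 0; the context manager closes it when the block exits.
with ttnn.manage_device(device_id=0) as device:
    # Host tensors; b has shape (1, 7) and broadcasts across the rows of a.
    a = torch.ones((5, 7))
    b = torch.ones((1, 7))

    # Move both tensors to the device as bfloat16 in tile layout.
    a = ttnn.from_torch(a, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    b = ttnn.from_torch(b, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)

    # Add on the device, then copy the result back into a torch tensor.
    output = a + b
    output = ttnn.to_torch(output)

print(output)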

TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.

Programming Guide | API Reference

Getting started

Get started with simple kernels.
