Hi all,
I'm developing a GGML backend for Tenstorrent's chips. It has been a fun project so far. I am able to get most smaller LLMs to have all of their layers uploaded to a Wormhole N300 and executing. However, the LLMs become incoherent as more layers are offloaded. After some debugging, I was able to narrow the issue down mostly to the `MUL_MAT` operation: the LLMs are very much coherent when that specific operation is disabled. Some issues remain, but it is much, much better.
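To clarify what "disabling" means here: my backend's `supports_op` hook simply refuses the op, so `ggml_backend_sched` schedules it on the CPU instead. Roughly like the sketch below (the `ggml_backend_tt_` naming is mine, and the exact `ggml_backend_i` signature depends on the ggml revision you build against):

```c
#include "ggml.h"
#include "ggml-backend.h"

// Refusing an op here makes ggml_backend_sched fall back to the CPU backend.
static bool ggml_backend_tt_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    (void) backend;
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            return false; // "disable" MUL_MAT: pretend the device cannot do it
        default:
            return true;  // everything else stays on the N300
    }
}
```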
I tried debugging through unit tests (`test-backend-ops`, plus another test I wrote to check edge cases in my backend), yet every `MUL_MAT` test shows my results match the CPU backend's. Furthermore, I hacked my backend to de-offload the `MUL_MAT` operation back onto the CPU with a very bare-bones implementation. That implementation also passes the unit tests, and further hacking shows the N300's matmul results do not differ significantly from the bare-bones code (L2 error is < 1e-5 per element).
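For context, the bare-bones fallback is essentially the textbook triple loop below, following ggml's `MUL_MAT` indexing convention. It is a float32-only sketch that ignores quantized types, broadcasting, and non-contiguous strides; a reference to compare against rather than a real implementation:

```c
#include <stdint.h>

// Reference float32 matmul in ggml's MUL_MAT convention:
// src0 is [K, M] (ne0 = K), src1 is [K, N], dst is [M, N], and
// dst[m, n] = dot(row m of src0, row n of src1).
static void ref_mul_mat_f32(const float * src0, const float * src1, float * dst,
                            int64_t K, int64_t M, int64_t N) {
    for (int64_t n = 0; n < N; n++) {
        for (int64_t m = 0; m < M; m++) {
            double acc = 0.0; // accumulate in double so the reference stays as exact as possible
            for (int64_t k = 0; k < K; k++) {
                acc += (double) src0[m*K + k] * (double) src1[n*K + k];
            }
            dst[n*M + m] = (float) acc;
        }
    }
}
```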
The results suggest `MUL_MAT` may simply be triggering a deeper bug elsewhere in my codebase, but I cannot locate it without concrete evidence. Unfortunately I cannot rely on `test_llama` and `test_falcon` in `test-backend-ops`, because the testing code does not support using the CPU as a fallback, and my backend does not yet support ALiBi in softmax or the ROPE operator.

Is there a way I can run an LLM with two different backends side by side, one using my Tenstorrent backend with the CPU as fallback and the other using just the CPU? That way I could compare the intermediate states and figure out where things start going wrong.
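Concretely, what I'm after is something like the per-node check below, built on `ggml_backend_compare_graph_backend()` from `ggml-backend.h` (the same helper `test-backend-ops` uses for its comparisons), but driven by the graphs of a real generation pass instead of synthetic test graphs. A rough sketch, assuming f32 node outputs and eliding all backend/graph setup:

```c
#include "ggml.h"
#include "ggml-backend.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Called by ggml_backend_compare_graph_backend() for each graph node after it
// has been computed on both backends: t1 from the CPU, t2 from my backend.
static bool compare_node(int node_index, struct ggml_tensor * t1, struct ggml_tensor * t2, void * user_data) {
    (void) user_data;
    if (t1->type != GGML_TYPE_F32) {
        return true; // this sketch only checks f32 outputs; skip everything else
    }
    const int64_t n = ggml_nelements(t1);
    float * a = malloc(n * sizeof(float));
    float * b = malloc(n * sizeof(float));
    ggml_backend_tensor_get(t1, a, 0, n * sizeof(float));
    ggml_backend_tensor_get(t2, b, 0, n * sizeof(float));

    double err = 0.0;
    for (int64_t i = 0; i < n; i++) {
        const double d = (double) a[i] - (double) b[i];
        err += d * d;
    }
    err = sqrt(err / (double) n); // per-element RMS error

    printf("node %4d %-12s %-32s rms err %g\n", node_index, ggml_op_desc(t1), t1->name, err);
    free(a);
    free(b);
    return err < 1e-3; // returning false stops at the first badly diverging node
}

// Usage, once both backends are initialized and a real forward-pass graph is built:
//   ggml_backend_compare_graph_backend(cpu_backend, tt_backend, graph, compare_node, NULL);
```

Returning false from the callback stops the walk at the first node that diverges, which is exactly the information I am missing right now.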
Images:
[screenshot: the backend working well when I disable `MUL_MAT`]
[screenshot: the LLM going incoherent even when I let the CPU do `MUL_MAT`; the L2 distance check does not detect anything either]
Martin