Hi all,
I'm developing a GGML backend for Tenstorrent's chips. It has been a fun project so far. I am able to get most smaller LLMs to have all of their layers uploaded to a Wormhole N300 and executing. However, the LLMs become incoherent as more layers are offloaded. After some debugging, I was able to narrow the issue down mostly to the `MUL_MAT` operation: the LLMs are very much coherent when that specific operation is disabled. Some issues remain, but it is much, much better.
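To clarify what "disabling" means here: my backend's `supports_op` hook simply refuses the op, so `ggml_backend_sched` schedules it on the CPU instead. Roughly like the sketch below (the `ggml_backend_tt_` naming is mine, and the exact `ggml_backend_i` signature depends on the ggml revision you build against):

```c
#include "ggml.h"
#include "ggml-backend.h"

// Refusing an op here makes ggml_backend_sched fall back to the CPU backend.
static bool ggml_backend_tt_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    (void) backend;
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            return false; // "disable" MUL_MAT: pretend the device cannot do it
        default:
            return true;  // everything else stays on the N300
    }
}
```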
I tried debugging through unit tests (`test-backend-ops`, plus another test I wrote to check edge cases in my backend), yet every `MUL_MAT` test shows my results match the CPU backend's. Furthermore, I hacked my backend to de-offload the `MUL_MAT` operation back onto the CPU with a very bare-bones implementation. That implementation also passes the unit tests, and further hacking shows the N300's matmul results do not differ significantly from the bare-bones code (L2 error is < 1e-5 per element).
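For context, the bare-bones fallback is essentially the textbook triple loop below, following ggml's `MUL_MAT` indexing convention. It is a float32-only sketch that ignores quantized types, broadcasting, and non-contiguous strides; a reference to compare against rather than a real implementation:

```c
#include <stdint.h>

// Reference float32 matmul in ggml's MUL_MAT convention:
// src0 is [K, M] (ne0 = K), src1 is [K, N], dst is [M, N], and
// dst[m, n] = dot(row m of src0, row n of src1).
static void ref_mul_mat_f32(const float * src0, const float * src1, float * dst,
                            int64_t K, int64_t M, int64_t N) {
    for (int64_t n = 0; n < N; n++) {
        for (int64_t m = 0; m < M; m++) {
            double acc = 0.0; // accumulate in double so the reference stays as exact as possible
            for (int64_t k = 0; k < K; k++) {
                acc += (double) src0[m*K + k] * (double) src1[n*K + k];
            }
            dst[n*M + m] = (float) acc;
        }
    }
}
```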
The results suggest `MUL_MAT` may simply be triggering a deeper bug elsewhere in my codebase, but I cannot locate it without concrete evidence. Unfortunately I cannot rely on `test_llama` and `test_falcon` in `test-backend-ops`, because the testing code does not support using the CPU as a fallback, and my backend does not yet support ALiBi in softmax or the ROPE operator.

Is there a way I can run an LLM with two different backends side by side, one using my Tenstorrent backend with the CPU as fallback and the other using just the CPU? That way I could compare the intermediate states and figure out where things start going wrong.
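Concretely, what I'm after is something like the per-node check below, built on `ggml_backend_compare_graph_backend()` from `ggml-backend.h` (the same helper `test-backend-ops` uses for its comparisons), but driven by the graphs of a real generation pass instead of synthetic test graphs. A rough sketch, assuming f32 node outputs and eliding all backend/graph setup:

```c
#include "ggml.h"
#include "ggml-backend.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Called by ggml_backend_compare_graph_backend() for each graph node after it
// has been computed on both backends: t1 from the CPU, t2 from my backend.
static bool compare_node(int node_index, struct ggml_tensor * t1, struct ggml_tensor * t2, void * user_data) {
    (void) user_data;
    if (t1->type != GGML_TYPE_F32) {
        return true; // this sketch only checks f32 outputs; skip everything else
    }
    const int64_t n = ggml_nelements(t1);
    float * a = malloc(n * sizeof(float));
    float * b = malloc(n * sizeof(float));
    ggml_backend_tensor_get(t1, a, 0, n * sizeof(float));
    ggml_backend_tensor_get(t2, b, 0, n * sizeof(float));

    double err = 0.0;
    for (int64_t i = 0; i < n; i++) {
        const double d = (double) a[i] - (double) b[i];
        err += d * d;
    }
    err = sqrt(err / (double) n); // per-element RMS error

    printf("node %4d %-12s %-32s rms err %g\n", node_index, ggml_op_desc(t1), t1->name, err);
    free(a);
    free(b);
    return err < 1e-3; // returning false stops at the first badly diverging node
}

// Usage, once both backends are initialized and a real forward-pass graph is built:
//   ggml_backend_compare_graph_backend(cpu_backend, tt_backend, graph, compare_node, NULL);
```

Returning false from the callback stops the walk at the first node that diverges, which is exactly the information I am missing right now.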
Images:
[screenshot: the backend working well when I disable `MUL_MAT`]
[screenshot: the LLM going incoherent even when I let the CPU do `MUL_MAT`; the L2 distance check does not detect anything either]
Martin