I was wondering whether and how ThunderKittens can help with writing cross-GPU-portable code. I noticed that the kernels in `examples/based/linear_attn_forward` for the 4090 (also used for the A100) and for the H100 look very different when compared with `diff`, but the code seems semantically quite similar and appears to have simply been written at different times, or to have diverged over time. Is it reasonable to assume that, with some care and a few `#ifdef`s, this code could be made portable? Is portable code a use case for ThunderKittens at all, or does your intended use favor writing highly efficient code for just one GPU architecture?
Assuming portability is feasible at all: if I wanted to write abstracted, portable, yet highly efficient code with ThunderKittens, what specific things should I look out for?
Thanks in advance and for the project!