NaN error when using a GPU with no support for igemmlt #165
The only fix so far is to lower the threshold for int8, like they did here: https://gist.github.com/whjms/2505ef082a656e7a80a3f663c16f4277. It's still buggy and a bit slow.
Thank you @Ph0rk0z, I was not aware of that gist. I'll try whether that works on AMD as well.
It does resolve the RuntimeError, that's great! But when trying to use 8-bit for inference on larger models it seems to use a ton of VRAM, negating any advantage from switching to 8-bit. Strange.
It's because it keeps a lot of the data in FP16, and that shows up when generating. POOF, and you are back to out of memory again. For me this was slower than offloading or FlexGen. Either this is a hardware problem or a bug, and the dev appears to be busy.
This worked for me as a band-aid to run inference on OPT-66B, but I don't understand exactly what changing the threshold does. I'm assuming that by lowering the threshold we are increasing the number of weights that are considered large outliers, thus converting less of the model into int8? If so, what's the default threshold?
I assume the default is 1.0.
No, I think it corresponds to the LLM.int8() outlier threshold, which defaults to 6.0.
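For reference, lowering that threshold does not require patching bitsandbytes by hand; a minimal sketch using the transformers side (assuming a transformers version that exposes BitsAndBytesConfig; the 1.5 value and opt-6.7b model are just examples from this thread):

```python
# Sketch only: lower the LLM.int8() outlier threshold via transformers.
# The default is 6.0; a lower value routes more activation columns to fp16.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=1.5,  # lower than the 6.0 default
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    device_map="auto",
    quantization_config=quant_config,
)
```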
Pretty cool. It seems AMD and Nvidia Pascal will be back in business soon anyway, when 4-bit gets released. Pascal supports DP4A, and so do AMD Vega 20 and the 6000 series. Looking forward to it.
Can confirm as a person with a Pascal card -- 4-bit works great on it. Llama 30B isn't a problem and is pretty fast. OPT also works but runs slowly.
I came across a similar problem when finetuning Llama 7B: the hidden states became inf at LlamaMLP (specifically, down_proj). I used a V100 with device capability 7.0, so igemmlt is naturally not supported. I then traced it to the matmul on the non-igemmlt path in _functions.py.

As I understand it, state.CB ranges between -127 and 127 and is much larger than A_wo_outliers (which is confined by the threshold of 6.0). Wouldn't it be safer to calculate (dequantize) CB first and then do the matmul? Is the current order designed to prevent underflow? I also notice that CB is calculated first in the backward pass (line 455).
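To make the overflow concern above concrete, here is a small artificial illustration (not the bitsandbytes code; the hidden size, the 0.5 scale, and the worst-case values are made up to force the effect):

```python
import torch

hidden = 4096
device = "cuda"  # fp16 matmul, so run on a GPU

# Worst-case magnitudes: activations at the 6.0 outlier threshold,
# int8 weight codes stored as fp16 at their maximum of 127.
A_wo_outliers = torch.full((1, hidden), 6.0, dtype=torch.float16, device=device)
CB = torch.full((hidden, hidden), 127.0, dtype=torch.float16, device=device)
SCB = torch.full((hidden,), 0.5, dtype=torch.float16, device=device)  # per-row absmax scales (made up)

# Matmul first, rescale after: 6 * 127 * 4096 is about 3.1e6, which
# overflows fp16 (max ~65504), so the output is inf and later NaN.
out_late_scale = torch.nn.functional.linear(A_wo_outliers, CB) * (SCB / 127.0)

# Dequantize CB first, then matmul: intermediates stay in fp16 range.
CB_deq = CB * SCB.unsqueeze(1) / 127.0
out_early_scale = torch.nn.functional.linear(A_wo_outliers, CB_deq)

print(torch.isinf(out_late_scale).any().item())   # True
print(torch.isinf(out_early_scale).any().item())  # False
```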
@richardwth Hi Richard, I'm facing the same problem. Did you solve this bug?
I edited /site-packages/bitsandbytes/autograd/_functions.py, first at line 406 and then at line 468, and now pythia-12b in 8-bit at a 1.5 threshold no longer NaNs on me. I then switched back to the full 6.0 threshold and ran inference again! @richardwth you are a hero, you fixed this bug and nobody noticed! wahoo! #335
It's very informative! May I know how you found this out?
Do we really need to correct line 468? Based on Richard's finding, we only need to correct line 406, right?
Try with just one and see if you get the error. I'm not sure what they ended up doing in the latest version. Does it just work now after being rewritten?
In version 0.38.1, in the _functions.py script, lines 410-411 are replaced, the second part is left untouched, and it works fine.
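For readers who cannot diff the versions, here is a rough sketch of what a "dequantize CB before the matmul" change looks like, following Richard's reasoning above (a hypothetical helper, not the actual lines from _functions.py):

```python
import torch

def fallback_matmul_dequant_first(A_wo_outliers, CB_int8, SCB):
    """Hypothetical sketch of the non-igemmlt fallback with early dequantization.

    A_wo_outliers: fp16 activations with outlier columns zeroed, (batch, in_features)
    CB_int8:       int8 weight codes, (out_features, in_features)
    SCB:           per-row absmax scales of the original fp16 weights, (out_features,)
    """
    # Scale the int8 codes back to fp16 *before* the matmul so the
    # intermediate products never approach fp16's maximum (~65504).
    CB = CB_int8.to(A_wo_outliers.dtype) * (SCB.to(A_wo_outliers.dtype).unsqueeze(1) / 127.0)
    return torch.nn.functional.linear(A_wo_outliers, CB)
```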
Inspired by "Finding source of NaN in forward pass", I used a script to trace the source of the NaN, but I don't think it is very good. Do you have a better method, @richardwth?
@zhaoqf123 Hi buddy, sorry for the late reply. I did not use any advanced methods like the one you used here; I manually inserted breakpoints and checked the intermediate tensors by hand.
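For what it's worth, one common lightweight alternative to breakpoints is registering forward hooks that flag the first module whose output contains inf or NaN; a sketch of that approach (not the script referenced above):

```python
import torch

def attach_nan_tracer(model):
    """Register forward hooks that report the first module producing inf/NaN."""
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite output first seen in: {name} ({module.__class__.__name__})")
                    raise RuntimeError(f"non-finite output in {name}")
        return hook

    # Keep the handles so the hooks can be removed later with handle.remove().
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
```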
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I get

RuntimeError: probability tensor contains either inf, nan or element < 0

on most language models when trying to run them in 8-bit. I adapted a script made by lorr1 in #42 (comment) into a small script that first runs the model using 8-bit with igemmlt and then disables the support for igemmlt and runs it again. I tested this on an RTX 3060, and the result is the RuntimeError when running without igemmlt. I think there is a bug in the code that replaces igemmlt on older GPUs.

Interestingly, it works on some models, like EleutherAI/pythia-70m-deduped, EleutherAI/gpt-neo-125M, and facebook/opt-6.7b, but on most others it fails with the RuntimeError. When run with EleutherAI/pythia-410m-deduped, it fails with this RuntimeError.

@Ph0rk0z in #131 (comment) also ran into this issue.
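For context, a minimal sketch of the kind of reproduction described above (assuming the transformers load_in_8bit path and one of the failing models named in the report; the part of the original script that forces the non-igemmlt fallback on newer GPUs is not reproduced here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-410m-deduped"  # one of the failing models from the report
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)

inputs = tok("Hello, my name is", return_tensors="pt").to(model.device)
try:
    out = model.generate(**inputs, max_new_tokens=20, do_sample=True)
    print(tok.decode(out[0], skip_special_tokens=True))
except RuntimeError as err:
    # On GPUs without igemmlt support this is where
    # "probability tensor contains either inf, nan or element < 0" surfaces.
    print("generation failed:", err)
```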