Try to build model gemma2, but failed. #692
After I edited builder.py line 1607 like this:
It can continue past that point, but it still crashes with no error message at line 1643:
This is the log:
I am working on an improved method to load these large models.
Thanks for your answer. My computer has 64 GB of RAM, so I am trying this on another workstation with 256 GB of RAM and an Nvidia A100 now. I hope it works.
It worked for me. It took more than 200 GB of RAM to convert Gemma-2 27B to fp16 ONNX for GenAI.
I tried to run this model but failed. Another problem: model.onnx.data is larger than 52 GB. Is there any way I can split this file into smaller pieces, for example 10 GB per file?
Great to hear that it worked!
I added the ...
You can save the weights as one file, or save each weight that is larger than a certain size threshold in its own file, using the ONNX external data API.
Note that if you change the filename when saving the ONNX model, you will need to update the filename in genai_config.json as well.
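The code references in the comment above did not survive extraction, so here is a minimal sketch of the approach it describes, using `onnx.save_model` with external data. The filenames are assumptions; with `all_tensors_to_one_file=False`, each initializer above `size_threshold` bytes goes into its own file instead of one monolithic `model.onnx.data`. Note this splits per tensor, not into fixed-size 10 GB chunks.

```python
# A minimal sketch, assuming the filenames below. Re-save the exported model
# with external data split per tensor instead of one large model.onnx.data.
# Loading the 52+ GB model still requires enough RAM to hold all weights.
import onnx

model = onnx.load("model.onnx")  # also pulls in weights from model.onnx.data

onnx.save_model(
    model,
    "model_split.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=False,  # one external file per large initializer
    size_threshold=1024,            # tensors smaller than this stay inside the .onnx
    convert_attribute=False,
)
```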
I found a very strange phenomenon. I converted gemma2-9b and gemma2-27b into ONNX using the same instructions, and there were no errors during the process. However, when testing with the following code, gemma2-9b-cuda-onnx works fine, but the result generated by gemma2-27b-cuda-onnx is empty.
This is the test code:
I debugged this code and found that new_token was always 0 when I use cuda/gemma-2-27b.
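The original test snippet is not preserved in this thread. For reference, a token-by-token loop along the lines of the onnxruntime-genai Python examples of that era looks roughly like the sketch below; the model folder and prompt are placeholders, and the exact API (e.g. `params.input_ids` vs. the newer `generator.append_tokens`) varies by version. With Gemma's tokenizer, id 0 is the `<pad>` token, which decodes to nothing, so a stream of zeros produces exactly the empty output described above.

```python
# Rough sketch of a typical onnxruntime-genai generation loop (0.3/0.4-era API);
# model folder and prompt are assumptions.
import onnxruntime_genai as og

model = og.Model("gemma-2-27b-cuda-onnx")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("Why is the sky blue?")

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]  # reported to always be 0 for the 27B model
    print(stream.decode(new_token), end="", flush=True)
```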
I also tried using C# to run inference with the converted model. gemma-2-9b works fine, but gemma-2-27b, just like in Python, returns empty results. So I re-ran the command:
for conversion. I had suspected it might be an issue with my fine-tuned model. However, after re-converting the original Google model, I found the problem still persists. It seems there is a bug in builder.py when converting large models.
I'm able to reproduce this behavior with Gemma-2 27B. I took a quick glance and the ONNX models produced by the model builder look fine to me. I'll investigate more closely and get back to you.
Another thing to consider is that Gemma/Gemma-2 use tied weights, meaning the model builder unnecessarily duplicates these weights, with the embedding layer staying fp16 and lm_head converted to q4. Ideally, both should have the same dtype (e.g., fp16), with lm_head being the transpose of the embedding layer.
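To make the tied-weights point concrete, here is a small sketch (the checkpoint id and dtype are assumptions, and loading the 27B model this way needs a lot of RAM) that checks whether the embedding and lm_head share a single tensor before export:

```python
# Minimal sketch: with tied word embeddings, lm_head reuses the embedding
# weight, so exporting the two independently (and at different dtypes)
# duplicates that data in the ONNX file.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b", torch_dtype=torch.float16
)

embed = model.get_input_embeddings().weight    # [vocab_size, hidden_size]
lm_head = model.get_output_embeddings().weight

print(model.config.tie_word_embeddings)        # True for Gemma / Gemma-2
print(embed.data_ptr() == lm_head.data_ptr())  # same underlying storage when tied
```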
The behavior appears to be happening because logit soft-capping is not used in the `GroupQueryAttention` op. Since NaNs are appearing in the model's output, we will add logit soft-capping to `GroupQueryAttention`.
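For reference, logit soft-capping squashes logits through tanh so their magnitude stays within a fixed bound, which keeps fp16 attention scores from overflowing into NaN. A minimal sketch follows; the cap values (50.0 for attention logits, 30.0 for final logits) are taken from the published Gemma-2 configuration and are assumptions as far as this thread is concerned.

```python
# Minimal sketch of Gemma-2-style logit soft-capping: cap * tanh(logits / cap)
# bounds values to (-cap, cap) while staying smooth and monotonic.
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    return cap * np.tanh(logits / cap)

scores = np.array([10.0, 80.0, 300.0], dtype=np.float32)
print(soft_cap(scores, cap=50.0))  # attention-logit soft-capping
print(soft_cap(scores, cap=30.0))  # final (lm_head) logit soft-capping
```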
I’m really looking forward to it.
Logit soft-capping has now been added to `GroupQueryAttention` in this PR:
### Description
This PR adds the `softcap` attribute to the `GroupQueryAttention` op.

### Motivation and Context
This PR helps resolve the `NaN` output issue with Gemma-2 raised in [this issue](#692).
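As a purely hypothetical illustration of what the new attribute looks like on the graph (this is not the official workflow; once the fix lands, builder.py should emit the attribute itself, and the runtime must be new enough to understand it), one could stamp `softcap` onto the GroupQueryAttention nodes of an already exported model. The value 50.0 is the Gemma-2 attention soft-cap and is an assumption here:

```python
# Hypothetical sketch only: add the softcap attribute to each GroupQueryAttention
# node of an exported model. Requires an onnxruntime build that supports softcap.
import onnx
from onnx import helper

model = onnx.load("model.onnx", load_external_data=False)  # keep weights external
for node in model.graph.node:
    if node.op_type == "GroupQueryAttention":
        node.attribute.extend([helper.make_attribute("softcap", 50.0)])
onnx.save(model, "model_softcap.onnx")  # still references model.onnx.data
```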
When it runs to the line `tensor = numpy_helper.from_array(np_data)`, it just crashes with no error. I added a try-except block to try to catch the error, but I got nothing.
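For context, `numpy_helper.from_array` copies the array into a protobuf tensor, roughly doubling peak memory for that weight; if the machine runs out of RAM, the process is killed by the OS rather than raising a Python exception, which would explain a crash that no try/except can catch (and why the 256 GB machine later succeeded). A minimal sketch, with a hypothetical weight shape:

```python
# Minimal sketch (hypothetical shape): from_array serializes the numpy array
# into a TensorProto, holding a second copy of the data in memory while it does.
import numpy as np
from onnx import numpy_helper

np_data = np.zeros((256000, 4608), dtype=np.float16)  # placeholder weight shape
tensor = numpy_helper.from_array(np_data, name="model.embed_tokens.weight")
print(tensor.dims, len(tensor.raw_data))
```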
@kunal-vaishnavi