Llama2 quantized model on Inf2 generating nonsense #41
Comments
Thank you for reporting; we are trying to reproduce the issue on our end. Can you share the Neuron package versions?
This is everything installed in the environment:

Package    Version
absl-py    1.4.0
…
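(A quick way to filter such a list down to just the Neuron-related packages; this is a generic Python snippet, not something from the original reply, and nothing in it is specific to this environment:)

```python
# Print only the installed packages whose names mention "neuron",
# e.g. torch-neuronx, transformers-neuronx, neuronx-cc.
import importlib.metadata

for dist in importlib.metadata.distributions():
    name = dist.metadata["Name"] or ""
    if "neuron" in name.lower():
        print(f"{name}=={dist.version}")
```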
Hello @sumaiyah, we tried to get the quantized checkpoint from the link you sent, but we were not successful. For this kind of accuracy debugging we would need the checkpoint. Is it possible to share the checkpoint and the script at this email: [email protected]? That would make the debugging faster for us.
@aws-rhsoln sent
@sumaiyah, how did you compile the model? Any special arguments for AWQ?
Hi @sumaiyah, this model uses the AWQ quantization algorithm, which is currently not supported in transformers-neuronx (TnX). Is it possible to use the standard Llama 2 7B weights for your use case?
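(For context, transformers-neuronx does ship its own weight-storage quantization, where weights are stored as int8 and dequantized to fp16 at compute time. The sketch below shows that supported path applied to a standard, non-AWQ Llama 2 7B split checkpoint; it assumes a recent transformers-neuronx release, and the exact module paths and kwargs may differ between versions.)

```python
# Sketch: int8 weight-storage quantization supported by transformers-neuronx,
# applied to a standard (non-AWQ) Llama 2 7B split checkpoint.
from transformers_neuronx.config import NeuronConfig, QuantizationConfig
from transformers_neuronx.llama.model import LlamaForSampling

# Store weights as int8, dequantize to fp16 for compute.
neuron_config = NeuronConfig(
    quant=QuantizationConfig(quant_dtype="s8", dequant_dtype="f16"),
)

# "./llama-2-7b-split" is a placeholder for a checkpoint saved with
# save_pretrained_split (see the sketch at the end of the thread).
neuron_model = LlamaForSampling.from_pretrained(
    "./llama-2-7b-split",
    batch_size=1,
    tp_degree=2,          # 2 NeuronCores on an inf2.8xlarge
    amp="f16",
    neuron_config=neuron_config,
)
neuron_model.to_neuron()  # compile and load onto the NeuronCores
```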
Original issue description:

I am following the steps in this notebook (https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb) to run a quantized Llama 2 model (https://huggingface.co/TheBloke/Dolphin-Llama2-7B-AWQ) on an AWS Inf2 instance (inf2.8xlarge). The code runs, but when I try to generate a sequence I get a nonsensical output stream.
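(For reference, the sampling flow in the linked notebook looks roughly like the sketch below, adapted here for a 7B checkpoint and an inf2.8xlarge with tp_degree=2; the model id, paths, and kwargs are illustrative and not taken from the original report.)

```python
# Sketch of the transformers-neuronx sampling flow from the linked notebook,
# adapted for a Llama 2 7B checkpoint on inf2.8xlarge (2 NeuronCores).
import torch
from transformers import AutoTokenizer, LlamaForCausalLM
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.llama.model import LlamaForSampling

checkpoint = "meta-llama/Llama-2-7b-hf"  # placeholder model id
split_dir = "./llama-2-7b-split"

# 1. Save the CPU checkpoint in the split format transformers-neuronx expects.
model_cpu = LlamaForCausalLM.from_pretrained(checkpoint)
save_pretrained_split(model_cpu, split_dir)

# 2. Load, compile, and move the model onto the NeuronCores.
neuron_model = LlamaForSampling.from_pretrained(
    split_dir, batch_size=1, tp_degree=2, amp="f16"
)
neuron_model.to_neuron()

# 3. Tokenize a prompt and sample.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
with torch.inference_mode():
    generated = neuron_model.sample(input_ids, sequence_length=256, top_k=50)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```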