
[LLama3] Seqlen 128k for Llama3.2 1B and 3B - Reproduce on reference GPU #15737

Closed
mtairum opened this issue Dec 5, 2024 · 8 comments · Fixed by #15909
mtairum (Contributor) commented Dec 5, 2024

For the two smallest Llama3 models, 1B and 3B, we see bad output when the prefill seqlen is very large, i.e. > 64K.

We've debugged extensively and cannot find a reason why this happens for these two small models when it works fine for the 8B up to 70B models.

Our best guess is that, due to the small size of these models, outputs at long context lengths become very sensitive to accuracy loss.

TODO

  • Test the reference 1B / 3B model on CPU or GPU, using the same prompt, and look at the output. If the reference is also weak at these lengths, we can close this issue; otherwise we'll revisit it. (A rough sketch of one way to run this check follows below.)
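
A minimal sketch of such a check, assuming the HuggingFace `meta-llama/Llama-3.2-1B-Instruct` checkpoint and a hypothetical `long_context_prompt.txt` of roughly 100k tokens (this is not the team's actual test harness, just an illustration):

```python
# Sketch only: load the HF reference model, prefill a very long prompt,
# and inspect the continuation quality at large sequence positions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical file holding a ~100k-token prompt (e.g. a long book excerpt).
long_prompt = open("long_context_prompt.txt").read()
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
print(f"prefill length: {inputs.input_ids.shape[1]} tokens")

out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
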
yieldthought (Contributor) commented:

[Attachments: accuracy-vs-sequence-length plot, ref.log]

1B reference model loses accuracy sharply after around 80k tokens
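
For context, one way to produce this kind of curve is teacher-forced next-token top-1 accuracy per position bucket. The sketch below is an assumption about the measurement, not the actual script behind the plot, and the chunk and bucket sizes are arbitrary:

```python
# Sketch: next-token top-1 accuracy per position bucket over a long document,
# computed in chunks with a KV cache so the per-step logits stay small.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

text = open("long_document.txt").read()  # hypothetical ~128k-token document
ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

chunk = 2048
hits = []
past = None
with torch.no_grad():
    for start in range(0, ids.shape[1], chunk):
        piece = ids[:, start:start + chunk]
        out = model(piece, past_key_values=past, use_cache=True)
        past = out.past_key_values
        # logits[:, i] predicts piece[:, i + 1]; boundary tokens between chunks are skipped.
        preds = out.logits[:, :-1].argmax(-1)
        hits.append((preds == piece[:, 1:]).float()[0])
hits = torch.cat(hits)

bucket = 8192
for start in range(0, hits.numel(), bucket):
    acc = hits[start:start + bucket].mean().item()
    print(f"tokens {start:>6}-{start + bucket:>6}: top-1 acc {acc:.3f}")
```
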

yieldthought (Contributor) commented:

[Attachment: accuracy-vs-sequence-length plot for 3B]
3B follows a similar pattern

yieldthought (Contributor) commented:

I was a bit concerned: it doesn't really make sense that Meta would release the models in this state, so I wondered whether there's something wrong with the version of the reference we have or the way we're using it, e.g. incorrectly propagated RoPE settings. So I modified the 1B test to run with the HuggingFace model and tokenizer, and it is fine:

[Attachment: plot showing the HuggingFace 1B model maintains accuracy at long context]

yieldthought (Contributor) commented:

We're using the 3.1 reference codebase, which hard-codes a RoPE scaling factor of 8, but 3.2 (as seen in the HF config below) actually uses 32:
[Attachments: screenshots of the reference code with scale_factor = 8 and the HF 3.2 config with factor = 32]

yieldthought (Contributor) commented:

The official Meta Llama 3 repo uses a scale_factor of 8 for all the models, including 3.2:
https://github.com/meta-llama/llama-models/blob/fc1e70e7970bdf599a924f6fd06cedb6e3819224/models/llama3/reference_impl/model.py#L47
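
For reference, the scaling logic in question looks roughly like this (a paraphrased sketch of the linked reference implementation, not a verbatim copy); the hard-coded `scale_factor = 8` is the value that should be 32 for the 3.2 1B/3B checkpoints:

```python
import math
import torch

def apply_scaling(freqs: torch.Tensor) -> torch.Tensor:
    # Llama 3.x long-context RoPE frequency scaling (paraphrased sketch).
    scale_factor = 8          # hard-coded in the 3.1 reference; 3.2 1B/3B expect 32
    low_freq_factor = 1
    high_freq_factor = 4
    old_context_len = 8192    # original Llama 3 context length

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    new_freqs = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            # High-frequency (short-wavelength) components are left untouched.
            new_freqs.append(freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency components are stretched by the scale factor.
            new_freqs.append(freq / scale_factor)
        else:
            # Smooth interpolation between the two regimes.
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return torch.tensor(new_freqs, dtype=freqs.dtype, device=freqs.device)
```
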

yieldthought (Contributor) commented:

The config.json in the Meta Llama 3.2 HuggingFace repo was updated from 8 to 32 a couple of weeks after release:
https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/commit/c4219cc9e642e492fd0219283fa3c674804bb8ed
[Attachment: screenshot of the config.json change]

It appears this was a bugfix, and the fix was never applied to their official GitHub repository.
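
For reference, after that commit the `rope_scaling` block in the HF config has roughly this shape (shown here as a Python dict; the exact field values should be checked against the linked commit):

```python
# Sketch of the rope_scaling entry in the updated Llama-3.2-1B-Instruct config.json.
rope_scaling = {
    "rope_type": "llama3",
    "factor": 32.0,                            # was 8.0 before the fix
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
}
```
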

yieldthought (Contributor) commented:

Notified the Meta repo so they can fix this on their end too.

mtairum (Contributor, Author) commented Dec 13, 2024

Merged in PR #15909

Thanks @yieldthought
