Lack of documentation regarding RoPE scaling #2402

Closed
maddes8cht opened this issue Jul 26, 2023 · 10 comments

@maddes8cht (Contributor) commented Jul 26, 2023

Documentation is lagging behind current development.
The only documentation of the RoPE parameters is the two lines in the --help output that say:

  --rope-freq-base N RoPE base frequency (default: 10000.0)
  --rope-freq-scale N RoPE frequency scaling factor (default: 1)

There is no mention of RoPE scaling in the primary readme.md, no mention of the new parameters in the readme pages of the "main" or "server" examples or in any of the pages now linked from the "docs" section of the primary readme, and no mention in the new wiki pages.

So it is very hard for a normal user to even notice that they have missed something.
Once a user does notice, the only explanation that seems to be available at the moment is PR #2054.

But finding out what reasonable values for scale and base actually are requires a lot of reading - the first concrete suggestion is, explicitly:
For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
What about the not-so-bold?

In the course of the PR, numerous combinations of the base and scale parameters were tried, and I also experimented with the recommended combinations.

But a reasonably clear description of which values are recommended, how they depend on each other, and perhaps also how they relate to Llama 2, is not really to be found - and where it does exist, it takes considerable effort to dig up.
RoPE scaling is a clear extension of what Llama can do - shouldn't there be some form of documentation for it?

@maddes8cht (Contributor, Author) commented Jul 26, 2023

What do you mean?
Since Llama 2 has a context size of 4096, I'd expect it to work with -c 4096 out of the box, without any RoPE parameters.
But what about a larger context like 8k?
The same parameters as for Llama 1 at 4096 tokens, I guess?

@ikawrakow (Contributor) commented Jul 26, 2023

In PR #2295 there is a graph of the --rope-freq-base value that minimizes the perplexity of the wikitext dataset, for contexts up to 4 times the training context, for the 7B and 13B models. A fit is also given:

  base frequency = 10000 * (-0.13436 + 0.80541 * x + 0.28833 * x^2)  for 7B
  base frequency = 10000 * (-0.41726 + 1.1792 * x + 0.16915 * x^2)   for 13B

where x is the ratio of the context length to the training context length of the model. For instance, if you want to run LLaMA-2 with a context of 8192, we have x = 8192 / 4096 = 2. Using x = 2 in the above equations gives 26298 for 7B and 26177 for 13B (but there is very little difference between what you get with 26177, 26298, or even just 26000). I personally find it much easier to just work with --rope-freq-base. You can marginally improve perplexity by also varying --rope-freq-scale, but this requires a lot more experimentation, as you then need to find optimum values for two parameters instead of just one.
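
For convenience, here is a minimal Python sketch of those fits (the function name is illustrative; the coefficients are the ones quoted above, and the fits only apply up to about 4x the training context):

  # Perplexity-minimizing base-frequency fits from PR #2295.
  # Only meaningful up to ~4x the training context; beyond that they break down.
  def rope_freq_base(context, train_context=4096, model="7B"):
      coeffs = {
          "7B":  (-0.13436, 0.80541, 0.28833),
          "13B": (-0.41726, 1.1792, 0.16915),
      }
      a, b, c = coeffs[model]
      x = context / train_context  # ratio of desired to training context
      return 10000 * (a + b * x + c * x ** 2)

  # LLaMA-2 (trained at 4096) with a context of 8192, i.e. x = 2:
  print(round(rope_freq_base(8192)))               # 26298 for 7B
  print(round(rope_freq_base(8192, model="13B")))  # 26177 for 13B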

When you go beyond 4 times the training context size, things get out of control very easily (very high perplexity) unless you find the magical combination of --rope-freq-base and --rope-freq-scale. That's why there is the statement "For the bold, try adding the following command ..." to use a context of 16384 with LLaMA-1 (x = 8, where the above fits absolutely do not apply).

@dandm1 commented Jul 26, 2023

Thank you for asking about this. I've also been poring over the PRs trying to sort out optimal values to use. There are a lot of different versions of NTK scaling floating around at the moment, and they're all more complex than the linear RoPE scaling implemented in Exllama. In my case I'm interested in applying the scaling to 65b and 70b models and I haven't seen a lot of guidance for those.

I would also appreciate some documentation on the usage of CFG. I've read the paper, but I'm still a little unclear on how to get the most out of its implementation in llama.cpp.

@SlyEcho (Collaborator) commented Jul 26, 2023

CFG amplifies the prompt's output by subtracting the output of a similar prompt whose outcome you don't want. You can see some funny examples in #2217.
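
As a rough illustration of the CLI usage (the flag names below are my assumption from the CFG PR - check ./main --help in your build; the model path and prompts are placeholders):

  ./main -m models/7B/ggml-model.bin -p "Write a cheerful story about a cat." \
      --cfg-negative-prompt "Write a gloomy story about a cat." --cfg-scale 2.0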

@spencekim commented

Is there any guidance on how to set the RoPE parameters for 30b+ models?

@klosax (Contributor) commented Aug 7, 2023

Added a parameter --rope-scale which makes more sense and is in line with HF config.json for linear context scaling.
Also added documentation of linear RoPE scaling in the readme here
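
To illustrate (on the assumption that --rope-scale N is the linear scaling factor, i.e. equivalent to --rope-freq-scale 1/N, and with a placeholder model filename), doubling LLaMA-2's 4096-token training context would look something like:

  ./main -m llama-2-7b.bin -c 8192 --rope-scale 2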

@madiarabis commented

Is there any documentation on how to implement this, or an example? I am fairly new to the field; I am fine-tuning Code Llama 2 and I want to increase the context length, but between all these posts I am somewhat confused about how to actually implement it.

@github-actions github-actions bot removed the stale label Mar 30, 2024
@github-actions github-actions bot added the stale label Apr 29, 2024
github-actions bot commented

This issue was closed because it has been inactive for 14 days since being marked as stale.

@Darrshan-Sankar commented

Quoting klosax: "Added a parameter --rope-scale which makes more sense and is in line with HF config.json for linear context scaling. Also added documentation of linear RoPE scaling in the readme here"

Is this parameter rope_scale the same as the parameter rope_freq_scale available in LangChain's extension of LlamaCpp?
Requesting this clarification as I am pretty confused.
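
For context, a minimal sketch of how RoPE parameters are typically passed through the Python bindings (I am assuming LangChain's LlamaCpp wrapper exposes rope_freq_base / rope_freq_scale and forwards them to llama-cpp-python - verify the field names against your installed versions):

  # Assumed API: LangChain's LlamaCpp forwarding RoPE settings to llama-cpp-python.
  # The model path is a placeholder; field names may differ between versions.
  from langchain_community.llms import LlamaCpp

  llm = LlamaCpp(
      model_path="./llama-2-7b.Q4_K_M.gguf",  # placeholder path
      n_ctx=8192,                # desired context length
      rope_freq_base=26000,      # e.g. from the PR #2295 fit for x = 2
      rope_freq_scale=1.0,       # frequency scale (not the same number as --rope-scale)
  )
  print(llm.invoke("Hello"))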
