Lack of documentation regarding RoPE scaling #2402

Closed
maddes8cht opened this issue Jul 26, 2023 · 10 comments

@maddes8cht (Contributor) commented Jul 26, 2023

Documentation is lagging behind current development.
The only documentation of the RoPE parameters is the two lines in the --help output that say:

  --rope-freq-base N RoPE base frequency (default: 10000.0)
  --rope-freq-scale N RoPE frequency scaling factor (default: 1)

There is no mention of RoPE scaling in the primary readme.md, no mention of the new parameters in the readme pages of the "main" or "server" examples or in any of the pages now linked from the "docs" section of the primary readme, and no mention in the new wiki pages.

So it is very hard for a normal user to even notice that they have missed something.
Once a user does notice, the only explanation that seems to be available at the moment is PR #2054.

But finding out what reasonable values for scale and base actually are requires a lot of reading - the first concrete suggestion is, explicitly:
For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
What about the not-so-bold?

In the course of the PR, numerous combinations of the base and scale parameters were tried, and I also experimented with the recommended combinations.

But a reasonably clear description of which values are recommended, how they depend on each other, and perhaps also how they relate to Llama 2, is not really to be found - and where it does exist, it takes considerable effort to dig up.
RoPE scaling is a clear extension of what Llama can do - shouldn't there be some form of documentation for it?

@maddes8cht (Contributor, Author) commented Jul 26, 2023

What do you mean?
Since Llama 2 has a context size of 4096, I'd expect it to work with -c 4096 out of the box, without any RoPE parameters.
But what about a larger context like 8k?
The same parameters as for Llama 1 at 4096 tokens, I guess?

@ikawrakow (Contributor) commented Jul 26, 2023

In PR #2295 there is a graph of the --rope-freq-base value that minimizes the perplexity of the wikitext dataset, for contexts up to 4 times the training context, for the 7B and 13B models. A fit is also given:

  base frequency = 10000 * (-0.13436 + 0.80541 * x + 0.28833 * x^2)  for 7B
  base frequency = 10000 * (-0.41726 + 1.1792 * x + 0.16915 * x^2)   for 13B

where x is the ratio of the context length to the training context length of the model. For instance, if you want to run LLaMA-2 with a context of 8192, we have x = 8192 / 4096 = 2. Using x = 2 in the above equations gives 26298 for 7B and 26177 for 13B (but there is very little difference between what you get with 26177, 26298, or even just 26000). I personally find it much easier to just work with --rope-freq-base. You can marginally improve perplexity by also varying --rope-freq-scale, but this requires a lot more experimentation, as you then need to find optimum values for two parameters instead of just one.
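
For convenience, here is a minimal Python sketch of those fits (the function name is illustrative; the coefficients are the ones quoted above, and the fits only apply up to about 4x the training context):

  # Perplexity-minimizing base-frequency fits from PR #2295.
  # Only meaningful up to ~4x the training context; beyond that they break down.
  def rope_freq_base(context, train_context=4096, model="7B"):
      coeffs = {
          "7B":  (-0.13436, 0.80541, 0.28833),
          "13B": (-0.41726, 1.1792, 0.16915),
      }
      a, b, c = coeffs[model]
      x = context / train_context  # ratio of desired to training context
      return 10000 * (a + b * x + c * x ** 2)

  # LLaMA-2 (trained at 4096) with a context of 8192, i.e. x = 2:
  print(round(rope_freq_base(8192)))               # 26298 for 7B
  print(round(rope_freq_base(8192, model="13B")))  # 26177 for 13B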

When you go beyond 4 times the training context size, things get out of control very easily (very high perplexity) unless you find the magical combination of --rope-freq-base and --rope-freq-scale. That's why there is the statement "For the bold, try adding the following command ..." to use a context of 16384 with LLaMA-1 (x = 8, where the above fits absolutely do not apply).

@dandm1 commented Jul 26, 2023

Thank you for asking about this. I've also been poring over the PRs trying to sort out optimal values to use. There are a lot of different versions of NTK scaling floating around at the moment, and they're all more complex than the linear RoPE scaling implemented in Exllama. In my case I'm interested in applying the scaling to 65b and 70b models and I haven't seen a lot of guidance for those.

I would also appreciate some documentation on the usage of CFG. I've read the paper, but I'm still a little unclear on how to get the most out of its implementation in llama.cpp.

@SlyEcho (Collaborator) commented Jul 26, 2023

CFG amplifies the prompt's output by subtracting the output of a similar prompt whose outcome you don't want. You can see some funny examples in #2217.
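
As a rough illustration of the CLI usage (the flag names below are my assumption from the CFG PR - check ./main --help in your build; the model path and prompts are placeholders):

  ./main -m models/7B/ggml-model.bin -p "Write a cheerful story about a cat." \
      --cfg-negative-prompt "Write a gloomy story about a cat." --cfg-scale 2.0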

@spencekim commented

Is there any guidance on how to set the RoPE parameters for 30b+ models?

@klosax (Contributor) commented Aug 7, 2023

Added a parameter --rope-scale which makes more sense and is in line with HF config.json for linear context scaling.
Also added documentation of linear RoPE scaling in the readme here
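
To illustrate (on the assumption that --rope-scale N is the linear scaling factor, i.e. equivalent to --rope-freq-scale 1/N, and with a placeholder model filename), doubling LLaMA-2's 4096-token training context would look something like:

  ./main -m llama-2-7b.bin -c 8192 --rope-scale 2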

@madiarabis commented

Is there any documentation on how to implement this, or an example? I am fairly new to the field; I am fine-tuning Code Llama 2 and I want to increase the context length, but between all these posts I am somewhat confused about how to actually implement it.

@github-actions github-actions bot removed the stale label Mar 30, 2024
@github-actions github-actions bot added the stale label Apr 29, 2024
github-actions bot commented

This issue was closed because it has been inactive for 14 days since being marked as stale.

@Darrshan-Sankar commented

Quoting klosax: "Added a parameter --rope-scale which makes more sense and is in line with HF config.json for linear context scaling. Also added documentation of linear RoPE scaling in the readme here"

Is this parameter rope_scale the same as the parameter rope_freq_scale available in LangChain's extension of LlamaCpp?
Requesting this clarification as I am pretty confused.
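
For context, a minimal sketch of how RoPE parameters are typically passed through the Python bindings (I am assuming LangChain's LlamaCpp wrapper exposes rope_freq_base / rope_freq_scale and forwards them to llama-cpp-python - verify the field names against your installed versions):

  # Assumed API: LangChain's LlamaCpp forwarding RoPE settings to llama-cpp-python.
  # The model path is a placeholder; field names may differ between versions.
  from langchain_community.llms import LlamaCpp

  llm = LlamaCpp(
      model_path="./llama-2-7b.Q4_K_M.gguf",  # placeholder path
      n_ctx=8192,                # desired context length
      rope_freq_base=26000,      # e.g. from the PR #2295 fit for x = 2
      rope_freq_scale=1.0,       # frequency scale (not the same number as --rope-scale)
  )
  print(llm.invoke("Hello"))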
