Upcoming PR - Pushing the Context limit to 8k+ for all existing Falcon models - Longrange Falcon flights #62
Update: this will be delayed for another day. Results with 40B are quite good, though calculating something like perplexity is a difficult task due to the performance loss at high context. It's 6+ times faster than before the recent KV cache PR, but that's still too slow to use comfortably at 8k+ context.
Well, in the end I found that my elaborate new method was more than beaten by the findings of two reddit users called bloc97 and emozilla. After probably 12 hours of continued debugging into optimizing RoPE dynamically, compressing the rotation space in various ways and struggling with Falcon 7B just not coping well with any change, I stumbled on their findings. After implementing that quite closely, Falcon 7B gave me a brilliant response at 4k context (still > 30 tokens/sec generation) and a good 8k response as well (see below).
Falcon 40B results are now also in: #65
What is the prompt processing performance like? I found that even at 3k context, the model was unfortunately not practical to use due to the prompt processing speed on an RTX 4080.
I'm working on that part; for a large prompt you need to use "-b" to process the prompt in batches. This is quite flawed currently, and I'm already working on an overhaul. When using 7B, 3k context is quite usable currently. Sadly the current release has the prompt cache broken; I'm fixing that too ;)
The current branch ggfalcon_dev did make some progress in terms of processing speed, though the KV cache is not handled on the GPU yet; that's the main limitation, and it grows with context.
I plan to PR today, though it depends on final progress.
The computation speed is slow because we have no mul-mat kernel with interleaved broadcast support yet, so tests are time consuming.
Falcon has twice the vocabulary size of llama; in practice that means Falcon naturally has a throughput benefit of 30-40% on English text and about 20-25% on code and foreign languages.
This also means that 50 tokens/sec on Falcon is about as fast as 70 tokens/sec on llama in terms of language throughput.
So an 8k context window on Falcon is equivalent to ~12k context on llama.
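As rough arithmetic (assuming Falcon needs about 30% fewer tokens for the same English text, i.e. a factor of roughly 1.4; the exact ratio varies by text):

$$
50 \times 1.4 \approx 70 \ \text{llama-equivalent tokens/sec}, \qquad 8000 \times 1.4 \approx 11{,}200 \approx 12\text{k llama tokens}.
$$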
The task: pre-processing a large input such as a book chapter, complex code, a tutorial or a transcription of a meeting.
Now I want to be able to interview Falcon about this huge text, to work with it, extend it or transform it.
For the current work I copied the entire falcon_eval_internal() function from the current libfalcon.cpp; that's 20 kB of source code and quite exactly 7k Falcon tokens, and the question asked is:
"<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>"
I'm processing this with a high-quality quantization: the 40B Q5_K (OpenAssistant).
Default
Normal Falcon result on the above question and libfalcon.cpp input:
What is going on? If we look below the surface of how the model understands text, the most essential part for the relationship between tokens is the positional encoding done through RoPE. It sounds super complicated, but all it really is is a 2D rotation of each token based on its position in the total context.
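To make that concrete, here is a minimal sketch of what such a rotation looks like; the function name and layout are mine for illustration and not taken from libfalcon.cpp:

```cpp
#include <cmath>
#include <cstddef>

// Rotate each (even, odd) pair of one embedding vector by an angle that
// depends on the token position and the pair index. theta_base = 10000 is
// the value commonly used in training; dim is the head dimension.
void rope_rotate(float * x, size_t dim, int position, float theta_base = 10000.0f) {
    for (size_t i = 0; i < dim; i += 2) {
        const float freq  = std::pow(theta_base, -(float) i / (float) dim); // low pairs spin fast, high pairs slowly
        const float angle = (float) position * freq;
        const float c = std::cos(angle);
        const float s = std::sin(angle);
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s; // plain 2D rotation of the pair
        x[i + 1] = x0 * s + x1 * c;
    }
}
```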
Visualization of this rotation for one embedding:
This is how the model was trained to understand relationships between tokens and sequences within a 2048-token context. I am not entirely sure why this quite tight rotation is used; I assume (hope) someone mathed those parameters out.
Beyond that 2048 context the model quite quickly stops computing proper attention; at 7k context it's completely braindead.
But by adapting the angle of rotation we can push it back into reality.
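Roughly, there are two knobs here: compress the position itself (a fixed linear scale) or raise the rotary base so the rotation slows down (the NTK-style adjustment from the bloc97/emozilla findings). The sketch below is my illustration of both ideas in simplified form; the exact formulas and constants used in the PR may differ:

```cpp
#include <algorithm>
#include <cmath>

// 1) Fixed linear scaling: compress positions so an 8k context maps back into
//    the 2k range the model was trained on (scale = 4 for 8k / 2k).
float rope_angle_linear(int position, float freq, float scale) {
    return ((float) position / scale) * freq;
}

// 2) NTK-style scaling (bloc97), simplified dynamic form (emozilla): keep the
//    positions but raise the rotary base once the sequence exceeds the
//    training context, which mostly stretches the low-frequency pairs.
float rope_base_ntk(float base, int seq_len, int train_ctx, int head_dim) {
    const float alpha = std::max(1.0f, (float) seq_len / (float) train_ctx);
    return base * std::pow(alpha, (float) head_dim / (float) (head_dim - 2));
}
```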
For example 8k context with a fixed scaled rotation angle:
The model output now:
"<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>"
Here is another variant:
This is WIP. I currently have a bunch of different variants running that all perform a bit differently.
The amount of hallucination is striking.
The benchmark is the best OpenAI currently has to offer; of course those models not only have good parameters but were also fine-tuned for this purpose. Fine-tuning is something we can do once the Falcon large-context parameters are chosen.
Turbo-16k
GPT4 at 8k:
Overall, Turbo as well as GPT4 provide a definitely better roundup, especially regarding hallucinations, though not super convincing in all cases, which is also caused by the code being above the understanding of any LLM today.