Replies: 2 comments
- Here are two attempts with flash attention enabled: the model's follow-up questions are irrelevant and the conversation completely loses focus. And here it is with flash attention disabled, all other settings identical: it guesses the right country within two questions.
- Tested on an NVIDIA RTX 4000 SFF Ada Generation, an RTX 3060, and a GTX 1080. All appear to be affected.
- Based on my own observations and user feedback for my application, I've noticed a significant degradation in output quality when flash attention is enabled. This was observed with the Llama3-8B and Falcon-7B base models, both running on GPU. In particular, with flash attention enabled, completions become extremely vague or superficially reactive and randomly switch topic.
The application makes continuous use of context shifting. Is there any known interaction issue between context shifting and flash attention?
Any similar experiences, or thoughts on this combination?
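For anyone trying to reproduce this, a minimal A/B comparison can be run with llama.cpp's `llama-cli`, toggling only the `--flash-attn` flag between runs. This is a sketch under assumptions: the model path and prompt file below are placeholders, and the exact binary name may differ depending on your llama.cpp version (older builds ship it as `main`).

```shell
# Placeholder paths -- substitute your own GGUF model and prompt.
MODEL=./models/llama3-8b.Q4_K_M.gguf
PROMPT=./prompt.txt

# Baseline: flash attention off (the default).
./llama-cli -m "$MODEL" -f "$PROMPT" -c 4096 -n 256 --seed 42 > fa_off.txt

# Identical settings, flash attention on.
./llama-cli -m "$MODEL" -f "$PROMPT" -c 4096 -n 256 --seed 42 --flash-attn > fa_on.txt

# With a fixed seed and greedy-ish settings, divergence here points at
# the attention path rather than sampling noise.
diff fa_off.txt fa_on.txt
```

To also exercise the context-shifting path described above, the same toggle can be applied to `llama-server` and the conversation driven past the context window, since the server performs context shifting automatically.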