Replies: 2 comments
- Here are two attempts with flash attention enabled: the model's follow-up questions are irrelevant and the conversation completely loses focus. And here it is with flash attention disabled, all other settings identical: it guesses the right country within two questions.
- Tested on an NVIDIA RTX 4000 SFF Ada Generation, an RTX 3060, and a GTX 1080. All appear to be affected.
- Based on my own observations and user feedback for my application, I've noticed a significant degradation in output quality when flash attention is enabled. This was observed with the Llama3-8B and Falcon-7B base models, both running on GPU. In particular, with flash attention enabled, completions become extremely vague or superficially reactive and randomly switch topic.
The application makes continuous use of context shifting. Is there any known interaction issue between context shifting and flash attention?
Any similar experiences, or thoughts on this combination?
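For anyone trying to reproduce this, a minimal A/B comparison can be run with llama.cpp's `llama-cli`, toggling only the `--flash-attn` flag between runs. This is a sketch under assumptions: the model path and prompt file below are placeholders, and the exact binary name may differ depending on your llama.cpp version (older builds ship it as `main`).

```shell
# Placeholder paths -- substitute your own GGUF model and prompt.
MODEL=./models/llama3-8b.Q4_K_M.gguf
PROMPT=./prompt.txt

# Baseline: flash attention off (the default).
./llama-cli -m "$MODEL" -f "$PROMPT" -c 4096 -n 256 --seed 42 > fa_off.txt

# Identical settings, flash attention on.
./llama-cli -m "$MODEL" -f "$PROMPT" -c 4096 -n 256 --seed 42 --flash-attn > fa_on.txt

# With a fixed seed and greedy-ish settings, divergence here points at
# the attention path rather than sampling noise.
diff fa_off.txt fa_on.txt
```

To also exercise the context-shifting path described above, the same toggle can be applied to `llama-server` and the conversation driven past the context window, since the server performs context shifting automatically.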