VRAM Usage #39
Comments
That sounds very strange. Which model are you using? (bitrate, etc.) The chunk size is just how much VRAM is reserved at the end of the context when generating, and the step size for rolling the cache when the max context length is exceeded.
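For illustration, here is a minimal sketch of the rolling-cache idea described above; the names (`max_seq_len`, `chunk_size`, `roll_context`) are assumptions for this example, not the actual exui/exllamav2 code.

```python
# Minimal sketch of a rolling context cache, with assumed names
# (max_seq_len, chunk_size); not the actual exui/exllamav2 implementation.

def roll_context(tokens: list[int], max_seq_len: int, chunk_size: int) -> list[int]:
    """Keep the token sequence short enough that at least `chunk_size`
    positions remain free for generation, dropping the oldest tokens
    in chunk-sized steps when the limit is exceeded."""
    budget = max_seq_len - chunk_size          # space reserved for new output
    while len(tokens) > budget:
        tokens = tokens[chunk_size:]           # roll forward by one chunk
    return tokens

# Example: with a 32768-token cache and a 512-token chunk, the prompt is
# trimmed until no more than 32256 tokens remain before generation starts.
```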
This is what I'm using: https://huggingface.co/brucethemoose/Yi-34B-200K-RPMerge-exl2-40bpw (4bpw), with Q4 cache and no speculative decoding. Upon loading the model in exui, I find Python consuming 23 GB of VRAM, a bit more than ooba. In two separate chat sessions before crapping out, I noticed that the prompt length of the last successful message was exactly 27338 tokens. Given the exact number, I don't believe it is a coincidence. When I try to continue the conversation in these bugged sessions, it outputs either nothing or gibberish, yet at the same time VRAM usage rises to 23.6 GB and finally 24 GB after each additional attempt, indicating some sort of memory leak. It is my understanding that VRAM should not rise significantly (or at all) after the model has loaded; that is certainly what I see with ooba, which can manage at least 55k tokens. Is it related to chunk tokens? I don't see an equivalent option in ooba. Does exui have a built-in summarization attempt, or some special caching mechanism?
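As a quick way to separate allocator growth from total process usage (Task Manager counts everything the process has reserved), here is a small illustrative sketch using PyTorch's CUDA memory counters; it is not something exui exposes, just a way to check the leak hypothesis.

```python
import torch

def report_vram(tag: str) -> None:
    """Print PyTorch-side VRAM counters for device 0.
    Note: this only covers memory managed by the PyTorch allocator;
    Task Manager / nvidia-smi also include CUDA context overhead."""
    alloc = torch.cuda.memory_allocated(0) / 2**30
    reserved = torch.cuda.memory_reserved(0) / 2**30
    print(f"[{tag}] allocated: {alloc:.2f} GiB, reserved: {reserved:.2f} GiB")

# Call once right after the model loads and again after each generation;
# a steadily climbing 'allocated' number would point at a genuine leak,
# while a stable number suggests the growth is fragmentation or overhead.
```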
I'm having a hard time reproducing this. With Q4 cache and a context length of 55k, that model sits at a consistent VRAM usage just under 22 GB for me. I've tested inference up to 100k tokens without issue. Can you share a little more info about the system? Windows/Linux? CUDA version? Are you using flash-attn? Etc.
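For anyone gathering the same details, a small sketch like the following collects them in one place (the flash-attn check simply tries the import); this is just a convenience snippet, not part of exui.

```python
import platform
import torch

# Collect the environment details asked for above: OS, CUDA version,
# GPU name, and whether flash-attn is importable.
print("OS:", platform.platform())
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn: not installed")
```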
UPDATE: Just installed flash-attention and the memory spikes/leaks seem to be resolved. If anyone needs it, use the pre-built wheels from ooba (https://github.com/oobabooga/flash-attention/releases/). I am still able to replicate the problem after pushing the context a bit further, though, and my old sessions remain broken. It seems to happen when the context limit is reached and the prompt halving occurs while VRAM is near full; I then get "AssertionError: Total sequence length exceeds cache size in model.forward" (see Error 2 at the bottom for the full stack trace). What is the expected behavior if, after the model has completely loaded, it needs more VRAM for whatever reason and it is not available? What else persists in a session other than what's in the JSON?

Windows 10, CUDA 12.1. I haven't installed flash-attn; from the git page, my understanding is that it is possible to install on Windows but difficult. This might be the issue. The issue is definitely related to VRAM nearing its peak, so if you have a lot more VRAM (which I assume you must), you probably won't be able to reproduce it.

I was able to reproduce the issue in a new session. I managed to get past 30k and reached about 33k when I finally edited part of the conversation. I am using it mainly for story-writing assistance, so the block was fairly large. After editing I hit generate and noticed my VRAM usage for python.exe spike to around 24.1 GB. (I am not sure if editing the block had anything to do with it; I may have edited after noticing an issue.) Thus, I can only assume that the problem has to do with creating the cache.

I confirmed the problem persists with the session even after restarting exui. The model will load, and Task Manager reports 23.1 GB after loading at 45k context. I hit generate a few times in the session, it reports 24 GB, and nothing is output. Afterwards, I dumped a whole block of text of roughly 33k tokens into a new session: no response, and clicking generate again caused a VRAM spike. So it seems to be a matter of simply maxing out VRAM and then forcing the cache to be rebuilt.

I would note that ooba has a much more limited ability to modify the chat history, but I do not encounter any issues going into 50k+ context except for what I suppose is slower processing of the context. From what I understand, ooba automatically installs flash-attention for you, so that may be the reason.

Lastly, a couple of points that might help:
On occasions when I have tried continuing corrupted sessions, I would sometimes get errors like these. Error 2 (not sure if these are because I changed the model size; I was doing a lot of different things to see if I could revive a corrupted session by reducing context to save VRAM, with no success): ERROR:waitress:Exception while serving /api/generate
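For context on what that assertion is checking, here is a rough, paraphrased sketch of the kind of guard that produces it; the attribute names (`cache.current_seq_len`, `cache.max_seq_len`) are assumptions for this illustration, not a quote of the exllamav2 source.

```python
# Rough paraphrase of the guard behind the reported error, with assumed
# names; not the exact source code.
def forward_guard(input_ids, cache):
    past_len = cache.current_seq_len          # tokens already held in the cache
    new_len = input_ids.shape[-1]             # tokens being fed in this call
    assert past_len + new_len <= cache.max_seq_len, \
        "Total sequence length exceeds cache size in model.forward"
```

In other words, it fires when a session tries to push more tokens through the model than the cache was allocated for, which lines up with the context-management explanation in the next reply.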
The AssertionError you're getting is not due to VRAM limitations but some sort of bug in the context management. One issue is that there's no feedback for extremely long prompt processing, and up until recently the client would time out waiting for the server to finish starting a generation. Then the server would eventually finish and silently add the response to the session, but nothing would show in the client until you switched to a different session and back again. The timeout is much longer now, but there's still no visual feedback at the moment, so you can probably still end up with a confused context with very long prompts. And if a single block of text is longer than the model's whole max context length, there's no mechanism at the moment for cutting it up into smaller chunks.
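If one wanted to work around that last point on the client side, a sketch like the following splits an oversized block into pieces that fit the cache; this is only an illustration under the assumption of a tokenizer with `encode`/`decode` that maps text to a token-id list and back, not a feature exui provides.

```python
# Illustrative workaround sketch, not an exui feature: split an oversized
# text block into token chunks that each fit within the model's context.
def split_to_chunks(text: str, tokenizer, max_seq_len: int, margin: int = 512):
    """Tokenize `text` and yield decoded pieces of at most
    max_seq_len - margin tokens, leaving `margin` tokens free for output."""
    limit = max_seq_len - margin
    ids = tokenizer.encode(text)
    for start in range(0, len(ids), limit):
        yield tokenizer.decode(ids[start:start + limit])
```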
VRAM usage is oddly much higher compared to ooba.
I have only tried Yi-34B-200K models. I have a 4090, and Yi-34B-200K at 55k context uses only 23 GB on ooba.
In exui, the model bugged out at 30k context, where VRAM usage would spike and then the output would be gibberish. I suspect the issue is due to chunk size. I am not sure what chunk size is used for; possibly summarization?