Model wishlist #49
Mistral v0.2: it currently doesn't work with the mistral-gguf setup.
I think this is related to an error in config.json? https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/main/config.json#L19 Indeed, v0.2 doesn't have sliding-window attention: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
Thank you for raising this. I have fixed the issue, and Mistral v0.2 should work now.
Is it currently possible to use models that are not tuned for instructions? It seems that only chat models are supported.

```
$ target/release/mistralrs-server --port 1234 mistral --model-id mistralai/Mistral-7B-v0.1
(...)
No specified chat template, loading default chat template at `./default.json`.
```
@hugoabonizio, I just merged support for models with no chat template, so models that are not tuned for instructions are now supported.
@EricLBuehler, thank you for the quick reply! I'm trying to run lm-eval against mistral.rs to compare it with the Python implementation, but I'm having some issues since it calls the completions endpoint (`/v1/completions`).
@hugoabonizio, #107 just added the completions endpoint.
@EricLBuehler wow, that's faster than I can think! 😆 I'm getting an OOM error using non-quantized Mistral on an A100 80GB. Do you have any clue why?

```
$ target/release/mistralrs-server --port 1234 mistral --model-id mistralai/Mistral-7B-v0.1
$ curl http://localhost:1234/v1/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
  "model": "",
  "prompt": "What is Rust?"
}'
```
{"message":"DriverError(CUDA_ERROR_OUT_OF_MEMORY, \"out of memory\")","partial_response":{"id":"0","choices":[{"finish_reason":"error","index":0,"text":"\n\nRust is a malicious programming language that frustrates and irritates users. It’s activated by the RustM.exe file hiding on your device, disguised as the Windows Adobe Flash Player update “admin123.exe” file.\n\nUpon installing, the virus blackmails the victim to send bitcoin or pay money to remove Rust.\n\nRecently, the virus has been often distributed via an infection email titled: “Adobe Flash Player”‘.\n\nTherefore, it’s crucial to carefully read the document, since any presented data is commonly false.\n\nYou should never open anything from a stranger on the internet, as the Rust infection can catch you off guard and take control of your device.\n\nFurthermore, you shouldn’t click files that end on ‘.doc’ and ‘.exe’. As you know, these could be easily disguised as something innocent, but rather be something dangerous.\n\n## How to Get Rust Virus and How to Remove Rust From Your Device\n\nAs we already mentioned, it’s very easy to get Rust on your computer. The Rust virus distributor sends an email like you’re just any other bad driver.\n\nAs links and file documents with different extensions can be disguised as a game, email with some new data, or an update that easily infects your computer.\n\nMalicious malware viruses like the Rust virus blackmail the victim with pictures of their computer video feed. Through camera pictures and what seems like a screenshot, the Rust virus feels dangerous and frustrating.\n\nEspecially when the presentation is complete with good spellings and grammar.\n\nThis is a way of gaining the victim’s trust. Meaning that some believe what they are being told.\n\nThis in turn can cause fear, as the virus has installed completely and the device now belongs to the virus.\n\nTake into account, that everything seems innocent and easy to use, but in reality, the victim needs a solution as fast as possible.\n\nThe virus doesn’t give the victim options when to pay or send the amount of money they request for the “ransom”.\n\nRemembering that there’s no guarantee that he’ll really send the files back or not.\n\nSo that’s why VirusPro can help you solve the problem by offering you the options to always be safe and protected when it comes to Viruses like Rust.\n\n## VirusPro\n\nVirusPro is an antivirus that works best with Mac.\n\nThey offer tips and tutorials for utilizing information provided by them, such as installation and data/spam scanning. Also","logprobs":null}],"created":1712927025,"model":"mistralai/Mistral-7B-v0.1","system_fingerprint":"local","object":"text_completion","usage":{"completion_tokens":579,"prompt_tokens":5,"total_tokens":584,"avg_tok_per_sec":30.635262,"avg_prompt_tok_per_sec":32.467533,"avg_compl_tok_per_sec":30.62034,"avg_sample_tok_per_sec":77.887436,"total_time_sec":19.063,"total_prompt_time_sec":0.154,"total_completion_time_sec":18.909,"total_sampling_time_sec":7.498}}} The process starts with ~14GB of memory and grows limitless until OOM. With fewer |
One thing that I noticed is that the error only happens on non-quantized models; quantized models do not seem to have that problem. I've tested the llama-index integration, which works with a large number of output tokens (easily more than 500) on an A10 with 24GB. #44 seems to be similar, and I'm not really sure why this is happening. I'll take another look.
@EricLBuehler yeah, it seems like it's overgrowing the KV cache or something like that. It doesn't happen with Candle's original implementation, BTW.
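To make that suspicion concrete, here is a minimal sketch of how a decode-time KV cache can end up consuming memory proportional to the number of generated tokens: each step concatenates the new keys/values into a brand-new, slightly larger tensor, and if the superseded tensors (or copies derived from them) stay reachable, usage climbs token by token. This is written against Candle's public tensor API purely as an illustration; `NaiveKvCache` and `append` are hypothetical names, not the actual mistral.rs implementation.

```rust
// Illustrative only: a naive per-layer KV cache in the style of Candle-based decoders.
use candle_core::{Result, Tensor};

#[derive(Default)]
struct NaiveKvCache {
    k: Option<Tensor>,
    v: Option<Tensor>,
}

impl NaiveKvCache {
    // Appends this step's keys/values ([batch, kv_heads, new_tokens, head_dim])
    // along the sequence dimension and returns the full cache for attention.
    fn append(&mut self, k_new: &Tensor, v_new: &Tensor) -> Result<(Tensor, Tensor)> {
        let k = match &self.k {
            Some(prev) => Tensor::cat(&[prev, k_new], 2)?,
            None => k_new.clone(),
        };
        let v = match &self.v {
            Some(prev) => Tensor::cat(&[prev, v_new], 2)?,
            None => v_new.clone(),
        };
        // Every call allocates a fresh, larger tensor. If the previous cache
        // tensors or downstream copies made from them are kept alive somewhere,
        // memory grows with every generated token.
        self.k = Some(k.clone());
        self.v = Some(v.clone());
        Ok((k, v))
    }
}
```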
@lucasavila00, do you think you can take a look at this? I've been trying to find what is wrong but have made no progress, and a second pair of eyes would help. I have the […]. Interestingly, during debugging, I discovered that even after disabling the mistral.rs KV cache mechanism and reverting to the official Candle implementation, the problem persists. Additionally, this is not a problem for the quantized models.
@EricLBuehler I'll give it a shot. I can't run it locally on CUDA though (too little VRAM), which makes it harder to debug (I only know how to open the visual profiler locally, etc.). But I'll try CPU or, if that doesn't work, a VM.
Thanks! Let me know if you find anything.
I can reproduce it on CPU. To generate this amount of text: […]

It used 6 GB of RAM unquantized and 100 MB quantized, both running on CPU. As far as I know, quantization should only affect the weights, right? The activations and KV cache are still full precision on quantized models, right? So the RAM usage should increase by the same amount whether or not the model is quantized... 🤔
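That expectation can be sanity-checked with a back-of-envelope estimate of how large the full-precision KV cache alone should be for this run. The hyperparameters below (32 layers, 8 KV heads, head dim 128) are the commonly published Mistral-7B values, and f32 is assumed for the cache; these are assumptions for the estimate, not numbers read out of mistral.rs.

```rust
// Back-of-envelope KV-cache size; all hyperparameters are assumed, not measured.
fn main() {
    let layers = 32u64;
    let kv_heads = 8u64; // Mistral-7B uses grouped-query attention with 8 KV heads
    let head_dim = 128u64;
    let bytes_per_elem = 4u64; // f32
    let tokens = 584u64; // roughly the completion reported earlier in this thread

    // K and V together, per cached token, across all layers.
    let per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem;
    println!("per token: {} KiB", per_token / 1024); // 256 KiB
    println!(
        "for {} tokens: ~{} MiB",
        tokens,
        per_token * tokens / (1024 * 1024)
    ); // ~146 MiB
}
```

If that estimate is in the right ballpark, the cache for a ~580-token completion is on the order of 150 MB, so multi-gigabyte growth suggests intermediate tensors are being retained rather than the cache simply being large.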
@EricLBuehler I did a heap dump of it and, weirdly, 5 GB were in `repeat_kv`, plus another 2 GB in `kvconcat`: […] Compared to the quantized model, where no memory leaked: […]
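For context on why `repeat_kv` is a plausible place for large allocations: Mistral uses grouped-query attention, so the cached K/V carry fewer heads than the queries and get expanded to match before the attention matmul. A generic Candle-style sketch is below; it illustrates the usual technique and is not the exact mistral.rs or Candle source. Because it runs over the entire cache on every decoding step, each call materializes a tensor several times the cache's size, and if those outputs were retained they would dominate a heap dump in exactly this way.

```rust
// Generic repeat_kv sketch for grouped-query attention.
// Input shape assumed to be [batch, n_kv_heads, seq_len, head_dim].
use candle_core::{Result, Tensor};

fn repeat_kv(xs: &Tensor, n_rep: usize) -> Result<Tensor> {
    if n_rep == 1 {
        return Ok(xs.clone());
    }
    let (b, n_kv_heads, seq_len, head_dim) = xs.dims4()?;
    // Insert a repeat axis, broadcast across it, then fold it into the head axis.
    // The final reshape materializes a contiguous copy that is n_rep times the
    // size of the whole cached K/V, freshly allocated on every decoding step.
    xs.unsqueeze(2)?
        .expand((b, n_kv_heads, n_rep, seq_len, head_dim))?
        .reshape((b, n_kv_heads * n_rep, seq_len, head_dim))
}
```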
Thanks, that is very useful. Were you running an X-LoRA model? It looks like […]
I'm running: […] and […]

I'm looking at the dumps: the GGUF version leaks no memory at all (only the model), while the regular version leaks. I opened the biggest leaks here: […]
It is very strange that […]
Yeah, it's weird. Once I added the panic, it reports the correct function too. I also enabled debug symbols with the profiling build profile.
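The profile snippet itself isn't preserved in this thread; a typical way to get debug symbols in an optimized build is a custom Cargo profile along the lines of the sketch below (an assumed reconstruction, not necessarily the exact configuration used), built with `cargo build --profile profiling`.

```toml
# Hypothetical reconstruction of a "profiling" profile: release optimizations
# plus debug info so the profiler/heap dumper can resolve symbols.
[profile.profiling]
inherits = "release"
debug = true
```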
I'm running the Candle example.
No leaks. It even freed the model. I also watched the system resource monitor, and it never went above the 29 GB it takes to load the model initially. Ah, this was on […]
Thank you for getting those traces! I wish we had something like miri for CUDA; that would be very helpful here. From my testing on the branch I mentioned above, the only difference is that we use a custom Candle branch. The only changes there are for CUDA, though, so I'm a bit confused as to why this is happening. I'll take a deeper look.
Moved to #156.
Please let us know what model architectures you would like added!
Quantized architectures: […]