Apparent memory leak in DecoderLayer of non-quantized models #134
I did some further profiling, and the issue is not a memory leak but the peak memory usage. Regardless, the KV cache for ~64 tokens of Mistral should take roughly 32 MB: we get head_size=128, n_heads=8, n_layers=32, n_bytes=4, and 2 caches. However, the model allocates ~2.5 GB of data for this KV cache. Disabling the cache fixes it. One interesting thing is that removing calls to […]. After hours of looking into it, my latest guess is that it's a candle bug that makes it not de-allocate temporary data when the KV cache is enabled.
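With exactly the figures quoted above, the product comes out to about 16 MiB, i.e. tens of MB at most, which only underlines how far off the observed ~2.5 GB is. A minimal sketch of that arithmetic (variable names are mine, not mistral.rs identifiers):

```rust
// Rough KV-cache size estimate using the figures quoted in the comment above.
fn main() {
    let tokens: usize = 64;      // ~64 cached tokens
    let head_size: usize = 128;  // head dimension
    let n_kv_heads: usize = 8;   // KV heads per layer
    let n_layers: usize = 32;    // decoder layers
    let n_bytes: usize = 4;      // f32 elements
    let n_caches: usize = 2;     // one K and one V cache per layer

    let bytes = tokens * head_size * n_kv_heads * n_layers * n_bytes * n_caches;
    // Prints roughly 16 MiB -- nowhere near the ~2.5 GB actually observed
    // when the cache is enabled.
    println!("expected KV cache size: {:.1} MiB", bytes as f64 / (1024.0 * 1024.0));
}
```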
Thanks for taking a look. I agree, it must be a problem in our custom Candle branch. I'll probably open a PR to fix this.
@lucasavila00, this also happens on the hf_candle branch which uses the official and current Candle implementation (not our branch). Can you please check if the profiling results are the same? If so, I will raise an issue on Candle.
@EricLBuehler it has the same issues.
Interesting. I'll raise an issue on Candle.
@EricLBuehler It's gone if I use candle's […]. I double-checked this.
Wow, that is amazing! Can you open a PR so that I can take a look?
We mmap the safetensors in a thread pool and then collect them; maybe the mmapping in different threads is the problem?
PR https://github.com/EricLBuehler/mistral.rs/pull/144/files
Beats me. I know little of concurrent/parallel/thread programming.
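For context, here is a minimal sketch of the candle loading path referred to above, under the assumption that the fix was switching to candle's `VarBuilder::from_mmaped_safetensors`; the file names are hypothetical, and the thread-pool variant is only described in comments rather than shown:

```rust
use candle_core::{DType, Device};
use candle_nn::VarBuilder;

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // Hypothetical shard names; substitute the actual model files.
    let paths = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"];

    // Candle's own loader: mmap every shard on the calling thread and expose the
    // tensors through a single VarBuilder. This is the loading path reported above
    // as making the excessive peak memory usage disappear.
    // Safety: the mapped files must not be modified while they are in use.
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&paths, DType::F32, &device)? };

    // The suspect alternative (not shown here): mmap each shard in a worker thread,
    // materialize its tensors, and collect everything into one map before building
    // the VarBuilder -- the thread-pool approach described in the comments above.
    let _ = vb;
    Ok(())
}
```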
Refs some comments in #49.
Originally posted by @lucasavila00 in #49 (comment)