
Apparent memory leak in DecoderLayer of non-quantized models #134

Closed
EricLBuehler opened this issue Apr 14, 2024 · 9 comments · Fixed by #154
Labels: bug (Something isn't working), urgent

Comments

@EricLBuehler
Owner

Refs some comments in #49.

[memory profiler screenshot]

// 1.5gb
__GI___clone3 in libc.so.6
start_thread in libc.so.6
std::sys::pal::unix::thread::Thread::new::thread_start::h40e6fd3f8ce15a14 in mistralrs-server
core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hea2f92b31a4a3afa in mistralrs-server
std::sys_common::backtrace::__rust_begin_short_backtrace::hc791cc3c3af3a6b3 in mistralrs-server
mistralrs_core::engine::Engine::run::hf69ab0a23e9700c2 in mistralrs-server
_$LT$mistralrs_core..pipeline..mistral..MistralPipeline$u20$as$u20$mistralrs_core..pipeline..Pipeline$GT$::forward::ha7a7d564e9679585 in mistralrs-server
mistralrs_core::models::mistral::Model::forward::h65f13a2e19e8d611 in mistralrs-server
mistralrs_core::models::mistral::DecoderLayer::forward::hff6dd2034ec6664b in mistralrs-server
mistralrs_core::xlora_models::mistral::Attention::repeat_kv::h44480ea00c7561e8 in mistralrs-server
candle_core::tensor::Tensor::reshape::h8992546c4121d565 in mistralrs-server
candle_core::device::Device::alloc_uninit::h8ba026683b968a15 in mistralrs-server
_$LT$candle_core..cpu_backend..CpuDevice$u20$as$u20$candle_core..backend..BackendDevice$GT$::alloc_uninit::he0f5b8080cf5aeb7 in mistralrs-server

// 0.75gb
__GI___clone3 in libc.so.6
start_thread in libc.so.6
std::sys::pal::unix::thread::Thread::new::thread_start::h40e6fd3f8ce15a14 in mistralrs-server
core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hea2f92b31a4a3afa in mistralrs-server
std::sys_common::backtrace::__rust_begin_short_backtrace::hc791cc3c3af3a6b3 in mistralrs-server
mistralrs_core::engine::Engine::run::hf69ab0a23e9700c2 in mistralrs-server
_$LT$mistralrs_core..pipeline..mistral..MistralPipeline$u20$as$u20$mistralrs_core..pipeline..Pipeline$GT$::forward::ha7a7d564e9679585 in mistralrs-server
mistralrs_core::models::mistral::Model::forward::h65f13a2e19e8d611 in mistralrs-server
mistralrs_core::models::mistral::DecoderLayer::forward::hff6dd2034ec6664b in mistralrs-server
candle_nn::ops::kvconcat::h1329a0a6eafa7765 in mistralrs-server
candle_core::tensor_cat::_$LT$impl$u20$candle_core..tensor..Tensor$GT$::cat::h7c43ea827a13b08e in mistralrs-server
candle_core::tensor_cat::_$LT$impl$u20$candle_core..tensor..Tensor$GT$::cat_contiguous::hd9b8c02cb3a6334d in mistralrs-server
candle_core::device::Device::alloc_uninit::h8ba026683b968a15 in mistralrs-server
_$LT$candle_core..cpu_backend..CpuDevice$u20$as$u20$candle_core..backend..BackendDevice$GT$::alloc_uninit::he0f5b8080cf5aeb7 in mistralrs-server

Originally posted by @lucasavila00 in #49 (comment)

EricLBuehler added the bug (Something isn't working) label on Apr 14, 2024
EricLBuehler self-assigned this on Apr 14, 2024
@lucasavila00
Contributor

I did some further profiling; the issue is not a steady memory leak but peak memory usage.
I could not tell whether the peak came from previously leaked allocations or from a single large allocation that leaked.

Regardless, the KV cache for ~64 tokens of Mistral should only take a few tens of MB at most. With head_dim=128, n_kv_heads=8, n_layers=32, n_bytes=4 (fp32), and 2 caches (K and V), that is 128*8*32*4*2 = 262,144 bytes per token, i.e. roughly 16 MB for 64 tokens.
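
For reference, the same back-of-the-envelope calculation as runnable Rust (just the arithmetic from the comment above, not mistral.rs code):

```rust
fn main() {
    // Per-token KV cache footprint for Mistral-7B in fp32, using the numbers above.
    let head_dim: usize = 128; // dimension of each attention head
    let n_kv_heads = 8;        // key/value heads
    let n_layers = 32;         // decoder layers
    let n_bytes = 4;           // fp32
    let n_caches = 2;          // one K cache and one V cache

    let bytes_per_token = head_dim * n_kv_heads * n_layers * n_bytes * n_caches;
    let tokens = 64;
    println!("{bytes_per_token} bytes per token");                                    // 262144
    println!("{} MiB for {tokens} tokens", bytes_per_token * tokens / (1024 * 1024)); // 16 MiB
}
```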

However, the model allocates ~2.5 GB for this KV cache. Disabling the cache fixes it. One interesting thing: removing the calls to copy_out_model did not reduce peak usage unless set_model_none was also called to de-allocate the data.

After hours of looking into it, my latest guess is that it's a candle bug that prevents temporary data from being de-allocated when the KV cache is enabled.
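
For context, a minimal sketch of the typical KV-cache append pattern in candle (illustrative only, not mistral.rs's or candle's actual code): every step's Tensor::cat allocates a new, larger buffer, so peak usage depends on how promptly the old buffers and any intermediate contiguous copies are dropped.

```rust
use candle_core::{DType, Device, Result, Tensor};

// Append a new key/value slice to the running cache by concatenating along the
// sequence dimension (dim 2 for a [batch, kv_heads, seq, head_dim] layout).
fn append_to_cache(cache: Option<&Tensor>, new_kv: &Tensor) -> Result<Tensor> {
    match cache {
        Some(prev) => Tensor::cat(&[prev, new_kv], 2),
        None => Ok(new_kv.clone()),
    }
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let mut cache: Option<Tensor> = None;
    for _step in 0..4 {
        // Hypothetical shapes: batch=1, kv_heads=8, 1 new token, head_dim=128.
        let new_kv = Tensor::zeros((1, 8, 1, 128), DType::F32, &dev)?;
        cache = Some(append_to_cache(cache.as_ref(), &new_kv)?);
    }
    println!("cache dims: {:?}", cache.unwrap().dims());
    Ok(())
}
```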

@EricLBuehler
Owner Author

Thanks for taking a look. I agree, it must be a problem in our custom Candle branch. I'll probably open a PR to fix this.

@EricLBuehler
Owner Author

EricLBuehler commented Apr 14, 2024

@lucasavila00, this also happens on the hf_candle branch, which uses the official, current Candle implementation rather than our branch. Can you please check whether the profiling results are the same? If so, I will raise an issue on Candle.

@lucasavila00
Contributor

[memory profiler screenshot]

@EricLBuehler it has the same issues.

@EricLBuehler
Owner Author

Interesting. I'll raise an issue on Candle.

@lucasavila00
Contributor

@EricLBuehler It's gone if I use candle's from_mmaped_safetensors

I double-checked this
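
For reference, a minimal sketch of what loading through candle's memory-mapped loader looks like (the function name, paths, and dtype here are placeholders, not the actual mistral.rs code):

```rust
use candle_core::{DType, Device, Result};
use candle_nn::VarBuilder;
use std::path::PathBuf;

// Build a VarBuilder over memory-mapped safetensors files instead of reading
// the weights into freshly allocated buffers.
fn load_weights(paths: &[PathBuf], device: &Device) -> Result<VarBuilder<'static>> {
    // Safety: the mapped files must not be modified or truncated while the
    // VarBuilder (or any tensor created from it) is alive.
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(paths, DType::F32, device)? };
    Ok(vb)
}
```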

@EricLBuehler
Owner Author

Wow, that is amazing! Can you open a PR so that I can take a look?

@EricLBuehler
Owner Author

We mmap the safetensors in a thread pool and then collect them; maybe the mmapping in different threads is the problem?

@lucasavila00
Contributor

PR https://github.com/EricLBuehler/mistral.rs/pull/144/files

> We mmap the safetensors in a thread pool and then collect them; maybe the mmapping in different threads is the problem?

Beats me. I know little of concurrent/parallel/thread programming.
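
For illustration only, a hypothetical sketch of the "mmap each shard on a worker thread, then collect the maps" pattern being discussed (using memmap2 and scoped threads; this is not the actual mistral.rs loader):

```rust
use memmap2::Mmap;
use std::fs::File;
use std::io;
use std::path::PathBuf;
use std::thread;

// Map every safetensors shard on its own scoped thread, then gather the maps
// on the calling thread.
fn mmap_shards(paths: &[PathBuf]) -> io::Result<Vec<Mmap>> {
    thread::scope(|s| {
        let handles: Vec<_> = paths
            .iter()
            .map(|p| {
                s.spawn(move || -> io::Result<Mmap> {
                    let file = File::open(p)?;
                    // Safety: the files must not be modified while mapped.
                    unsafe { Mmap::map(&file) }
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("mmap worker panicked"))
            .collect()
    })
}
```

Whether mapping on different threads matters at all, or only how tensors are later copied out of the maps, is exactly the open question here.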
