Apparent memory leak in DecoderLayer of non-quantized models #134
I did some further profiling, and the issue is not a memory leak but the peak memory usage. Regardless, the KV cache for ~64 tokens of Mistral should take roughly 32 MB: we get head_size=128, n_heads=8, n_layers=32, n_bytes=4, and 2 caches. However, the model allocates ~2.5 GB of data for this KV cache. Disabling the cache fixes it. One interesting thing is that removing calls to […]. After hours of looking into it, my latest guess is that it's a candle bug that makes it not de-allocate temporary data when the KV cache is enabled.
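With exactly the figures quoted above, the product comes out to about 16 MiB, i.e. tens of MB at most, which only underlines how far off the observed ~2.5 GB is. A minimal sketch of that arithmetic (variable names are mine, not mistral.rs identifiers):

```rust
// Rough KV-cache size estimate using the figures quoted in the comment above.
fn main() {
    let tokens: usize = 64;      // ~64 cached tokens
    let head_size: usize = 128;  // head dimension
    let n_kv_heads: usize = 8;   // KV heads per layer
    let n_layers: usize = 32;    // decoder layers
    let n_bytes: usize = 4;      // f32 elements
    let n_caches: usize = 2;     // one K and one V cache per layer

    let bytes = tokens * head_size * n_kv_heads * n_layers * n_bytes * n_caches;
    // Prints roughly 16 MiB -- nowhere near the ~2.5 GB actually observed
    // when the cache is enabled.
    println!("expected KV cache size: {:.1} MiB", bytes as f64 / (1024.0 * 1024.0));
}
```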
Thanks for taking a look. I agree, it must be a problem in our custom Candle branch. I'll probably open a PR to fix this.
@lucasavila00, this also happens on the hf_candle branch which uses the official and current Candle implementation (not our branch). Can you please check if the profiling results are the same? If so, I will raise an issue on Candle.
@EricLBuehler it has the same issues.
Interesting. I'll raise an issue on Candle.
@EricLBuehler It's gone if I use candle's […]. I double-checked this.
Wow, that is amazing! Can you open a PR so that I can take a look?
We mmap the safetensors in a thread pool and then collect them; maybe the mmapping in different threads is the problem?
PR https://github.com/EricLBuehler/mistral.rs/pull/144/files
Beats me. I know little of concurrent/parallel/thread programming.
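For context, here is a minimal sketch of the candle loading path referred to above, under the assumption that the fix was switching to candle's `VarBuilder::from_mmaped_safetensors`; the file names are hypothetical, and the thread-pool variant is only described in comments rather than shown:

```rust
use candle_core::{DType, Device};
use candle_nn::VarBuilder;

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // Hypothetical shard names; substitute the actual model files.
    let paths = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"];

    // Candle's own loader: mmap every shard on the calling thread and expose the
    // tensors through a single VarBuilder. This is the loading path reported above
    // as making the excessive peak memory usage disappear.
    // Safety: the mapped files must not be modified while they are in use.
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&paths, DType::F32, &device)? };

    // The suspect alternative (not shown here): mmap each shard in a worker thread,
    // materialize its tensors, and collect everything into one map before building
    // the VarBuilder -- the thread-pool approach described in the comments above.
    let _ = vb;
    Ok(())
}
```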
Refs some comments in #49.
Originally posted by @lucasavila00 in #49 (comment)