CUDA out of memory with a presumed "full" offload to CPU #751

Open · av opened this issue Sep 4, 2024 · 0 comments
Labels: bug (Something isn't working)
av commented Sep 4, 2024

Describe the bug

I'm trying to run mistral.rs on a VRAM-constrained system (16 GB VRAM, 64 GB RAM) via the Docker image:

ghcr.io/ericlbuehler/mistral.rs:cuda-80-0.3

The arguments for the server are:

--port 8021
--serve-ip 0.0.0.0
--token-source env:HF_TOKEN
--no-paged-attn 
--no-kv-cache 
-n 0 
plain 
-m microsoft/Phi-3.5-MoE-instruct 
-a phi3.5moe

As you can see, I'm trying to configure everything to avoid using the GPU; however, it is still used very actively and CUDA fails with an OOM.

[screenshot: CUDA out-of-memory error]

I'm not providing the full logs as plain text because they become completely garbled when the CLI renders a progress "loader" in this setup (see the screenshot below for reference); I doubt they would be comprehensible.

Broken log output

[screenshot: garbled log output and stack trace]

Here's the output before it gets corrupted; unfortunately, there is likely nothing useful in it.

Head of the OOM log

harbor.mistralrs  | 2024-09-04T13:38:41.252088Z  INFO mistralrs_server: avx: false, neon: false, simd128: false, f16c: false
harbor.mistralrs  | 2024-09-04T13:38:41.252103Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
harbor.mistralrs  | 2024-09-04T13:38:41.252109Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
harbor.mistralrs  | 2024-09-04T13:38:41.252123Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
harbor.mistralrs  | 2024-09-04T13:38:41.252164Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `microsoft/Phi-3.5-MoE-instruct`
harbor.mistralrs  | 2024-09-04T13:38:41.252188Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `microsoft/Phi-3.5-MoE-instruct`
harbor.mistralrs  | 2024-09-04T13:38:41.499832Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00017.safetensors", "model-00002-of-00017.safetensors", "model-00003-of-00017.safetensors", "model-00004-of-00017.safetensors", "model-00005-of-00017.safetensors", "model-00006-of-00017.safetensors", "model-00007-of-00017.safetensors", "model-00008-of-00017.safetensors", "model-00009-of-00017.safetensors", "model-00010-of-00017.safetensors", "model-00011-of-00017.safetensors", "model-00012-of-00017.safetensors", "model-00013-of-00017.safetensors", "model-00014-of-00017.safetensors", "model-00015-of-00017.safetensors", "model-00016-of-00017.safetensors", "model-00017-of-00017.safetensors"]
harbor.mistralrs  | 2024-09-04T13:38:41.640566Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `microsoft/Phi-3.5-MoE-instruct`
harbor.mistralrs  | 2024-09-04T13:38:41.933136Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `microsoft/Phi-3.5-MoE-instruct`
harbor.mistralrs  | 2024-09-04T13:38:41.933252Z  INFO mistralrs_core::device_map: Model has 32 repeating layers.
harbor.mistralrs  | 2024-09-04T13:38:41.933255Z  INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
harbor.mistralrs  | 2024-09-04T13:38:41.933259Z  INFO mistralrs_core::device_map: Layer 0: cuda[0]
harbor.mistralrs  | 2024-09-04T13:38:41.933260Z  INFO mistralrs_core::device_map: Layer 1: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933261Z  INFO mistralrs_core::device_map: Layer 2: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933261Z  INFO mistralrs_core::device_map: Layer 3: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933262Z  INFO mistralrs_core::device_map: Layer 4: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933263Z  INFO mistralrs_core::device_map: Layer 5: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933263Z  INFO mistralrs_core::device_map: Layer 6: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933264Z  INFO mistralrs_core::device_map: Layer 7: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933265Z  INFO mistralrs_core::device_map: Layer 8: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933265Z  INFO mistralrs_core::device_map: Layer 9: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933266Z  INFO mistralrs_core::device_map: Layer 10: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933267Z  INFO mistralrs_core::device_map: Layer 11: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933267Z  INFO mistralrs_core::device_map: Layer 12: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933268Z  INFO mistralrs_core::device_map: Layer 13: cpu

Plucking some output from the corrupted stack trace (see the screenshot above for the full trace):

<candle_core::cuda_backend::CudaStorage as candle_core::backend::BackendStorage>::to_dtype
<mistralrs_core::utils::varbuilder_utils::SafetensorBackend as mistralrs_core::utils::varbuilder_utils::TensorLoaderBackend>::load_name
core::ops::function::FnOnce::call_once{{vtable.shim}}
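If I'm reading the trace correctly, the dtype cast of the weights is happening in CUDA storage while the safetensors are being loaded. Purely as an illustration of that pattern (my assumption about what the trace corresponds to, not actual mistral.rs code), here is a minimal candle sketch showing that casting a GPU-resident tensor allocates another buffer in VRAM:

// Hypothetical sketch, not mistral.rs code: casting a CUDA-resident tensor
// with candle allocates a new buffer on the GPU, so performing the cast there
// during weight loading consumes VRAM even if the layer's final device is CPU.
use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let gpu = Device::new_cuda(0)?; // requires candle's cuda feature
    // ~64 MiB of f32 placed directly on the GPU...
    let w = Tensor::zeros((4096, 4096), DType::F32, &gpu)?;
    // ...and this cast briefly holds both the f32 and the bf16 copies in VRAM.
    let _w_bf16 = w.to_dtype(DType::BF16)?;
    Ok(())
}

Note also that even with -n 0, the device mapping above still places Layer 0 on cuda[0].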

Latest commit or version

Docker image:

ghcr.io/ericlbuehler/mistral.rs:cuda-80-0.3