CUDA out of memory with a presumed "full" offload to CPU #751

Open · av opened this issue Sep 4, 2024 · 0 comments
Labels: bug (Something isn't working)
av commented Sep 4, 2024

Describe the bug

I'm trying to run mistral.rs on a VRAM-constrained system (16 GB VRAM, 64 GB RAM) via the Docker image:

ghcr.io/ericlbuehler/mistral.rs:cuda-80-0.3

The arguments for the server are:

--port 8021
--serve-ip 0.0.0.0
--token-source env:HF_TOKEN
--no-paged-attn 
--no-kv-cache 
-n 0 
plain 
-m microsoft/Phi-3.5-MoE-instruct 
-a phi3.5moe

As you can see, I'm trying to configure everything to avoid using the GPU; however, it is still used very actively and CUDA fails with an OOM.

[screenshot: CUDA out-of-memory error]

I'm not providing the full logs as plain text because they become completely garbled when the CLI renders a progress "loader" in this setup (see the screenshot below for reference); I doubt they would be comprehensible.

Broken log output

[screenshot: garbled log output and stack trace]

Here's the output before it gets corrupted; unfortunately, there is likely nothing useful in it.

Head of the OOM log

harbor.mistralrs  | 2024-09-04T13:38:41.252088Z  INFO mistralrs_server: avx: false, neon: false, simd128: false, f16c: false
harbor.mistralrs  | 2024-09-04T13:38:41.252103Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
harbor.mistralrs  | 2024-09-04T13:38:41.252109Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
harbor.mistralrs  | 2024-09-04T13:38:41.252123Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
harbor.mistralrs  | 2024-09-04T13:38:41.252164Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `microsoft/Phi-3.5-MoE-instruct`
harbor.mistralrs  | 2024-09-04T13:38:41.252188Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `microsoft/Phi-3.5-MoE-instruct`
harbor.mistralrs  | 2024-09-04T13:38:41.499832Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00017.safetensors", "model-00002-of-00017.safetensors", "model-00003-of-00017.safetensors", "model-00004-of-00017.safetensors", "model-00005-of-00017.safetensors", "model-00006-of-00017.safetensors", "model-00007-of-00017.safetensors", "model-00008-of-00017.safetensors", "model-00009-of-00017.safetensors", "model-00010-of-00017.safetensors", "model-00011-of-00017.safetensors", "model-00012-of-00017.safetensors", "model-00013-of-00017.safetensors", "model-00014-of-00017.safetensors", "model-00015-of-00017.safetensors", "model-00016-of-00017.safetensors", "model-00017-of-00017.safetensors"]
harbor.mistralrs  | 2024-09-04T13:38:41.640566Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `microsoft/Phi-3.5-MoE-instruct`
harbor.mistralrs  | 2024-09-04T13:38:41.933136Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `microsoft/Phi-3.5-MoE-instruct`
harbor.mistralrs  | 2024-09-04T13:38:41.933252Z  INFO mistralrs_core::device_map: Model has 32 repeating layers.
harbor.mistralrs  | 2024-09-04T13:38:41.933255Z  INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
harbor.mistralrs  | 2024-09-04T13:38:41.933259Z  INFO mistralrs_core::device_map: Layer 0: cuda[0]
harbor.mistralrs  | 2024-09-04T13:38:41.933260Z  INFO mistralrs_core::device_map: Layer 1: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933261Z  INFO mistralrs_core::device_map: Layer 2: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933261Z  INFO mistralrs_core::device_map: Layer 3: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933262Z  INFO mistralrs_core::device_map: Layer 4: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933263Z  INFO mistralrs_core::device_map: Layer 5: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933263Z  INFO mistralrs_core::device_map: Layer 6: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933264Z  INFO mistralrs_core::device_map: Layer 7: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933265Z  INFO mistralrs_core::device_map: Layer 8: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933265Z  INFO mistralrs_core::device_map: Layer 9: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933266Z  INFO mistralrs_core::device_map: Layer 10: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933267Z  INFO mistralrs_core::device_map: Layer 11: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933267Z  INFO mistralrs_core::device_map: Layer 12: cpu
harbor.mistralrs  | 2024-09-04T13:38:41.933268Z  INFO mistralrs_core::device_map: Layer 13: cpu

Plucking some output from the corrupted stack trace (see the screenshot above for the full trace):

<candle_core::cuda_backend::CudaStorage as candle_core::backend::BackendStorage>::to_dtype
<mistralrs_core::utils::varbuilder_utils::SafetensorBackend as mistralrs_core::utils::varbuilder_utils::TensorLoaderBackend>::load_name
core::ops::function::FnOnce::call_once{{vtable.shim}}
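If I'm reading the trace correctly, the dtype cast of the weights is happening in CUDA storage while the safetensors are being loaded. Purely as an illustration of that pattern (my assumption about what the trace corresponds to, not actual mistral.rs code), here is a minimal candle sketch showing that casting a GPU-resident tensor allocates another buffer in VRAM:

// Hypothetical sketch, not mistral.rs code: casting a CUDA-resident tensor
// with candle allocates a new buffer on the GPU, so performing the cast there
// during weight loading consumes VRAM even if the layer's final device is CPU.
use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let gpu = Device::new_cuda(0)?; // requires candle's cuda feature
    // ~64 MiB of f32 placed directly on the GPU...
    let w = Tensor::zeros((4096, 4096), DType::F32, &gpu)?;
    // ...and this cast briefly holds both the f32 and the bf16 copies in VRAM.
    let _w_bf16 = w.to_dtype(DType::BF16)?;
    Ok(())
}

Note also that even with -n 0, the device mapping above still places Layer 0 on cuda[0].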

Latest commit or version

Docker image:

ghcr.io/ericlbuehler/mistral.rs:cuda-80-0.3