
RuntimeError: CUDA error: invalid device ordinal (Llama 3.1 8B works, but Llama 3.1 70B and 3.3 70B do not) #685

Open
chai-dev682 opened this issue Dec 24, 2024 · 0 comments


System Info

I am using the runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04 container image.
I am deploying on 1 × A100.

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

I am following your readthedocs.io documentation and deploying on a RunPod instance (1 × A100).

When I deploy Llama 3.1 8B, it starts an HTTP server on port 5001 that I can query.
But when I deploy Llama 3.1 70B or 3.3 70B, the HTTP server fails to start and I get the error below:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

How can I fix this issue?
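
For what it's worth, the error itself is easy to reproduce in isolation: `torch.cuda.set_device` raises "invalid device ordinal" whenever the requested device index is not smaller than the number of visible GPUs. A minimal sketch (the index 1 here is just an illustration; any ordinal >= `torch.cuda.device_count()` triggers it):

```python
import torch

# On a 1 x A100 pod only ordinal 0 is valid.
print(torch.cuda.device_count())  # -> 1

# Any higher ordinal reproduces the reported error:
# RuntimeError: CUDA error: invalid device ordinal
torch.cuda.set_device(1)
```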

Error logs

/root/miniconda3/envs/llamastack-meta-reference-gpu/bin/python -m llama_stack.distribution.server.server --yaml-config /root/.llama/distributions/llamastack-meta-reference-gpu/meta-reference-gpu-run.yaml --port 5001 --env INFERENCE_MODEL=meta-llama/Llama-3.3-70B-Instruct
Setting CLI environment variable INFERENCE_MODEL => meta-llama/Llama-3.3-70B-Instruct
Using config file: /root/.llama/distributions/llamastack-meta-reference-gpu/meta-reference-gpu-run.yaml
Run configuration:
apis:
- agents
- datasetio
- eval
- inference
- memory
- safety
- scoring
- telemetry
conda_env: meta-reference-gpu
datasets: []
docker_image: null
eval_tasks: []
image_name: meta-reference-gpu
memory_banks: []
metadata_store:
  db_path: /root/.llama/distributions/meta-reference-gpu/registry.db
  namespace: null
  type: sqlite
models:
- metadata: {}
  model_id: meta-llama/Llama-3.3-70B-Instruct
  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
  - llm
  provider_id: meta-reference-inference
  provider_model_id: null
- metadata:
    embedding_dimension: 384
  model_id: all-MiniLM-L6-v2
  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
  - embedding
  provider_id: sentence-transformers
  provider_model_id: null
providers:
  agents:
  - config:
      persistence_store:
        db_path: /root/.llama/distributions/meta-reference-gpu/agents_store.db
        namespace: null
        type: sqlite
    provider_id: meta-reference
    provider_type: inline::meta-reference
  datasetio:
  - config: {}
    provider_id: huggingface
    provider_type: remote::huggingface
  - config: {}
    provider_id: localfs
    provider_type: inline::localfs
  eval:
  - config: {}
    provider_id: meta-reference
    provider_type: inline::meta-reference
  inference:
  - config:
      checkpoint_dir: 'null'
      max_seq_len: 4096
      model: meta-llama/Llama-3.3-70B-Instruct
    provider_id: meta-reference-inference
    provider_type: inline::meta-reference
  - config: {}
    provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
  memory:
  - config:
      kvstore:
        db_path: /root/.llama/distributions/meta-reference-gpu/faiss_store.db
        namespace: null
        type: sqlite
    provider_id: faiss
    provider_type: inline::faiss
  safety:
  - config: {}
    provider_id: llama-guard
    provider_type: inline::llama-guard
  scoring:
  - config: {}
    provider_id: basic
    provider_type: inline::basic
  - config: {}
    provider_id: llm-as-judge
    provider_type: inline::llm-as-judge
  - config:
      openai_api_key: ''
    provider_id: braintrust
    provider_type: inline::braintrust
  telemetry:
  - config:
      service_name: llama-stack
      sinks: console,sqlite
      sqlite_db_path: /root/.llama/distributions/meta-reference-gpu/trace_store.db
    provider_id: meta-reference
    provider_type: inline::meta-reference
scoring_fns: []
shields: []
version: '2'

Warning: bwrap is not available. Code interpreter tool will not work correctly.

initializing model parallel with size 8
initializing ddp with size 1
initializing pipeline with size 1
W1224 07:25:49.426000 9468 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 9534 via signal SIGTERM
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] failed (exitcode: 1) local_rank: 1 (pid: 9535) of fn: worker_process_entrypoint (start_method: fork)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Traceback (most recent call last):
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 687, in _poll
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] self._pc.join(-1)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] raise ProcessRaisedException(msg, error_index, failed_process.pid)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] torch.multiprocessing.spawn.ProcessRaisedException:
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] -- Process 1 terminated with the following error:
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Traceback (most recent call last):
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] fn(i, *args)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 611, in _wrap
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] ret = record(fn)(*args_)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] return f(*args, **kwargs)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 242, in worker_process_entrypoint
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] model = init_model_cb()
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/model_parallel.py", line 46, in init_model_cb
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] llama = Llama.build(config, model_id, llama_model)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/generation.py", line 104, in build
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] torch.cuda.set_device(local_rank)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] torch._C._cuda_setDevice(device)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] RuntimeError: CUDA error: invalid device ordinal
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 284, in launch_dist_group
elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
worker_process_entrypoint FAILED


Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-24_07:25:48
host : ac0adcc2e518
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 9535)
error_file: /tmp/torchelastic_fejjk5r7/f1e6dd9a-6083-4d4e-a797-b6734d8bf410_ot3sjget/attempt_0/1/error.json
traceback : Traceback (most recent call last):
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 242, in worker_process_entrypoint
model = init_model_cb()
File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/model_parallel.py", line 46, in init_model_cb
llama = Llama.build(config, model_id, llama_model)
File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/generation.py", line 104, in build
torch.cuda.set_device(local_rank)
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/cuda/init.py", line 478, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
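
The log above shows `initializing model parallel with size 8`, so each of the eight model-parallel worker ranks calls `torch.cuda.set_device(local_rank)`, and ranks 1–7 fail immediately on a pod that exposes a single GPU. A rough diagnostic sketch, assuming the Meta-style checkpoint layout where one `consolidated.*.pth` shard exists per model-parallel rank (the checkpoint path below is hypothetical):

```python
import glob
import os

import torch

# Hypothetical download location; point this at wherever the 70B weights actually live.
ckpt_dir = os.path.expanduser("~/.llama/checkpoints/Llama3.3-70B-Instruct")

# One consolidated.*.pth shard per model-parallel rank in Meta-style checkpoints.
mp_size = len(glob.glob(os.path.join(ckpt_dir, "consolidated.*.pth")))
gpus = torch.cuda.device_count()

print(f"model-parallel ranks expected: {mp_size}, visible GPUs: {gpus}")
if gpus < mp_size:
    print("Every rank needs its own GPU; any rank >= the GPU count "
          "fails with 'CUDA error: invalid device ordinal'.")
```

If that check confirms the 8-vs-1 mismatch, the 70B models would presumably need a pod with at least eight GPUs (or a serving setup that fits the model on fewer ranks), whereas the 8B checkpoint is a single shard and therefore runs fine on one A100.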

Expected behavior

The HTTP server should start and serve requests, as it does with Llama 3.1 8B.
