When I deploy Llama 3.1 8B, it starts an HTTP server on port 5001 that I can query.
But when I deploy Llama 3.1 70B or Llama 3.3 70B, the HTTP server never comes up and I get the error below:
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
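For what it's worth, this is the error PyTorch raises whenever a process is told to use a GPU index that is not visible to it. A minimal sketch on a single-GPU pod (not llama-stack specific):

import torch

# A pod with a single A100 exposes exactly one CUDA device, index 0.
print(torch.cuda.device_count())  # -> 1

torch.cuda.set_device(0)  # fine
torch.cuda.set_device(1)  # RuntimeError: CUDA error: invalid device ordinal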
Warning: bwrap is not available. Code interpreter tool will not work correctly.
initializing model parallel with size 8
initializing ddp with size 1
initializing pipeline with size 1
W1224 07:25:49.426000 9468 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 9534 via signal SIGTERM
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] failed (exitcode: 1) local_rank: 1 (pid: 9535) of fn: worker_process_entrypoint (start_method: fork)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Traceback (most recent call last):
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 687, in _poll
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] self._pc.join(-1)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] raise ProcessRaisedException(msg, error_index, failed_process.pid)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] torch.multiprocessing.spawn.ProcessRaisedException:
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] -- Process 1 terminated with the following error:
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Traceback (most recent call last):
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] fn(i, *args)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 611, in wrap
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] ret = record(fn)(*args)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] return f(*args, **kwargs)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 242, in worker_process_entrypoint
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] model = init_model_cb()
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/model_parallel.py", line 46, in init_model_cb
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] llama = Llama.build(config, model_id, llama_model)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/generation.py", line 104, in build
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] torch.cuda.set_device(local_rank)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] torch._C._cuda_setDevice(device)
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] RuntimeError: CUDA error: invalid device ordinal
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]
E1224 07:25:49.545000 9468 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 284, in launch_dist_group
elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
worker_process_entrypoint FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-12-24_07:25:48
host : ac0adcc2e518
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 9535)
error_file: /tmp/torchelastic_fejjk5r7/f1e6dd9a-6083-4d4e-a797-b6734d8bf410_ot3sjget/attempt_0/1/error.json
traceback : Traceback (most recent call last):
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 242, in worker_process_entrypoint
model = init_model_cb()
File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/model_parallel.py", line 46, in init_model_cb
llama = Llama.build(config, model_id, llama_model)
File "/workspace/llama-stack/llama_stack/providers/inline/inference/meta_reference/generation.py", line 104, in build
torch.cuda.set_device(local_rank)
File "/root/miniconda3/envs/llamastack-meta-reference-gpu/lib/python3.10/site-packages/torch/cuda/init.py", line 478, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
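The log above shows model parallelism being initialized with size 8 while the pod exposes only one GPU, so every worker with local_rank >= 1 fails as soon as it calls torch.cuda.set_device(local_rank). A pre-launch sanity check along these lines (a hypothetical helper, not part of llama-stack) makes the mismatch explicit:

import torch

def check_gpu_count(model_parallel_size: int) -> None:
    # The meta-reference provider spawns one worker per model-parallel rank,
    # and each worker pins itself to cuda:<local_rank>.
    visible = torch.cuda.device_count()
    if visible < model_parallel_size:
        raise RuntimeError(
            f"model_parallel_size={model_parallel_size} requires that many GPUs, "
            f"but only {visible} device(s) are visible"
        )

check_gpu_count(8)  # the 70B checkpoints initialize model parallel with size 8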
System Info
I am using the runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04 container image.
I am deploying on 1 * A100.
Information
🐛 Describe the bug
I am following your readthedocs.io documentation.
I am deploying on a RunPod instance (1 * A100).
When I deploy Llama 3.1 8B, it starts an HTTP server on port 5001 that I can query.
But when I deploy Llama 3.1 70B or Llama 3.3 70B, the HTTP server fails to start with the error below:
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
How can I fix this issue?
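For context, querying the stack once the server is up (as happens with the 8B model) looks roughly like this. The snippet assumes the llama-stack-client Python SDK; parameter names may differ slightly between versions:

from llama_stack_client import LlamaStackClient

# Point the client at the port the server listens on.
client = LlamaStackClient(base_url="http://localhost:5001")

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",  # the deployment that does start
    messages=[{"role": "user", "content": "Hello"}],
)
print(response)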
Error logs
Setting CLI environment variable INFERENCE_MODEL => meta-llama/Llama-3.3-70B-Instruct
Using config file: /root/.llama/distributions/llamastack-meta-reference-gpu/meta-reference-gpu-run.yaml
Run configuration:
apis:
conda_env: meta-reference-gpu
datasets: []
docker_image: null
eval_tasks: []
image_name: meta-reference-gpu
memory_banks: []
metadata_store:
  db_path: /root/.llama/distributions/meta-reference-gpu/registry.db
  namespace: null
  type: sqlite
models:
- model_id: meta-llama/Llama-3.3-70B-Instruct
  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
  provider_id: meta-reference-inference
  provider_model_id: null
- embedding_dimension: 384
  model_id: all-MiniLM-L6-v2
  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
  provider_id: sentence-transformers
  provider_model_id: null
providers:
  agents:
  - persistence_store:
      db_path: /root/.llama/distributions/meta-reference-gpu/agents_store.db
      namespace: null
      type: sqlite
    provider_id: meta-reference
    provider_type: inline::meta-reference
  datasetio:
  - provider_id: huggingface
    provider_type: remote::huggingface
  - provider_id: localfs
    provider_type: inline::localfs
  eval:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
  inference:
  - checkpoint_dir: 'null'
    max_seq_len: 4096
    model: meta-llama/Llama-3.3-70B-Instruct
    provider_id: meta-reference-inference
    provider_type: inline::meta-reference
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
  memory:
  - kvstore:
      db_path: /root/.llama/distributions/meta-reference-gpu/faiss_store.db
      namespace: null
      type: sqlite
    provider_id: faiss
    provider_type: inline::faiss
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
  scoring:
  - provider_id: basic
    provider_type: inline::basic
  - provider_id: llm-as-judge
    provider_type: inline::llm-as-judge
  - openai_api_key: ''
    provider_id: braintrust
    provider_type: inline::braintrust
  telemetry:
  - service_name: llama-stack
    sinks: console,sqlite
    sqlite_db_path: /root/.llama/distributions/meta-reference-gpu/trace_store.db
    provider_id: meta-reference
    provider_type: inline::meta-reference
scoring_fns: []
shields: []
version: '2'
Warning: bwrap is not available. Code interpreter tool will not work correctly.
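The run configuration above points the meta-reference provider at meta-llama/Llama-3.3-70B-Instruct. For Meta-format checkpoints, the number of consolidated.*.pth shards in the checkpoint directory matches the model-parallel size the provider will initialize, which is a quick way to see how many GPUs a given checkpoint expects. A rough check, assuming the default llama download location (adjust the path if yours differs):

from pathlib import Path

# Assumed default location used by `llama model download`; adjust if needed.
ckpt_dir = Path.home() / ".llama" / "checkpoints" / "Llama3.3-70B-Instruct"
shards = sorted(ckpt_dir.glob("consolidated.*.pth"))
print(f"{len(shards)} shard(s) -> model parallel size {len(shards)}")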
Expected behavior
The HTTP server should start and serve requests, just as it does for the 8B model.