+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:17:00.0 Off | 0 |
| N/A 42C P0 69W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:65:00.0 Off | 0 |
| N/A 43C P0 71W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe Off | 00000000:CA:00.0 Off | 0 |
| N/A 35C P0 61W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe Off | 00000000:E3:00.0 Off | 0 |
| N/A 34C P0 64W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Deployment specificities: I am using Apptainer instead of Docker. I don't think Apptainer is the cause, since other inference queries work correctly.
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction

1. Create a SIF image of the suggested version of TGI:

   apptainer pull hf_tgi.sif docker://"ghcr.io/huggingface/text-generation-inference:2.4.0"

2. Run the meta-llama/Llama-3.2-11B-Vision-Instruct model:

   apptainer run --nv --env "HF_TOKEN=$$SECRET$$" --bind ./models:/data:rw hf_tgi.sif --model-id "meta-llama/Llama-3.2-11B-Vision-Instruct" --port 27685 --revision "cee5b78e6faed15d5f2e6d8a654fd5b247c0d5ca"
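For anyone reproducing with Docker rather than Apptainer, the equivalent invocation should look roughly like the sketch below, based on the standard TGI `docker run` form; the shared-memory size, volume path, and token placeholder are assumptions carried over from the Apptainer command, not something I have run:

```shell
# Hypothetical Docker equivalent of the Apptainer command above.
# --gpus all exposes the GPUs; the container listens on port 80 by
# default, so the host port is published with -p instead of --port.
docker run --gpus all --shm-size 1g -p 27685:80 \
  -e HF_TOKEN=$$SECRET$$ \
  -v ./models:/data \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id "meta-llama/Llama-3.2-11B-Vision-Instruct" \
  --revision "cee5b78e6faed15d5f2e6d8a654fd5b247c0d5ca"
```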
The model will download, and the web server will spin up.
After this, use curl to call the model with a grammar:
curl localhost:27685/generate -X POST -H 'Content-Type: application/json' -d '{
  "inputs": "I saw a puppy a cat and a raccoon during my bike ride in the park",
  "parameters": {
    "repetition_penalty": 1.3,
    "grammar": {
      "type": "json",
      "value": {
        "properties": {
          "location": { "type": "string" },
          "activity": { "type": "string" },
          "animals_seen": { "type": "integer", "minimum": 1, "maximum": 5 },
          "animals": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["location", "activity", "animals_seen", "animals"]
      }
    }
  }
}'
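To rule out a malformed request, the body above can be checked locally before sending it. This is only a sketch: it verifies that the payload parses as JSON and carries the grammar constraint, and the file path is an arbitrary choice.

```shell
# Write the exact request body from the curl call above to a file,
# then confirm it is well-formed JSON with the grammar constraint set.
cat > /tmp/grammar_request.json <<'EOF'
{
  "inputs": "I saw a puppy a cat and a raccoon during my bike ride in the park",
  "parameters": {
    "repetition_penalty": 1.3,
    "grammar": {
      "type": "json",
      "value": {
        "properties": {
          "location": { "type": "string" },
          "activity": { "type": "string" },
          "animals_seen": { "type": "integer", "minimum": 1, "maximum": 5 },
          "animals": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["location", "activity", "animals_seen", "animals"]
      }
    }
  }
}
EOF
# Parse the file and assert the grammar fields are present and correct.
python3 - <<'EOF'
import json
d = json.load(open("/tmp/grammar_request.json"))
assert d["parameters"]["grammar"]["type"] == "json"
assert d["parameters"]["grammar"]["value"]["required"] == ["location", "activity", "animals_seen", "animals"]
print("payload OK")
EOF
```

Once validated, the same file can be sent with `curl localhost:27685/generate -X POST -H 'Content-Type: application/json' -d @/tmp/grammar_request.json`, which is equivalent to the inline form above.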
TGI then fails with a series of device-side assert errors and exits, and curl returns:

{"error":"Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n","error_type":"generation"}
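To get an accurate stack trace for the failing kernel, the server can be relaunched with the environment variable the error message itself suggests. This is a sketch reusing the same options as the reproduction command above:

```shell
# Relaunch TGI with synchronous CUDA kernel launches so the device-side
# assert is reported at the correct call site (per the error message).
apptainer run --nv \
  --env "HF_TOKEN=$$SECRET$$" \
  --env "CUDA_LAUNCH_BLOCKING=1" \
  --bind ./models:/data:rw \
  hf_tgi.sif \
  --model-id "meta-llama/Llama-3.2-11B-Vision-Instruct" \
  --port 27685 \
  --revision "cee5b78e6faed15d5f2e6d8a654fd5b247c0d5ca"
```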
Please note that normal inference works both via curl and via the OpenAI-compatible API with the same model on the same machine, so the problem appears to be related to the "grammar" parameter. Using tools via the OpenAI-compatible API produces the exact same error.
Expected behavior
The model should return a JSON output as in the example provided in the documentation.
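For reference, the grammar above should force output shaped like the sample below. This is a quick offline check of that shape; the concrete values are made up for illustration, not actual model output:

```shell
# A made-up sample of the JSON the grammar should force the model to emit.
cat > /tmp/sample_output.json <<'EOF'
{"location": "park", "activity": "bike ride", "animals_seen": 3, "animals": ["puppy", "cat", "raccoon"]}
EOF
# Verify the sample satisfies the schema's required keys, types, and bounds.
python3 - <<'EOF'
import json
d = json.load(open("/tmp/sample_output.json"))
assert set(d) >= {"location", "activity", "animals_seen", "animals"}
assert isinstance(d["animals_seen"], int) and 1 <= d["animals_seen"] <= 5
assert all(isinstance(a, str) for a in d["animals"])
print("schema check passed")
EOF
```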
System Info
Version:
text-generation-launcher 2.4.0
Environment:
Hardware: 4 x A100