device-side assert triggered when trying to use LLaMA 3.2 Vision with grammar #2729

SokolAnn commented Nov 6, 2024

System Info

Version:
text-generation-launcher 2.4.0

Environment:

Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: 0a655a0ab5db15f08e45d8c535e263044b944190
Docker label: sha-0a655a0

Hardware: 4 x A100

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:17:00.0 Off |                    0 |
| N/A   42C    P0             69W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:65:00.0 Off |                    0 |
| N/A   43C    P0             71W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:CA:00.0 Off |                    0 |
| N/A   35C    P0             61W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:E3:00.0 Off |                    0 |
| N/A   34C    P0             64W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Deployment specifics: I am using Apptainer instead of Docker. I don't think this is the cause, since some inference queries complete correctly.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Create a SIF image of the suggested version of TGI:
apptainer pull hf_tgi.sif docker://"ghcr.io/huggingface/text-generation-inference:2.4.0"

Run the meta-llama/Llama-3.2-11B-Vision-Instruct model:
apptainer run --nv --env "HF_TOKEN=$$SECRET$$" --bind ./models:/data:rw hf_tgi.sif --model-id "meta-llama/Llama-3.2-11B-Vision-Instruct" --port 27685 --revision "cee5b78e6faed15d5f2e6d8a654fd5b247c0d5ca"

The model will download, and the web server will spin up.

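Before adding the grammar, a plain request can confirm the server is healthy; normal inference like this works fine on this setup (the prompt is the same one used below, and max_new_tokens is just an illustrative parameter):

curl localhost:27685/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
    "inputs": "I saw a puppy a cat and a raccoon during my bike ride in the park",
    "parameters": {
        "max_new_tokens": 50
    }
}'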
After this, try to use curl to call the model with a grammar:

curl localhost:27685/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
    "inputs": "I saw a puppy a cat and a raccoon during my bike ride in the park",
    "parameters": {
        "repetition_penalty": 1.3,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}'

TGI then fails with a series of device-side assert errors and exits, and curl returns:

{"error":"Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n","error_type":"generation"}
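As the error message itself suggests, passing CUDA_LAUNCH_BLOCKING=1 should make the reported stack trace point at the actual failing kernel. A sketch of the same Apptainer invocation with that variable added (untested on my side; --env can be repeated):

apptainer run --nv \
    --env "HF_TOKEN=$$SECRET$$" \
    --env "CUDA_LAUNCH_BLOCKING=1" \
    --bind ./models:/data:rw \
    hf_tgi.sif \
    --model-id "meta-llama/Llama-3.2-11B-Vision-Instruct" \
    --port 27685 \
    --revision "cee5b78e6faed15d5f2e6d8a654fd5b247c0d5ca"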

Please note that normal inference works both via curl and via the OpenAI-compatible API with the same model on the same machine, so the problem is somehow related to "grammar". Using tools via the OpenAI-compatible API leads to exactly the same error; a sketch of such a call follows below.
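For completeness, the tools request that hits the same assert is shaped roughly like this; the tool name record_sighting and its schema are made up for illustration, and "tgi" is the placeholder model string the OpenAI-compatible endpoint accepts:

curl localhost:27685/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
    "model": "tgi",
    "messages": [
        {"role": "user", "content": "I saw a puppy a cat and a raccoon during my bike ride in the park"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "record_sighting",
            "description": "Record animals seen on a trip",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "animals": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["location", "animals"]
            }
        }
    }],
    "tool_choice": "auto"
}'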

Expected behavior

The model should return JSON output constrained by the provided schema, as in the example in the documentation.
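For reference, a successful grammar-constrained response from /generate should have the usual generated_text shape, with the text itself being valid JSON matching the schema; the field values below are only a plausible example:

{
    "generated_text": "{ \"activity\": \"bike riding\", \"animals\": [\"puppy\", \"cat\", \"raccoon\"], \"animals_seen\": 3, \"location\": \"park\" }"
}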
