launch TGI with the argument `--max-input-tokens` smaller than sliding_window=4096 (got here max_input_tokens=16384) #2730

ashwincv0112 · 2024-11-07T07:31:36Z

System Info

We are using an EC2 instance with a T4 GPU in AWS (g4dn.2xlarge) to deploy our fine-tuned model.

Information

Docker
The CLI directly

Tasks

An officially supported command
My own modifications

Reproduction

We are trying to deploy our fine-tuned version of the starcoder2-3B model to an EC2 instance (T4 GPU).

We are using the following command to deploy the model:

export volume=$PWD/data && docker run --gpus all -d -p 8080:80 -v $volume:/data -e USE_FLASH_ATTENTION=false ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id=/data/ --max-input-tokens 16384 --max-total-tokens 26048

The deployment fails with the following error message:

ValueError: The backend cuda does not support sliding window attention that is used by the model type starcoder2. To use this model nonetheless with the cuda backend, please launch TGI with the argument `--max-input-tokens` smaller than sliding_window=4096 (got here max_input_tokens=16384).
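For reference, the constraint stated in this error can be checked before launching the container. The following is a minimal sketch, not TGI's actual validation code. It assumes the model's `config.json` exposes a `sliding_window` field (the error above reports 4096) and reads the message's "smaller than" as a strict bound. The helper names `max_allowed_input_tokens` and `load_config` are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def max_allowed_input_tokens(config: dict) -> Optional[int]:
    """Return the largest --max-input-tokens value that is strictly smaller
    than the model's sliding window, or None if no window is configured.

    Assumption: the window size is stored under the "sliding_window" key.
    """
    sliding_window = config.get("sliding_window")
    if sliding_window is None or sliding_window <= 0:
        # No sliding window configured, so this check imposes no limit.
        return None
    return sliding_window - 1


def load_config(model_dir: str) -> dict:
    """Read config.json from a local model directory (e.g. the mounted volume)."""
    return json.loads((Path(model_dir) / "config.json").read_text())


# Example with the value reported in the error message.
example_config = {"model_type": "starcoder2", "sliding_window": 4096}
limit = max_allowed_input_tokens(example_config)
print(f"--max-input-tokens must be at most {limit}")  # prints 4095
```

Under these assumptions, the launch would pass with `--max-input-tokens 4095` and a `--max-total-tokens` value above that. Whether a 16384-token input can be supported on this backend is the open question of this issue.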

Could you please let us know if there is any way to change the `max_input_tokens` parameter value while deploying the model?

Thanks.

Expected behavior

We should be able to set the `--max-input-tokens` parameter to 16384 when deploying the model.
