Waiting on #2676 to validate whether this is a prefix caching issue, but I have confirmed with LOG_LEVEL=debug that the exact same params and input produce different results with seed set.
After disabling prefix caching, I seem to be getting the same response across different machines.
sam-ulrich1 changed the title from "Getting 2 different responses from the same HTTP call with seed set depending on what machine calls" to "Prefix caching causes 2 different responses from the same HTTP call with seed set depending on what machine calls" on Oct 23, 2024
System Info
Docker image tag 2.3.1, running on an NVIDIA 4090 on Ubuntu 20.04
Information
Tasks
Reproduction
I dumped the raw HTTP request that my server sends and replayed it from my personal machine against the same TGI server, and I get 2 different responses. I dumped the raw HTTP calls because, after validating the payload, my only remaining thought was headers, but no headers are included aside from Content-Type and Content-Length. I am using fasthttp in Go to make the call. The current example isn't the best since the responses are close; normally I get garbage from the server call and good-quality output from the local machine call. I tried rolling back to v2.2.0 to rule out prefix caching, but the Qwen model is not supported there. Is it possible to disable prefix caching to test?
Server
Response
Local
Response
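For anyone who wants to try reproducing this, here is a rough sketch of the kind of check described above. It is not the code from my server: it uses Go's standard net/http instead of fasthttp, the prompt and parameter values are placeholders, and the TGI_URL environment variable and the helper names (buildPayload, generate, firstDivergence) are made up for illustration. It sends byte-identical payloads with a fixed seed to /generate twice and reports where the outputs first differ. Running it from each machine and comparing the printed text should show whether the caller matters.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// generateRequest mirrors the JSON body accepted by TGI's /generate endpoint.
type generateRequest struct {
	Inputs     string         `json:"inputs"`
	Parameters generateParams `json:"parameters"`
}

type generateParams struct {
	Seed         uint64 `json:"seed"`
	MaxNewTokens int    `json:"max_new_tokens"`
	DoSample     bool   `json:"do_sample"`
}

type generateResponse struct {
	GeneratedText string `json:"generated_text"`
}

// buildPayload serializes the request once so every call sends identical bytes.
func buildPayload(prompt string, seed uint64, maxNewTokens int) ([]byte, error) {
	return json.Marshal(generateRequest{
		Inputs:     prompt,
		Parameters: generateParams{Seed: seed, MaxNewTokens: maxNewTokens, DoSample: true},
	})
}

// generate posts the payload and returns the generated_text field.
func generate(client *http.Client, url string, payload []byte) (string, error) {
	resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("status %d: %s", resp.StatusCode, body)
	}
	var out generateResponse
	if err := json.Unmarshal(body, &out); err != nil {
		return "", err
	}
	return out.GeneratedText, nil
}

// firstDivergence returns the byte index where a and b first differ, or -1 if equal.
func firstDivergence(a, b string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	base := os.Getenv("TGI_URL") // hypothetical variable, e.g. http://localhost:8080
	if base == "" {
		fmt.Println("set TGI_URL to run the comparison")
		return
	}
	payload, err := buildPayload("What is deep learning?", 42, 32) // placeholder values
	if err != nil {
		panic(err)
	}
	client := &http.Client{Timeout: 60 * time.Second}
	first, err := generate(client, base+"/generate", payload)
	if err != nil {
		panic(err)
	}
	second, err := generate(client, base+"/generate", payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("first:  %q\nsecond: %q\n", first, second)
	if i := firstDivergence(first, second); i >= 0 {
		fmt.Printf("outputs diverge at byte %d\n", i)
	} else {
		fmt.Println("outputs identical")
	}
}
```

Because the second call repeats the first prompt, a difference between the two outputs on a single machine may also hint at a caching effect, though this sketch alone cannot confirm the cause.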
Expected behavior
Same response from the same call to the same TGI server, regardless of the machine