Performance of streaming requests is worse than non-streaming #2613
Labels: bug, Investigating, Performance, triaged
System Info
CPU x86_64
GPU NVIDIA H20
TensorRT branch: v0.13.0
NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2
Who can help?
@kaiyux
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
We enabled KV-cache-reuse and expected a large performance improvement, but did not see one. We use our own stress-testing tool to send requests to the streaming interface, `v2/models/ensemble/generate_stream`.
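For reference, a minimal sketch of how a single streaming request against this endpoint could be timed. It is not our stress-testing tool. It assumes the endpoint returns server-sent events as `data: {...}` lines and that the ensemble accepts `text_input`, `max_tokens`, and `stream` fields; the helper names and the base URL are hypothetical.

```python
import json
import time
import urllib.request


def parse_sse_line(raw: bytes):
    """Return the JSON payload of an SSE 'data:' line, or None for any other line."""
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())


def time_stream(lines, start, clock=time.perf_counter):
    """Consume raw SSE lines and return (time_to_first_chunk, total_time, chunk_count).

    Times are in seconds, measured relative to `start`.
    """
    first = None
    chunks = 0
    for raw in lines:
        if parse_sse_line(raw) is None:
            continue  # skip blank separator lines and non-data fields
        chunks += 1
        if first is None:
            first = clock() - start
    return first, clock() - start, chunks


def stream_request(base_url, prompt, max_tokens=64, timeout_s=60.0):
    """Send one streaming request and time it (field names are assumptions)."""
    body = json.dumps(
        {"text_input": prompt, "max_tokens": max_tokens, "stream": True}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v2/models/ensemble/generate_stream",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()  # include connection and queueing time
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return time_stream(resp, start)
```

Calling `stream_request("http://localhost:8000", "Hello")` (hypothetical address) returns the time to the first chunk, the total time, and the number of chunks, which helps separate slow first-token latency from slow per-chunk delivery.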
Expected behavior
With KV-cache-reuse enabled, we expect performance to improve by 30%.
actual behavior
In non-streaming mode, performance is normal.
In streaming mode, however, many requests time out.
We have also observed that the postprocess phase takes a very long time.
Could this be caused by decoding taking too long?
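One way to check where the time goes is Triton's per-model statistics endpoint. The sketch below is only an illustration: it assumes the server exposes `v2/models/<model>/stats` with the usual `model_stats` / `inference_stats` layout (cumulative `count` and `ns` per phase) and that the postprocess step is a model named `postprocessing`; the helper names and base URL are hypothetical.

```python
import json
import urllib.request

PHASES = ("queue", "compute_input", "compute_infer", "compute_output")


def avg_ms(stat):
    """Average duration in milliseconds from a {'count': n, 'ns': total} entry."""
    count = stat.get("count", 0)
    return stat.get("ns", 0) / count / 1e6 if count else 0.0


def summarize(stats):
    """Map each model name to its average per-phase time in milliseconds."""
    summary = {}
    for model in stats.get("model_stats", []):
        inference = model.get("inference_stats", {})
        summary[model["name"]] = {
            phase: avg_ms(inference.get(phase, {})) for phase in PHASES
        }
    return summary


def fetch_stats(base_url, model="postprocessing"):
    """Fetch cumulative statistics for one model (model name is an assumption)."""
    url = f"{base_url}/v2/models/{model}/stats"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Running `summarize(fetch_stats("http://localhost:8000"))` (hypothetical address) before and after a test run, in both streaming and non-streaming mode, would show whether the postprocess model is slower per call or simply invoked many more times when streaming.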
additional notes
Please help us analyze this problem.
Thanks so much.