
Performance of streaming requests is worse than non-streaming #2613

Open · 2 of 4 tasks

activezhao opened this issue Dec 24, 2024 · 0 comments
Labels: bug (Something isn't working) · Investigating · Performance (Issue about performance number) · triaged (Issue has been triaged by maintainers)

activezhao commented Dec 24, 2024

System Info

CPU x86_64

GPU NVIDIA H20

TensorRT branch: v0.13.0

NVIDIA-SMI 535.183.01, Driver Version: 535.183.01, CUDA Version: 12.2

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

We have enabled KV-cache-reuse and expected a big performance improvement, but did not see one.

We use our own stress testing tool to drive the streaming interface, v2/models/ensemble/generate_stream.
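For context, generate_stream responds with Server-Sent Events, one `data:` line per token chunk. Below is a minimal sketch of how a client can assemble the streamed chunks; the field name `text_output` is taken from the Triton generate endpoint's default response shape and may differ depending on the model configuration:

```python
import json

def parse_sse(raw: str):
    """Collect the JSON payload of every 'data:' line in an SSE response."""
    out = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            out.append(json.loads(line[len("data:"):].strip()))
    return out

# Example response body with two streamed chunks (field names assumed).
sample = ('data: {"text_output": "Hello"}\n\n'
          'data: {"text_output": " world"}\n\n')
chunks = parse_sse(sample)
text = "".join(c["text_output"] for c in chunks)
```

In a real benchmark the raw SSE text would come from an HTTP client reading the response incrementally; per-chunk arrival times are what the speeds in the logs below are computed from.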

  1. At about 1 req/s, performance is normal:
2024-12-24T15:02:32+08:00 [info] start testing at concurrency 60
2024-12-24T15:02:35+08:00 [info] stream completion tokens: 128, time spent: 1.2s, speed: 110.5tokens/s
2024-12-24T15:02:36+08:00 [info] stream completion tokens: 128, time spent: 1.2s, speed: 109.6tokens/s
...
2024-12-24T15:03:30+08:00 [info] stream completion tokens: 128, time spent: 1.1s, speed: 111.9tokens/s
2024-12-24T15:03:31+08:00 [info] stream completion tokens: 128, time spent: 1.2s, speed: 109.3tokens/s
2024-12-24T15:03:32+08:00 [info] stream completion tokens: 128, time spent: 1.1s, speed: 112.0tokens/s
2024-12-24T15:03:32+08:00 [info] waiting for all goroutines to finish...
2024-12-24T15:03:33+08:00 [info] stream completion tokens: 128, time spent: 1.2s, speed: 109.7tokens/s
2024-12-24T15:03:33+08:00 [debug] [time: 60.174s, total: 59, success: 59, fail: 0] | Server: [ 121.431 tokens/s, 0.980 req/s ] | Client: 108.274 tokens/s | Stream thresholds: 50% | MaxStreamSpeed: 109.7 tokens/s | Prompt length: 2000
  2. At about 2 req/s, performance is much worse, and the time spent per request keeps increasing:
2024-12-24T15:04:03+08:00 [info] start testing at concurrency 120
2024-12-24T15:04:05+08:00 [info] stream completion tokens: 128, time spent: 1.9s, speed: 67.9tokens/s
2024-12-24T15:04:06+08:00 [info] stream completion tokens: 128, time spent: 2.4s, speed: 53.9tokens/s
2024-12-24T15:04:07+08:00 [info] stream completion tokens: 128, time spent: 2.8s, speed: 45.1tokens/s
2024-12-24T15:04:08+08:00 [info] stream completion tokens: 128, time spent: 3.2s, speed: 39.5tokens/s
2024-12-24T15:04:09+08:00 [info] stream completion tokens: 128, time spent: 3.7s, speed: 34.9tokens/s
2024-12-24T15:04:10+08:00 [info] stream completion tokens: 128, time spent: 4.0s, speed: 31.9tokens/s
2024-12-24T15:04:10+08:00 [info] stream completion tokens: 128, time spent: 4.1s, speed: 31.5tokens/s
2024-12-24T15:04:12+08:00 [info] stream completion tokens: 128, time spent: 4.5s, speed: 28.4tokens/s
2024-12-24T15:04:13+08:00 [info] stream completion tokens: 128, time spent: 5.1s, speed: 25.3tokens/s
2024-12-24T15:04:14+08:00 [info] stream completion tokens: 128, time spent: 5.5s, speed: 23.3tokens/s
2024-12-24T15:04:15+08:00 [info] stream completion tokens: 128, time spent: 6.0s, speed: 21.4tokens/s
2024-12-24T15:04:16+08:00 [info] stream completion tokens: 128, time spent: 6.5s, speed: 19.8tokens/s
2024-12-24T15:04:16+08:00 [info] stream completion tokens: 128, time spent: 6.8s, speed: 18.7tokens/s
2024-12-24T15:04:17+08:00 [info] stream completion tokens: 128, time spent: 7.2s, speed: 17.7tokens/s
2024-12-24T15:04:19+08:00 [info] stream completion tokens: 128, time spent: 7.9s, speed: 16.2tokens/s
2024-12-24T15:04:19+08:00 [info] stream completion tokens: 128, time spent: 8.4s, speed: 15.3tokens/s
2024-12-24T15:04:20+08:00 [info] stream completion tokens: 128, time spent: 8.8s, speed: 14.5tokens/s
2024-12-24T15:04:22+08:00 [info] stream completion tokens: 128, time spent: 9.4s, speed: 13.6tokens/s
2024-12-24T15:04:22+08:00 [info] stream completion tokens: 128, time spent: 9.8s, speed: 13.1tokens/s
2024-12-24T15:04:23+08:00 [info] stream completion tokens: 128, time spent: 10.3s, speed: 12.5tokens/s
2024-12-24T15:04:24+08:00 [info] stream completion tokens: 128, time spent: 10.8s, speed: 11.8tokens/s
2024-12-24T15:04:25+08:00 [info] stream completion tokens: 128, time spent: 11.3s, speed: 11.4tokens/s
2024-12-24T15:04:26+08:00 [info] stream completion tokens: 128, time spent: 11.8s, speed: 10.8tokens/s
2024-12-24T15:04:27+08:00 [info] stream completion tokens: 128, time spent: 12.2s, speed: 10.5tokens/s
2024-12-24T15:04:28+08:00 [info] stream completion tokens: 128, time spent: 12.6s, speed: 10.2tokens/s
2024-12-24T15:04:29+08:00 [info] stream completion tokens: 128, time spent: 13.1s, speed: 9.8tokens/s
2024-12-24T15:04:30+08:00 [info] stream completion tokens: 128, time spent: 13.6s, speed: 9.4tokens/s
2024-12-24T15:04:31+08:00 [info] stream completion tokens: 128, time spent: 14.0s, speed: 9.2tokens/s
2024-12-24T15:04:32+08:00 [info] stream completion tokens: 128, time spent: 14.3s, speed: 8.9tokens/s
2024-12-24T15:04:33+08:00 [info] stream completion tokens: 128, time spent: 14.7s, speed: 8.7tokens/s
2024-12-24T15:04:34+08:00 [info] stream completion tokens: 128, time spent: 15.2s, speed: 8.4tokens/s
2024-12-24T15:04:35+08:00 [info] stream completion tokens: 128, time spent: 15.7s, speed: 8.2tokens/s
2024-12-24T15:04:36+08:00 [info] stream completion tokens: 128, time spent: 16.3s, speed: 7.8tokens/s
2024-12-24T15:04:37+08:00 [info] stream completion tokens: 128, time spent: 16.6s, speed: 7.7tokens/s

Expected behavior

We expected that enabling KV-cache-reuse would improve performance by roughly 30%.
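For reference, KV-cache reuse in the TensorRT-LLM Triton backend generally has to be enabled in two places: at engine build time and in the backend model config. A sketch, assuming a v0.13-style setup (exact flag and parameter names may differ between versions, so please treat this as an assumption to verify against the docs):

```
# engine build: paged context FMHA is required for kv-cache reuse
trtllm-build --checkpoint_dir ckpt/ --output_dir engine/ \
    --use_paged_context_fmha enable

# tensorrt_llm/config.pbtxt
parameters: {
  key: "enable_kv_cache_reuse"
  value: { string_value: "true" }
}
```

If either half is missing, requests still run, but cached KV blocks are never matched, which would explain seeing no improvement.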

Actual behavior

In non-streaming mode, performance is normal.

In streaming mode, however, many requests time out.

In addition, we observed that the postprocessing phase takes a very long time.

Could this be caused by decoding taking too much time?
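One way to sanity-check the postprocessing hypothesis: if the detokenizer re-decodes the entire accumulated sequence on every streamed token, per-request decode cost grows quadratically with output length, and the gap widens with concurrency. A minimal sketch of that comparison; the `stub` decode function is hypothetical and stands in for a real detokenizer:

```python
import time

def time_postprocess(decode, token_ids):
    """Compare decoding every growing prefix (a common streaming
    postprocess pattern) against decoding once at the end."""
    t0 = time.perf_counter()
    for i in range(1, len(token_ids) + 1):
        decode(token_ids[:i])          # full re-decode on every streamed token
    incremental = time.perf_counter() - t0

    t0 = time.perf_counter()
    decode(token_ids)                  # one-shot decode of the final output
    one_shot = time.perf_counter() - t0
    return incremental, one_shot

# Stub decode: joins token ids as text; a real tokenizer.decode is heavier.
stub = lambda ids: " ".join(map(str, ids))
inc, once = time_postprocess(stub, list(range(2000)))
```

Running the same harness with the actual tokenizer used by the postprocessing model would show whether decode time alone accounts for the slowdown we see.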

[Image attached]

Additional notes

Please help us analyze this problem.

Thanks so much.

activezhao added the bug label on Dec 24, 2024
nv-guomingz added the Performance label on Dec 24, 2024
github-actions bot added the triaged and Investigating labels on Dec 24, 2024