
Remove threadsafe #2907

Merged: 17 commits merged into InternLM:main on Jan 3, 2025

Conversation

@grimoire (Collaborator) commented Dec 17, 2024:

  • Thread-safe mode has been removed.
  • asyncio.Queue -> asyncio.Event
  • Better host performance

Note that EOS would be output in this PR.
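
As a rough illustration of the second bullet (hypothetical names only, not the engine's actual code), replacing a per-item `asyncio.Queue` handshake with an `asyncio.Event` plus shared state removes queue bookkeeping from the host loop:

```python
import asyncio

# Hypothetical sketch of the asyncio.Queue -> asyncio.Event change; names and
# structure are illustrative only, not the engine's actual implementation.
class OutputSlot:
    """Producer stores the latest outputs and sets an event; consumer waits on it."""

    def __init__(self):
        self._event = asyncio.Event()
        self._outputs = None

    def put(self, outputs):
        self._outputs = outputs
        self._event.set()            # wake the waiting coroutine

    async def get(self):
        await self._event.wait()     # no per-item queue allocation or locking
        self._event.clear()
        return self._outputs
```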

@lvhan028 (Collaborator):

We have users who run the PyTorch engine in a multi-threaded environment.
Please provide a guide for them on migrating to the non-threadsafe PyTorch engine.

@lvhan028 (Collaborator):

Add a WARNING that thread-safe mode has been removed.

@lvhan028 (Collaborator):

"Better host performance", so what's the performance now?

@grimoire (Collaborator, Author):

> "Better host performance", so what's the performance now?

llama3-8b, tp=1, 3000 prompt, 256 concurrency

concurrency: 256
elapsed_time: 133.107s

first token latency(s)(min, max, ave): 0.119, 4.574, 0.621
per-token latency(s) percentile(50, 75, 95, 99): [0.028, 0.03, 0.284, 0.47]

number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 4602.956 token/s
token throughput (prompt + completion token): 9687.436 token/s
RPS (request per second): 22.538 req/s
RPM (request per minute): 1352.297 req/min

llama3-8b, tp=1, 10000 prompt, 512 concurrency

concurrency: 512
elapsed_time: 386.856s

first token latency(s)(min, max, ave): 0.259, 7.529, 0.823
per-token latency(s) percentile(50, 75, 95, 99): [0, 0.055, 0.894, 1.138]

number of prompt tokens: 2238358
number of completion tokens: 1995438
token throughput (completion token): 5158.094 token/s
token throughput (prompt + completion token): 10944.123 token/s
RPS (request per second): 25.849 req/s
RPM (request per minute): 1550.966 req/min

@lvhan028 (Collaborator):

> "Note that EOS would be output in this PR."

@lzhangzz, will the tm (TurboMind) refactoring you are working on output the EOS and stop_token_id to async_engine?

@RunningLeon (Collaborator) left a review:


LGTM

@lzhangzz (Collaborator):

> "Note that EOS would be output in this PR." @lzhangzz, will the tm (TurboMind) refactoring you are working on output the EOS and stop_token_id to async_engine?

We need to discuss how EOS/stop_token_ids should be skipped in the async engine.

For some models, EOS is part of their chat template; we may exclude the token from the response, but the step should not be rewound (i.e. the token is kept in the kv cache).

However, for models like vicuna, EOS must be excluded from both the response and the kv cache (rewind the step to the token before EOS).
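
A hedged sketch of the two policies described above (the function and the `rewind_step` callback are hypothetical names, not the async engine's API):

```python
# Hypothetical illustration of the two EOS policies discussed above;
# not the engine's actual code.
def strip_eos(token_ids, eos_token_id, keep_eos_in_cache, rewind_step):
    """Drop EOS from the response; optionally rewind so it leaves the kv cache too.

    rewind_step: hypothetical callback that rolls the session back n tokens.
    """
    if token_ids and token_ids[-1] == eos_token_id:
        token_ids = token_ids[:-1]       # EOS never appears in the response text
        if not keep_eos_in_cache:
            rewind_step(1)               # vicuna-style: EOS is evicted from the kv cache
    return token_ids
```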

Review thread on the docs diff, at these lines:

print(output[1].text)

If you do need multithreading, it would be easy to wrap it like below:
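
(The snippet that "like below" refers to lives in the reviewed docs and is not reproduced in this thread. As an illustration only, one generic way to wrap an async-only API for multithreaded callers is to run a single event loop in a dedicated thread; `AsyncRunner` and `async_infer` are placeholders, not lmdeploy's documented recipe.)

```python
import asyncio
import threading

# Illustration only: generic pattern for calling an async-only API from worker
# threads. `async_infer` is a placeholder coroutine, not lmdeploy's actual API.
class AsyncRunner:
    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # Thread-safe: schedules the coroutine on the loop thread and blocks for the result.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

async def async_infer(prompt: str) -> str:
    await asyncio.sleep(0)               # stand-in for real async inference
    return f'response to: {prompt}'

runner = AsyncRunner()
print(runner.run(async_infer('hello')))  # callable from any worker thread
```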
A collaborator commented:
@lzhangzz After PR #2968, can users use the pipeline API in multithreaded code?

Review thread on the diff, at these lines:

    adapter_name: Optional[str] = None,
    use_tqdm: bool = False,
    **kwargs):
async def async_batch_infer(
A collaborator asked:

Is this API purely for multithread migration?

@grimoire (Collaborator, Author) replied:

Yes

@lvhan028 merged commit aabc90d into InternLM:main on Jan 3, 2025. 5 checks passed.
@chengyuma:

Could I ask what the reason for the performance improvement in this PR is? Is it because the Queue was replaced with asyncio.Queue?

@grimoire (Collaborator, Author):

@chengyuma The key is using coroutines to overlap CPU and GPU computation and avoid the GPU waiting.

@chengyuma:

> @chengyuma The key is using coroutines to overlap CPU and GPU computation and avoid the GPU waiting.

Multithreading can also avoid GPU waiting. What is the advantage of coroutines in this scenario: better scheduling, or lighter-weight scheduling?

@grimoire (Collaborator, Author):

@chengyuma Python multithreading is subject to the GIL, so it is not truly parallel.

@chengyuma:

> @chengyuma Python multithreading is subject to the GIL, so it is not truly parallel.

I know that, but coroutines are not parallel either.

@grimoire (Collaborator, Author):

@chengyuma Coroutines are controllable: you can make sure enough kernels have been launched before an await switches tasks. Threads are much less controllable.
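
A tiny sketch of that point (illustrative only, no real GPU work): with coroutines, a task yields only at an explicit await, so all kernel launches for a step can be enqueued before control switches, whereas an OS thread may be preempted at any instruction.

```python
import asyncio

# Illustrative only: coroutine switching happens only at the explicit await, so
# each request enqueues all of its "kernel launches" before yielding, letting
# another request's host-side work overlap with GPU execution.
async def forward_step(name: str, num_kernels: int = 4):
    for i in range(num_kernels):
        # stand-in for a non-blocking CUDA kernel launch
        print(f'{name}: launch kernel {i}')
    await asyncio.sleep(0)   # the only point where the scheduler may switch tasks

async def main():
    await asyncio.gather(forward_step('req-0'), forward_step('req-1'))

asyncio.run(main())
```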

@chengyuma:

> @chengyuma Coroutines are controllable: you can make sure enough kernels have been launched before an await switches tasks. Threads are much less controllable.

Got it, that makes sense. Thanks!
