Remove threadsafe #2907
Conversation
We have users who use the PyTorch engine in a multi-threaded environment. Add a WARNING that ...
"Better host performance", so what's the performance now? |
llama3-8b, tp=1, 3000 prompts, 256 concurrency
llama3-8b, tp=1, 10000 prompts, 512 concurrency
"Note that EOS would be output in this PR." |
LGTM
We need to discuss how EOS/stop_token_ids should be skipped in the async engine. For some models, EOS is part of their chat template, so we may exclude the token from the response only. However, for models like vicuna, EOS must be excluded from both the response and the kv cache (i.e., the cache has to be rewound).
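For concreteness, a minimal sketch of that distinction, with entirely hypothetical names (this is not the engine's actual code): always trim the stop token from the response, and additionally rewind it out of the kv cache for vicuna-style models whose template does not keep EOS.

```python
def finalize_step(output_ids, stop_token_ids, eos_in_template):
    """Hypothetical post-processing of one finished generation.

    Returns the token ids to surface to the user and how many tokens
    to rewind from the kv cache.
    """
    if output_ids and output_ids[-1] in stop_token_ids:
        response_ids = output_ids[:-1]        # never show EOS/stop tokens
        rewind = 0 if eos_in_template else 1  # vicuna-like: drop from cache too
        return response_ids, rewind
    return output_ids, 0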
```
print(output[1].text)
```
If you do need multithreading, it would be easy to wrap it like below:
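The wrapper from the docs diff is not visible in this conversation; below is a minimal sketch of the standard pattern, assuming an asyncio-based engine: run one event loop in a dedicated thread and submit coroutines to it with `asyncio.run_coroutine_threadsafe`. The `ThreadSafeWrapper` name and the toy `greet` coroutine are illustrative, not lmdeploy API.

```python
import asyncio
import threading


class ThreadSafeWrapper:
    """Run a single asyncio loop in a background thread so that any
    number of caller threads can submit coroutines to it safely."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()

    def submit(self, coro):
        # run_coroutine_threadsafe hands a coroutine to a loop running
        # in another thread; block the calling thread for the result.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result()


# Toy usage; in practice the coroutine would be e.g. engine.generate(...).
async def greet(name: str) -> str:
    await asyncio.sleep(0.1)
    return f"hello {name}"


wrapper = ThreadSafeWrapper()
print(wrapper.submit(greet("world")))  # safe to call from any thread
```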
lmdeploy/serve/async_engine.py (outdated)
```python
    adapter_name: Optional[str] = None,
    use_tqdm: bool = False,
    **kwargs):

async def async_batch_infer(
```
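For context, a hedged usage sketch: only `async_batch_infer`, `use_tqdm`, and `adapter_name` come from the diff above; the `AsyncEngine` constructor arguments and the prompt are assumptions for illustration.

```python
import asyncio

from lmdeploy.serve.async_engine import AsyncEngine


async def main():
    # Constructor arguments are assumed, not taken from this PR.
    engine = AsyncEngine(model_path='internlm/internlm2-chat-7b')
    outputs = await engine.async_batch_infer(
        ['Hi, please introduce yourself'], use_tqdm=False)
    print(outputs[0].text)


asyncio.run(main())
```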
Is this API purely for migrating multithreaded use cases?
Yes
Could you explain where the performance improvement in this PR comes from? Is it because the ...
@chengyuma The key is using coroutines to overlap CPU and GPU computation, so the GPU never sits idle waiting.
But multithreading can also keep the GPU from waiting. What is the advantage of coroutines in this scenario? Better scheduling, or lighter-weight scheduling?
@chengyuma Python threads are subject to the GIL, so they are not truly parallel.
I know that, but coroutines are not parallel either.
@chengyuma Coroutines are controllable: you can guarantee that enough kernels have been launched before you await and switch tasks. Threads are far less controllable.
I see, that makes sense. Thanks!
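To make the scheduling point concrete, here is a toy asyncio sketch (nothing in it is lmdeploy code; the sleep stands in for queued GPU kernels): each task finishes its CPU-side work before awaiting, so while one task waits on the "GPU", the others run their CPU phase, and eight requests finish in roughly one request's latency.

```python
import asyncio
import time


async def handle_request(prompt: str) -> str:
    inputs = prompt.upper()        # stands in for CPU preprocessing
    gpu_done = asyncio.sleep(0.1)  # stands in for already-queued GPU work
    await gpu_done                 # yield only now; other tasks do CPU work
    return inputs


async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        *(handle_request(f"p{i}") for i in range(8)))
    # Prints ~0.1s, not 0.8s: the waits overlap instead of serializing.
    print(results, f"{time.perf_counter() - start:.2f}s")


asyncio.run(main())
```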