Torchrun launching multiple api_server #2402
Conversation
docs/en/llm/api_server.md
Outdated
@@ -249,6 +249,33 @@ curl http://{server_ip}:{server_port}/v1/chat/interactive \
lmdeploy serve gradio api_server_url --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
```

## Launch multiple api servers

The following is one possible way to launch multiple api servers through torchrun. Create a Python script with the code below.
Hi @AllentDan, in what scenarios would this feature be used?
Some researchers tend to use torchrun.
Conflicts: lmdeploy/serve/openai/api_server.py
fi
# Launch torchrun and put it in the background
# To emphasize again: in a multi-node setup there is no need to pass --nnodes, --master-addr, etc.; it is equivalent to running a single-node torchrun once on each machine.
torchrun \
Is it run manually once on each node?
No, the cluster scheduler automatically runs this script on every node.
2. Launch the torchrun script: `torchrun --nproc_per_node 2 script.py InternLM/internlm2-chat-1_8b --proxy_url http://{proxy_node_name}:{proxy_node_port}`. **Note**: in multi-node multi-GPU setups, do not use the default url `0.0.0.0:8000`; pass an address with the real IP instead, e.g. `11.25.34.55:8000`. In the multi-node case, no communication between child nodes is needed, so there is no need to specify torchrun parameters such as `--nnodes`; it is enough to ensure that each node runs a single-node torchrun once.

```python
import os
```
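The Python snippet above is truncated in this diff view. As an illustration only (not the PR's actual script), here is a minimal sketch of the idea: each torchrun rank picks its own port and registers its api_server with the proxy. The import path and the `server_name`/`server_port`/`proxy_url` parameters of `serve` are assumptions based on this discussion, not guaranteed API.

```python
# Sketch: one api_server per torchrun rank, each registering with a proxy.
import os
import socket


def rank_server_port(base_port: int, local_rank: int) -> int:
    """Give every local rank its own port: base_port + local_rank."""
    return base_port + local_rank


def main(model_path: str, proxy_url: str, base_port: int = 23333) -> None:
    # torchrun sets LOCAL_RANK for every process it spawns
    local_rank = int(os.environ['LOCAL_RANK'])
    # use the node's real IP, not the default 0.0.0.0
    server_name = socket.gethostbyname(socket.gethostname())
    server_port = rank_server_port(base_port, local_rank)

    # assumed import path and keyword arguments (see hedge above)
    from lmdeploy.serve.openai.api_server import serve
    serve(model_path,
          server_name=server_name,
          server_port=server_port,
          proxy_url=proxy_url)


if __name__ == '__main__' and 'LOCAL_RANK' in os.environ:
    import sys
    main(sys.argv[1], sys.argv[2])
```

With `--nproc_per_node 2`, rank 0 serves on `base_port` and rank 1 on `base_port + 1`, so the two servers never collide on one machine.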
Could this part of the script be moved into the CLI?
torchrun --nproc_per_node 2 lmdeploy serve node <model_path> --proxy-url ip:port
That doesn't seem feasible. Besides, providing the script directly also makes it easier for users to customize it for their own needs.
source /path/to/your/home/miniconda3/bin/activate /path/to/your/home/miniconda3/envs/your_env
export HOME=/path/to/your/home
# Get the master node's IP address (assuming MLP_WORKER_0_HOST holds the master node's IP)
MASTER_IP=${MLP_WORKER_0_HOST}
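Pieced together from the fragments quoted in this review, a launcher for a scheduler that injects `MLP_WORKER_0_HOST` / `MLP_ROLE_INDEX` (as on Volcano Engine) might look roughly like this. The port number and the `lmdeploy serve proxy` invocation are illustrative assumptions, not the PR's verbatim script:

```shell
#!/bin/bash
# Illustrative cluster launcher; MLP_* variables are assumed to be injected
# by the scheduler (defaults below only make the script safe to dry-run).
MASTER_IP=${MLP_WORKER_0_HOST:-127.0.0.1}
PROXY_PORT=8000
PROXY_URL="http://${MASTER_IP}:${PROXY_PORT}"
echo "proxy url: ${PROXY_URL}"

# Only the rank-0 node starts the proxy; every node runs its own torchrun.
if [ "${MLP_ROLE_INDEX:-0}" -eq 0 ] && command -v lmdeploy >/dev/null; then
    lmdeploy serve proxy --server-name "${MASTER_IP}" --server-port "${PROXY_PORT}" &
fi

# No --nnodes / --master-addr needed: each node does a single-node torchrun.
if command -v torchrun >/dev/null; then
    torchrun --nproc_per_node 2 script.py InternLM/internlm2-chat-1_8b \
        --proxy_url "${PROXY_URL}" &
    wait
fi
```

The scheduler runs this same script once per node, which matches the "single-node torchrun per machine" point made above.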
Are MLP_WORKER_0_HOST and MLP_ROLE_INDEX environment variables on Volcano Engine?
Yes.
LGTM