
Torchrun launching multiple api_server #2402

Merged 10 commits into InternLM:main on Dec 26, 2024
Conversation

AllentDan
Collaborator

No description provided.

@@ -249,6 +249,33 @@ curl http://{server_ip}:{server_port}/v1/chat/interactive \
lmdeploy serve gradio api_server_url --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
```

## Launch multiple api servers

The following is one possible way to launch multiple api servers through torchrun. Just create a Python script with the code below.
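For orientation, here is a hedged sketch of what such a script could look like. It is not the script merged in this PR: the base port `23333`, the per-rank port offset, and pinning one GPU per rank via `CUDA_VISIBLE_DEVICES` are illustrative assumptions, and `--proxy-url` is assumed to be accepted by `lmdeploy serve api_server`.

```python
import os
import socket
import subprocess
import sys


def rank_port(base_port: int, local_rank: int) -> int:
    # Each local rank gets its own HTTP port: 23333, 23334, ...
    return base_port + local_rank


def main() -> None:
    model_path, proxy_url = sys.argv[1], sys.argv[2]
    # torchrun exports LOCAL_RANK for every worker it spawns.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    port = rank_port(23333, local_rank)
    # Pin one GPU per worker and start an independent api_server that
    # registers itself with the proxy.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(local_rank))
    subprocess.run(
        ["lmdeploy", "serve", "api_server", model_path,
         "--server-name", socket.gethostname(),
         "--server-port", str(port),
         "--proxy-url", proxy_url],
        env=env, check=True)


if __name__ == "__main__" and len(sys.argv) >= 3:
    main()
```

Run on each node as `torchrun --nproc_per_node <num_gpus> script.py <model_path> <proxy_url>`; each spawned rank then serves on its own port.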
Collaborator

Hi @AllentDan In what scenarios would this feature be used?

Collaborator Author

Some researchers tend to use torchrun.

Conflicts:
	lmdeploy/serve/openai/api_server.py
@lvhan028 lvhan028 self-requested a review December 12, 2024 04:48
fi
# Start torchrun and put it in the background
# To reiterate: in a multi-node setup there is no need to pass --nnodes, --master-addr, etc.; it is equivalent to running a single-node torchrun once on each machine.
torchrun \
Collaborator

Is it run manually on every node?

Collaborator Author

No. The cluster scheduler automatically runs this script on every node.

2. Launch the script with torchrun: `torchrun --nproc_per_node 2 script.py InternLM/internlm2-chat-1_8b --proxy_url http://{proxy_node_name}:{proxy_node_port}`. **Note**: in a multi-node, multi-GPU setup, do not use the default URL `0.0.0.0:8000`; pass the address of a real IP instead, e.g. `11.25.34.55:8000`. In the multi-node case, since the child nodes do not need to communicate with each other, the user does not need to specify torchrun parameters such as `--nnodes`; it is enough to make sure each node runs a single-node torchrun once.
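Regarding the note above about not registering `0.0.0.0`, a small hedged helper like the following can look up a host's reachable address; the UDP-connect trick and the `8.8.8.8` probe target are illustrative conventions, not part of this PR.

```python
import socket


def real_ip() -> str:
    """Best-effort lookup of this host's reachable IP, so the server
    registers a real address (e.g. 11.25.34.55) rather than 0.0.0.0."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # A UDP "connect" sends no packets; it only selects the source
        # address the OS would use to reach the target.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        # Offline fallback: resolve the hostname, else loopback.
        try:
            return socket.gethostbyname(socket.gethostname())
        except OSError:
            return "127.0.0.1"
    finally:
        s.close()
```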

```python
import os
```
Collaborator

Could this part of the script be moved into the CLI? Something like:

torchrun --nproc_per_node 2 lmdeploy serve node <model_path> --proxy-url ip:port

Collaborator Author

That doesn't seem feasible. Besides, providing the script directly makes it easier for users to adapt it to their own needs.

source /path/to/your/home/miniconda3/bin/activate /path/to/your/home/miniconda3/envs/your_env
export HOME=/path/to/your/home
# Get the master node's IP address (assuming MLP_WORKER_0_HOST holds the master node's IP)
MASTER_IP=${MLP_WORKER_0_HOST}
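For context, a minimal sketch of how such scheduler-injected variables typically drive per-node behavior. The defaults here are hypothetical stand-ins; on the real cluster, `MLP_WORKER_0_HOST` and `MLP_ROLE_INDEX` are provided by the scheduler.

```shell
#!/bin/bash
# Hypothetical defaults so the snippet runs outside a cluster.
MASTER_IP=${MLP_WORKER_0_HOST:-127.0.0.1}
ROLE_INDEX=${MLP_ROLE_INDEX:-0}

if [ "${ROLE_INDEX}" -eq 0 ]; then
  # Rank-0 node: this is where the proxy would be reachable.
  echo "master node: proxy at ${MASTER_IP}"
else
  echo "worker node ${ROLE_INDEX}: registering with ${MASTER_IP}"
fi
```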
Collaborator

Are MLP_WORKER_0_HOST and MLP_ROLE_INDEX environment variables on Volcano Engine (火山云)?

Collaborator Author

Yes.

@lvhan028 lvhan028 requested a review from RunningLeon December 26, 2024 03:23
Collaborator

@RunningLeon RunningLeon left a comment

LGTM

@lvhan028 lvhan028 merged commit d9b8372 into InternLM:main Dec 26, 2024
4 of 5 checks passed