
[Bug] All of the models that hang seem to be VLMs; is this a systemic problem? #2743

Open
DefTruth opened this issue Nov 13, 2024 · 14 comments

@DefTruth
Contributor

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I went through the existing issues about hangs and compared them with my own experience: the hangs all involve VLM models. Could this be related to the use of accelerate? I'm not sure whether accelerate uses NCCL-backed collective communication. If it does, then since the vision and llm modules are effectively pipelined, the vision module's collectives could conflict with the llm module's communication and cause a deadlock.
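To make the hypothesized failure mode concrete: if two pipelined components issue blocking collective operations over the same ranks but in different orders on different ranks, each rank can end up waiting for a collective its peer has not reached. The sketch below is only an analogy using plain C++ threads and mutexes (not NCCL and not LMDeploy code; the names vision_comm/llm_comm and rank0/rank1 are made up), showing the same ordering-dependent deadlock.

// Analogy only: ordering-dependent deadlock, the same class of problem as
// mismatched collective ordering across ranks. Running this program hangs
// by design: each thread ends up holding the lock the other one needs.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex vision_comm;  // stands in for the vision module's collective
std::mutex llm_comm;     // stands in for the llm module's collective

void rank0()
{
    std::lock_guard<std::mutex> a(vision_comm);  // "enters" the vision collective first
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> b(llm_comm);     // then waits on the llm collective
    std::puts("rank0 done");
}

void rank1()
{
    std::lock_guard<std::mutex> a(llm_comm);     // "enters" the llm collective first
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> b(vision_comm);  // then waits on the vision collective
    std::puts("rank1 done");
}

int main()
{
    std::thread t0(rank0), t1(rank1);
    t0.join();  // never returns
    t1.join();
    return 0;
}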

Reproduction

NOOP

Environment

NOOP

Error traceback

NOOP
@DefTruth
Contributor Author

DefTruth commented Nov 13, 2024

I rebuilt lmdeploy with today's latest code and the hang is still there. It mainly happens when serving a VLM with TP >= 2: some instances always end up dead, simply hanging. This problem has existed ever since I started using lmdeploy.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      On  |   00000000:16:00.0 Off |                    0 |
| 38%   47C    P2             73W /  425W |   19447MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:27:00.0 Off |                    0 |
| 36%   36C    P8             16W /  425W |   20501MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

@irexyc

Launch command:

lmdeploy serve api_server tmp/${MODEL_LLAVA} \
                 --model-name ${MODEL_LLAVA} \
                 --server-port 80 \
                 --backend turbomind \
                 --tp 2 \
                 --session-len 4096 \
                 --max-batch-size 128 \
                 --cache-max-entry-count 0.6 \
                 --enable-prefix-caching \
                 --vision-max-batch-size 1

@irexyc
Collaborator

irexyc commented Nov 14, 2024

accelerate does not use NCCL; it simply splits the weights and computation across multiple GPUs and runs them sequentially.

There have indeed been quite a few reports of this, but I have never been able to reproduce it reliably on my side. Could you turn on --log-level INFO and check what the server prints at the moment it gets stuck?

You can also check whether the problem still occurs once synchronization is added. The way to do that is:
export TM_DEBUG_LEVEL=DEBUG
lmdeploy serve api_server ... --log-level DEBUG

@DefTruth
Contributor Author

@irexyc Thanks a lot for the reply. We are trying export TM_DEBUG_LEVEL=DEBUG now, but this option produces a huge amount of logs and cannot be used in production. Is there another way to "add synchronization"? Also, what exactly does the "synchronization" mentioned here refer to?

@irexyc
Collaborator

irexyc commented Nov 14, 2024

Synchronization here means adding a cudaStreamSynchronize after every CUDA call, so that if a CUDA call has a problem it is exposed right away.
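For illustration, here is a minimal sketch of that per-call synchronization pattern. It is not LMDeploy/TurboMind's actual implementation: the SYNC_CHECK macro and dummy_kernel are made up for the example, and only show how synchronizing after every call surfaces an asynchronous failure at the offending call site instead of at some later, unrelated call.

// Minimal sketch (hypothetical SYNC_CHECK / dummy_kernel), not TurboMind code:
// force a cudaStreamSynchronize() after each CUDA call so errors surface early.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define SYNC_CHECK(stream)                                                     \
    do {                                                                       \
        cudaError_t err = cudaGetLastError();      /* launch-time errors  */   \
        if (err == cudaSuccess)                                                \
            err = cudaStreamSynchronize(stream);   /* async kernel errors */   \
        if (err != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                         cudaGetErrorString(err), __FILE__, __LINE__);         \
            std::abort();                                                      \
        }                                                                      \
    } while (0)

__global__ void dummy_kernel(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.f;
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float* d_x = nullptr;
    cudaMalloc(&d_x, 1024 * sizeof(float));

    dummy_kernel<<<4, 256, 0, stream>>>(d_x, 1024);
    SYNC_CHECK(stream);  // without the sync, a failure here might only show up
                         // much later in an unrelated call

    cudaFree(d_x);
    cudaStreamDestroy(stream);
    return 0;
}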

@DefTruth
Contributor Author

@irexyc It is hard to reproduce reliably. My observation is that on certain GPUs, e.g. the 4090D with TP=2, if a large number of instances are launched, about 1%~3% of them will hang, while other GPUs such as the L20/A30 do not show this.

@DefTruth
Contributor Author

Synchronization here means adding a cudaStreamSynchronize after every CUDA call, so that if a CUDA call has a problem it is exposed right away.

Is there a way to add the synchronization without the extra logging?

@irexyc
Collaborator

irexyc commented Nov 14, 2024

If you don't want that much logging, you don't need to set --log-level to debug; setting just the TM_DEBUG_LEVEL environment variable is enough:

export TM_DEBUG_LEVEL=DEBUG

@DefTruth
Contributor Author

OK, we'll give it a try. Thanks a lot~

@DefTruth
Contributor Author

If you don't want that much logging, you don't need to set --log-level to debug; setting just the TM_DEBUG_LEVEL environment variable is enough:

export TM_DEBUG_LEVEL=DEBUG

So --log-level can just stay at the default ERROR level then, right?

@irexyc
Collaborator

irexyc commented Nov 14, 2024

Yes.

@DefTruth
Contributor Author

Yes.

thanks~

@DefTruth
Contributor Author

DefTruth commented Nov 14, 2024

@irexyc Hi, one more question: is enable_custom_all_reduce enabled or disabled by default, e.g. when P2P is not supported? From the code it looks like it is only enabled with 8 GPUs.

    return;
}
if (rank_size != RANKS_PER_NODE) {
#ifdef BUILD_MULTI_GPU
    if (rank_size > 1) {
        TM_LOG_WARNING("Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.");
    }
#else
    FT_CHECK_WITH_INFO(rank_size == 1,
                       fmtstr("Custom All Reduce only supports 8 Ranks currently, got rank_size %ld. FT needs "
@lzhangzz
Collaborator

The enable_custom_all_reduce option currently has no effect.

@DefTruth
Contributor Author

The enable_custom_all_reduce option currently has no effect.

Got it, thanks~
