
[Bug] All of the models that hang seem to be VLMs; is this a systemic problem? #2743

Open
DefTruth opened this issue Nov 13, 2024 · 14 comments

@DefTruth
Contributor

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I went through the existing issues about hangs and compared them with my own experience: the hangs all involve VLM models. Could this be related to the use of accelerate? I'm not sure whether accelerate uses NCCL-backed collective communication. If it does, then since the vision and llm modules are effectively pipelined, the vision module's collectives could conflict with the llm module's communication and cause a deadlock.
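To make the hypothesized failure mode concrete: if two pipelined components issue blocking collective operations over the same ranks but in different orders on different ranks, each rank can end up waiting for a collective its peer has not reached. The sketch below is only an analogy using plain C++ threads and mutexes (not NCCL and not LMDeploy code; the names vision_comm/llm_comm and rank0/rank1 are made up), showing the same ordering-dependent deadlock.

// Analogy only: ordering-dependent deadlock, the same class of problem as
// mismatched collective ordering across ranks. Running this program hangs
// by design: each thread ends up holding the lock the other one needs.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex vision_comm;  // stands in for the vision module's collective
std::mutex llm_comm;     // stands in for the llm module's collective

void rank0()
{
    std::lock_guard<std::mutex> a(vision_comm);  // "enters" the vision collective first
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> b(llm_comm);     // then waits on the llm collective
    std::puts("rank0 done");
}

void rank1()
{
    std::lock_guard<std::mutex> a(llm_comm);     // "enters" the llm collective first
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> b(vision_comm);  // then waits on the vision collective
    std::puts("rank1 done");
}

int main()
{
    std::thread t0(rank0), t1(rank1);
    t0.join();  // never returns
    t1.join();
    return 0;
}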

Reproduction

NOOP

Environment

NOOP

Error traceback

NOOP
@DefTruth
Contributor Author

DefTruth commented Nov 13, 2024

I rebuilt lmdeploy with today's latest code and the hang is still there. It mainly happens when serving a VLM with TP >= 2: some instances always end up dead, simply hanging. This problem has existed ever since I started using lmdeploy.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      On  |   00000000:16:00.0 Off |                    0 |
| 38%   47C    P2             73W /  425W |   19447MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:27:00.0 Off |                    0 |
| 36%   36C    P8             16W /  425W |   20501MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

@irexyc

Launch command:

lmdeploy serve api_server tmp/${MODEL_LLAVA} \
                 --model-name ${MODEL_LLAVA} \
                 --server-port 80 \
                 --backend turbomind \
                 --tp 2 \
                 --session-len 4096 \
                 --max-batch-size 128 \
                 --cache-max-entry-count 0.6 \
                 --enable-prefix-caching \
                 --vision-max-batch-size 1

@irexyc
Collaborator

irexyc commented Nov 14, 2024

accelerate does not use NCCL; it simply splits the weights and computation across multiple GPUs and runs them sequentially.

There have indeed been quite a few reports of this, but I have never been able to reproduce it reliably on my side. Could you turn on --log-level INFO and check what the server prints at the moment it gets stuck?

You can also check whether the problem still occurs once synchronization is added. The way to do that is:
export TM_DEBUG_LEVEL=DEBUG
lmdeploy serve api_server ... --log-level DEBUG

@DefTruth
Contributor Author

@irexyc Thanks a lot for the reply. We are trying export TM_DEBUG_LEVEL=DEBUG now, but this option produces a huge amount of logs and cannot be used in production. Is there another way to "add synchronization"? Also, what exactly does the "synchronization" mentioned here refer to?

@irexyc
Collaborator

irexyc commented Nov 14, 2024

Synchronization here means adding a cudaStreamSynchronize after every CUDA call, so that if a CUDA call has a problem it is exposed right away.
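For illustration, here is a minimal sketch of that per-call synchronization pattern. It is not LMDeploy/TurboMind's actual implementation: the SYNC_CHECK macro and dummy_kernel are made up for the example, and only show how synchronizing after every call surfaces an asynchronous failure at the offending call site instead of at some later, unrelated call.

// Minimal sketch (hypothetical SYNC_CHECK / dummy_kernel), not TurboMind code:
// force a cudaStreamSynchronize() after each CUDA call so errors surface early.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define SYNC_CHECK(stream)                                                     \
    do {                                                                       \
        cudaError_t err = cudaGetLastError();      /* launch-time errors  */   \
        if (err == cudaSuccess)                                                \
            err = cudaStreamSynchronize(stream);   /* async kernel errors */   \
        if (err != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                         cudaGetErrorString(err), __FILE__, __LINE__);         \
            std::abort();                                                      \
        }                                                                      \
    } while (0)

__global__ void dummy_kernel(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.f;
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float* d_x = nullptr;
    cudaMalloc(&d_x, 1024 * sizeof(float));

    dummy_kernel<<<4, 256, 0, stream>>>(d_x, 1024);
    SYNC_CHECK(stream);  // without the sync, a failure here might only show up
                         // much later in an unrelated call

    cudaFree(d_x);
    cudaStreamDestroy(stream);
    return 0;
}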

@DefTruth
Contributor Author

@irexyc It is hard to reproduce reliably. My observation is that on certain GPUs, e.g. the 4090D with TP=2, if a large number of instances are launched, about 1%~3% of them will hang, while other GPUs such as the L20/A30 do not show this.

@DefTruth
Contributor Author

Synchronization here means adding a cudaStreamSynchronize after every CUDA call, so that if a CUDA call has a problem it is exposed right away.

Is there a way to add the synchronization without the extra logging?

@irexyc
Collaborator

irexyc commented Nov 14, 2024

If you don't want that much logging, you don't need to set --log-level to debug; setting just the TM_DEBUG_LEVEL environment variable is enough:

export TM_DEBUG_LEVEL=DEBUG

@DefTruth
Contributor Author

OK, we'll give it a try. Thanks a lot~

@DefTruth
Contributor Author

If you don't want that much logging, you don't need to set --log-level to debug; setting just the TM_DEBUG_LEVEL environment variable is enough:

export TM_DEBUG_LEVEL=DEBUG

So --log-level can just stay at the default ERROR level then, right?

@irexyc
Collaborator

irexyc commented Nov 14, 2024

Yes.

@DefTruth
Contributor Author

Yes.

thanks~

@DefTruth
Contributor Author

DefTruth commented Nov 14, 2024

@irexyc Hi, one more question: is enable_custom_all_reduce enabled or disabled by default, e.g. when P2P is not supported? From the code it looks like it is only enabled with 8 GPUs.

    return;
}
if (rank_size != RANKS_PER_NODE) {
#ifdef BUILD_MULTI_GPU
    if (rank_size > 1) {
        TM_LOG_WARNING("Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.");
    }
#else
    FT_CHECK_WITH_INFO(rank_size == 1,
                       fmtstr("Custom All Reduce only supports 8 Ranks currently, got rank_size %ld. FT needs "
@lzhangzz
Collaborator

The enable_custom_all_reduce option currently has no effect.

@DefTruth
Contributor Author

The enable_custom_all_reduce option currently has no effect.

Got it, thanks~
