
[Hardware][Ascend] Add Ascend NPU backend #8054

Open
wants to merge 2 commits into base: main

Conversation

wangshuai09
Contributor

@wangshuai09 wangshuai09 commented Aug 31, 2024

As mentioned in #7692, this PR makes the Ascend NPU backend available in vLLM.

RoadMap:

  • Ascend Executor
  • Ascend Worker
  • Ascend Model Runner
  • Ascend SingleOps Backend
    • custom_ops with native impl
    • padding for multi prompts
    • update vllm/attention/backends/ascend.py to the latest version.
    • model inference: opt, llama
    • multiproc
  • Platform for Ascend NPU
  • Server
  • update to torch 2.5.1

Supported Devices

  • Atlas 800I A2 Inference Server
  • Atlas 800T A2 Training Server
  • Atlas 300T A2 Training Card

Install

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vllm.
  3. Test with python examples/offline_inference_npu.py.
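The VLLM_TARGET_DEVICE switch in step 2 can be illustrated with a small sketch. This is hypothetical, not vLLM's actual selection logic; the function name resolve_target_device is made up for illustration:

```python
import os

def resolve_target_device(default: str = "cuda") -> str:
    # Hypothetical sketch: read the VLLM_TARGET_DEVICE environment variable
    # and validate it against the backends this build knows about.
    device = os.environ.get("VLLM_TARGET_DEVICE", default).lower()
    supported = {"cuda", "cpu", "npu"}
    if device not in supported:
        raise ValueError(f"unsupported VLLM_TARGET_DEVICE: {device!r}")
    return device

os.environ["VLLM_TARGET_DEVICE"] = "npu"
print(resolve_target_device())  # npu
```

Setting the variable only for the pip install invocation (as in step 2) scopes the backend choice to that one build.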

Using Dockerfile.npu

  1. Clone branch npu_support and step into vllm
git clone -b npu_support https://github.com/wangshuai09/vllm.git
cd vllm
  2. Build the docker image
docker build -t vllm-npu -f Dockerfile.npu .
  3. Run the docker container. Modify --device /dev/davinci0 according to your device.
docker run -dit -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /etc/ascend_install.info:/etc/ascend_install.info --device /dev/davinci0 --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc --shm-size 16G --name vllm vllm-npu:latest bash
  4. Enter the container
docker exec -it vllm bash

Collaborators

@MengqingCao @dgy516 @hi-liuyifeng @Lin-Qingyang-Alec @liujie92 @JiasenTian @weiwei567 @JuntongMa @xiangjie
@zhangxy1234 @ldh2020 @Eviannn @agoodnoob @rumoralot

This work is still a work in progress.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@seoibiubiu

Is there any document on how to use it?

@wangshuai09
Contributor Author

Is there any document on how to use it?

This work is not ready yet. If you want to develop this together, follow these steps:

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vllm.
  3. Test with python examples/offline_inference_npu.py; only a single prompt is supported for now.

@seoibiubiu

Is there any document on how to use it?

This work is not ready yet. If you want to develop this together, follow these steps:

  1. Install CANN; make sure the version matches torch 2.1.
  2. Run VLLM_TARGET_DEVICE=npu pip install -e . to install vllm.
  3. Test with python examples/offline_inference_npu.py; only a single prompt is supported for now.

Thank you very much, I'll try it.

@wyzanski

wyzanski commented Sep 2, 2024

[screenshot of the error]
I followed the above steps and got the following error. What is the reason?

@wangshuai09
Contributor Author

@wyzanski There is a fatal error from git; I think you need to recheck your git config.

@Aiwenqiuyu

Looking forward to support for domestic (Chinese) hardware!

@jkl375

jkl375 commented Sep 11, 2024

Thanks for the support for domestic hardware!

@MengqingCao
Contributor

MengqingCao commented Sep 11, 2024

TODO:

  • update vllm/attention/backends/ascend.py to the latest version.

@XYZliang

Thanks for the support for domestic hardware! Looking forward to the results on the Ascend series; an efficient inference engine is sorely needed there.

@beardog6

Is online inference supported?

@wangshuai09
Contributor Author

wangshuai09 commented Sep 18, 2024

Is online inference supported?

Do you mean starting an OpenAI-compatible API server? The latest code already supports that, like this:

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}

@XYZliang

What Ascend NPU devices are currently supported?
The latest version of lmdeploy also supports Ascend NPUs, but only the 910B and 310P are supported; other devices lack the operator support it requires and will need to wait for a CANN implementation. I encounter errors when testing with the 910A.
However, it seems that most users are using the Ascend 910A. Is it possible to adapt it directly?

@WangxuP

WangxuP commented Sep 18, 2024

Is online inference supported?

Do you mean starting an OpenAI-compatible API server? The latest code already supports that, like this:

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}

Is the Qwen series of LLMs supported?

@wangshuai09
Contributor Author

wangshuai09 commented Sep 18, 2024

Hi @XYZliang, the 910A is not supported for now; we will work on supporting more types of devices.

@wangshuai09
Contributor Author

@WangxuP we have not verified model correctness yet; here is a simple offline result:

INFO 09-18 10:03:24 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-18 10:03:24 selector.py:161] Using ASCEND_TORCH backend.
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
INFO 09-18 10:03:33 npu_model_runner.py:319] Starting to load model Qwen/Qwen2-7B-Instruct...
INFO 09-18 10:03:33 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-18 10:03:33 selector.py:161] Using ASCEND_TORCH backend.
INFO 09-18 10:03:34 weight_utils.py:235] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.90it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.43it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]

INFO 09-18 10:03:39 npu_model_runner.py:330] Loading model weights took 14.2487 GB
/workspace/cmq/ws-code/vllm/vllm/model_executor/layers/sampler.py:437: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:74.)
  top_p_mask[:, -1] = False
INFO 09-18 10:03:45 gpu_executor.py:122] # GPU blocks: 37996, # CPU blocks: 4681
Processed prompts: 100%|████████| 2/2 [00:04<00:00,  2.34s/it, est. speed input: 2.56 toks/s, output: 42.72 toks/s]
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president is the commander-in-chief of the armed forces, the head of the executive branch, and is responsible for enforcing federal laws, taking care that federal laws are faithfully executed, and serving as the commander in chief of the armed forces. The president is also the head of state and represents the nation to foreign governments and to the world at large. The president is the chief diplomat, the chief executive, and the chief legislator of'
Prompt: 'The future of AI is', Generated text: " here, and it's not just about robots and self-driving cars. AI is transforming every industry, from healthcare to finance, and it's changing the way we live and work. In this article, we'll explore the latest advancements in AI and how they're impacting our world.\nOne of the most exciting areas of AI research is natural language processing (NLP). NLP is the ability of machines to understand and interpret human language. This technology is being used to create chatbots, virtual assistants,"

@RogerWYQ

Should we install MindIE first?

@zhangzhiqiangcs

Is there a Dockerfile for NPU to build an image?

@XYZliang

e you could test on your device with latest code

Thanks for your help, I will test further today. Could you tell me the type of NPU chip or device name you are using?

@MengqingCao
Contributor

should we install mindie first?

Ascend SingleOps and Ascend MindIE are independent backends; you can use Ascend SingleOps now without installing MindIE. BTW, the MindIE backend is not ready yet.

@MengqingCao
Contributor

MengqingCao commented Sep 19, 2024

Is there a Dockerfile for npu to build image ?

The Dockerfile is ready now at https://github.com/vllm-project/vllm/pull/8054/files#diff-67922969885e8d987974f014c4c6e25fc2ae46b75760bcc6c93b9cc541268781

Using Dockerfile.npu

  1. Clone branch npu_support and step into vllm
git clone -b npu_support https://github.com/wangshuai09/vllm.git
cd vllm
  2. Build the docker image
docker build -t vllm-npu -f Dockerfile.npu .
  3. Run the docker container. Modify --device /dev/davinci0 according to your device.
docker run -dit -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /etc/ascend_install.info:/etc/ascend_install.info --device /dev/davinci0 --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc --shm-size 16G --name vllm vllm-npu:latest bash
  4. Enter the container
docker exec -it vllm bash

@MengqingCao
Contributor

e you could test on your device with latest code

Thanks for your help, I will test further today. Could you tell me the type of NPU chip or device name you are using?

We are using the Atlas 300T A2 training card.

@WWCTF

WWCTF commented Sep 19, 2024

[screenshot of the error]

How should I solve this?

@MengqingCao
Contributor

[screenshot of the error]

How should I solve this?

Could you share your environment info, including Python, CANN, and device name? And if you have confirmed your code is up to date with the latest in this PR, please give me a minimal reproduction method.

@WWCTF

WWCTF commented Sep 19, 2024

[screenshot of the error]
How should I solve this?

Could you share your environment info, including Python, CANN, and device name? And if you have confirmed your code is up to date with the latest in this PR, please give me a minimal reproduction method.

python: 3.10.12; CANN: toolkit_8.0.RC2, kernels_310P_8.0.RC2; inference card: 300I DUO

@MengqingCao
Contributor

Could you share your environment info, including Python, CANN, and device name? And if you have confirmed your code is up to date with the latest in this PR, please give me a minimal reproduction method.

python: 3.10.12; CANN: toolkit_8.0.RC2, kernels_310P_8.0.RC2; inference card: 300I DUO

According to the official documentation, this operator has more restrictions on the 310P:
https://www.hiascend.com/document/detail/zh/canncommercial/80RC22/apiref/appdevgapi/context/aclnnPromptFlashAttentionV3.md#%E7%BA%A6%E6%9D%9F%E4%B8%8E%E9%99%90%E5%88%B6
[screenshot of the constraints]

The current PR is developed based on the Atlas 300T A2 training card. If you are interested in supporting the 310P, you are welcome to join the development of this PR.

@WWCTF

WWCTF commented Sep 19, 2024

Could you share your environment info, including Python, CANN, and device name? And if you have confirmed your code is up to date with the latest in this PR, please give me a minimal reproduction method.

python: 3.10.12; CANN: toolkit_8.0.RC2, kernels_310P_8.0.RC2; inference card: 300I DUO

According to the official documentation, this operator has more restrictions on the 310P: https://www.hiascend.com/document/detail/zh/canncommercial/80RC22/apiref/appdevgapi/context/aclnnPromptFlashAttentionV3.md#%E7%BA%A6%E6%9D%9F%E4%B8%8E%E9%99%90%E5%88%B6 [screenshot of the constraints]

The current PR is developed based on the Atlas 300T A2 training card. If you are interested in supporting the 310P, you are welcome to join the development of this PR.

OK, thanks!

@wrennywang

Can multi-card inference be supported?

@MengqingCao
Contributor

[rank0]: RuntimeError: map:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:373 NPU function error: aclrtMallocPhysical, error code is 507899
[rank0]: [ERROR] 2024-11-26-22:57:15 (PID:2316891, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[rank0]: [Error]: An internal error occurs in the Driver module. 
[rank0]:         Rectify the fault based on the error information in the ascend log.
[rank0]: EL9999: Inner Error!
[rank0]: EL9999  [drv api]halMemCreate failed. drvRet=17.[FUNC:MallocPhysical][FILE:npu_driver.cc][LINE:5439]
[rank0]:         TraceBack (most recent call last):
[rank0]:         rtMallocPhysical execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
[rank0]:         malloc physical memory failed, runtime result = 507899[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

[rank0]:[W1126 22:57:15.656738355 compiler_depend.ts:659] Warning: 0Failed to find function aclrtSynchronizeDeviceWithTimeout (function operator())

It seems something went wrong with memory allocation. Could you share your environment info, including NPU type and the versions of CANN and the driver? It would also help if you could provide a reproduction method.

@guihonghao

[rank0]: RuntimeError: map:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:373 NPU function error: aclrtMallocPhysical, error code is 507899

It seems something went wrong with memory allocation. Could you share your environment info, including NPU type and the versions of CANN and the driver?

After I downgraded torch to 2.4.0, the above problem was solved. But now a new problem appears: RuntimeError: aclnnPromptFlashAttentionV3 or aclnnPromptFlashAttentionV3GetWorkspaceSize not in libopapi.so, or libopapi.so not found. How can this be solved?

[rank0]:   File "/data1/guihonghao/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/vllm/vllm/worker/model_runner.py", line 1654, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/data1/guihonghao/anaconda3/envs/uie/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/anaconda3/envs/uie/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/vllm/vllm/model_executor/models/qwen2.py", line 456, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/data1/guihonghao/vllm/vllm/compilation/decorators.py", line 143, in __call__
[rank0]:     return self.forward(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/vllm/vllm/model_executor/models/qwen2.py", line 306, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/data1/guihonghao/anaconda3/envs/uie/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/anaconda3/envs/uie/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/vllm/vllm/model_executor/models/qwen2.py", line 226, in forward
[rank0]:     hidden_states = self.self_attn(
[rank0]:   File "/data1/guihonghao/anaconda3/envs/uie/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/anaconda3/envs/uie/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/vllm/vllm/model_executor/models/qwen2.py", line 169, in forward
[rank0]:     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
[rank0]:   File "/data1/guihonghao/anaconda3/envs/uie/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/anaconda3/envs/uie/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data1/guihonghao/vllm/vllm/attention/layer.py", line 99, in forward
[rank0]:     return self.impl.forward(query,
[rank0]:   File "/data1/guihonghao/vllm/vllm/attention/backends/ascend.py", line 476, in forward
[rank0]:     output = torch_npu.npu_prompt_flash_attention(
[rank0]:   File "/data1/guihonghao/anaconda3/envs/uie/lib/python3.10/site-packages/torch/_ops.py", line 1061, in __call__
[rank0]:     return self_._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: aclnnPromptFlashAttentionV3 or aclnnPromptFlashAttentionV3GetWorkspaceSize not in libopapi.so, or libopapi.sonot found.
[rank0]: [ERROR] 2024-11-27-10:45:50 (PID:2487788, Device:0, RankID:-1) ERR01004 OPS invalid pointer

@huyz-git

[rank0]: RuntimeError: map:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:373 NPU function error: aclrtMallocPhysical, error code is 507899

After I downgraded torch to 2.4.0, the above problem was solved. But now a new problem appears: RuntimeError: aclnnPromptFlashAttentionV3 or aclnnPromptFlashAttentionV3GetWorkspaceSize not in libopapi.so, or libopapi.so not found. How can this be solved?

The aclnn operator does not exist; your CANN version is probably too old.

@new-TonyWang

new-TonyWang commented Nov 29, 2024

Qwen2.5 72B reports an error after enabling --enable-prefix-caching. With --enable-prefix-caching disabled and greedy sampling (temperature=0, top_p=1), inference on the same prompt gives inconsistent results across runs, and sometimes even the first token differs.
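The shape '[-1, 49, 8, 128]' is invalid for input of size 504832 error in the traceback below is a plain divisibility failure: a view with one inferred (-1) dimension needs the total element count to be a multiple of the product of the fixed dimensions, and 49 * 8 * 128 = 50176 does not divide 504832. A standalone illustration in plain Python (the helper infer_view_dim is made up; it only mimics torch's check):

```python
def infer_view_dim(numel: int, fixed_dims: tuple) -> int:
    # Mimic torch's view(-1, *fixed_dims): the -1 dimension can only be
    # inferred when numel is divisible by the product of the fixed dims.
    prod = 1
    for d in fixed_dims:
        prod *= d
    if numel % prod != 0:
        dims = ", ".join(str(d) for d in fixed_dims)
        raise RuntimeError(
            f"shape '[-1, {dims}]' is invalid for input of size {numel}")
    return numel // prod

# The failing case from the traceback: 504832 % 50176 == 3072, so the view fails.
try:
    infer_view_dim(504832, (49, 8, 128))
except RuntimeError as e:
    print(e)  # shape '[-1, 49, 8, 128]' is invalid for input of size 504832
```

In other words, the attention metadata claims 49 tokens per sequence while the actual KV buffer holds a different token count, which points at a metadata/shape mismatch rather than a numerical bug.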

Traceback (most recent call last):
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop.
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] 
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/worker_base.py", line 85, in start_worker_execution_loop
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/model_runner.py", line 1654, in execute_model
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/worker_base.py", line 343, in execute_model
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/worker_base.py", line 85, in start_worker_execution_loop
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/worker_base.py", line 343, in execute_model
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/model_runner_base.py", line 152, in _wrapper
(VllmWorkerProcess pid=2782673) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/model_executor/models/qwen2.py", line 456, in forward
(VllmWorkerProcess pid=2782674) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=2782677) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     raise type(err)(
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop.
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/model_runner.py", line 1654, in execute_model
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/model_executor/models/qwen2.py", line 456, in forward
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/compilation/decorators.py", line 143, in __call__
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/model_executor/models/qwen2.py", line 306, in forward
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     hidden_states, residual = layer(
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/model_executor/models/qwen2.py", line 226, in forward
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     hidden_states = self.self_attn(
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/model_executor/models/qwen2.py", line 169, in forward
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/attention/layer.py", line 99, in forward
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return self.impl.forward(query,
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/attention/backends/ascend.py", line 467, in forward
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     query = query.view(-1, attn_metadata.max_prefill_seq_len,
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] RuntimeError: shape '[-1, 49, 8, 128]' is invalid for input of size 504832
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] 
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] 
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/worker_base.py", line 85, in start_worker_execution_loop
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/worker_base.py", line 343, in execute_model
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]   File "/root/workspace/vllm/vllm/worker/model_runner_base.py", line 152, in _wrapper
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229]     raise type(err)(
(VllmWorkerProcess pid=2782672) ERROR 11-29 18:13:51 multiproc_worker_utils.py:229] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241129-181351.pkl): shape '[-1, 49, 8, 128]' is invalid for input of size 504832
ERROR 11-29 18:13:51 engine.py:135] RuntimeError("Error in model execution (input dumped to /tmp/err_execute_model_input_20241129-181351.pkl): shape '[-1, 49, 8, 128]' is invalid for input of size 504832")
ERROR 11-29 18:13:51 engine.py:135] Traceback (most recent call last):
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 11-29 18:13:51 engine.py:135]     return func(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/worker/model_runner.py", line 1654, in execute_model
ERROR 11-29 18:13:51 engine.py:135]     hidden_or_intermediate_states = model_executable(
ERROR 11-29 18:13:51 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 11-29 18:13:51 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 11-29 18:13:51 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/model_executor/models/qwen2.py", line 456, in forward
ERROR 11-29 18:13:51 engine.py:135]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/compilation/decorators.py", line 143, in __call__
ERROR 11-29 18:13:51 engine.py:135]     return self.forward(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/model_executor/models/qwen2.py", line 306, in forward
ERROR 11-29 18:13:51 engine.py:135]     hidden_states, residual = layer(
ERROR 11-29 18:13:51 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 11-29 18:13:51 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 11-29 18:13:51 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/model_executor/models/qwen2.py", line 226, in forward
ERROR 11-29 18:13:51 engine.py:135]     hidden_states = self.self_attn(
ERROR 11-29 18:13:51 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 11-29 18:13:51 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 11-29 18:13:51 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/model_executor/models/qwen2.py", line 169, in forward
ERROR 11-29 18:13:51 engine.py:135]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 11-29 18:13:51 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 11-29 18:13:51 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 11-29 18:13:51 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/attention/layer.py", line 99, in forward
ERROR 11-29 18:13:51 engine.py:135]     return self.impl.forward(query,
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/attention/backends/ascend.py", line 467, in forward
ERROR 11-29 18:13:51 engine.py:135]     query = query.view(-1, attn_metadata.max_prefill_seq_len,
ERROR 11-29 18:13:51 engine.py:135] RuntimeError: shape '[-1, 49, 8, 128]' is invalid for input of size 504832
ERROR 11-29 18:13:51 engine.py:135] 
ERROR 11-29 18:13:51 engine.py:135] The above exception was the direct cause of the following exception:
ERROR 11-29 18:13:51 engine.py:135] 
ERROR 11-29 18:13:51 engine.py:135] Traceback (most recent call last):
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/engine/multiprocessing/engine.py", line 133, in start
ERROR 11-29 18:13:51 engine.py:135]     self.run_engine_loop()
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
ERROR 11-29 18:13:51 engine.py:135]     request_outputs = self.engine_step()
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
ERROR 11-29 18:13:51 engine.py:135]     raise e
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
ERROR 11-29 18:13:51 engine.py:135]     return self.engine.step()
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/engine/llm_engine.py", line 1466, in step
ERROR 11-29 18:13:51 engine.py:135]     outputs = self.model_executor.execute_model(
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
ERROR 11-29 18:13:51 engine.py:135]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/executor/multiproc_gpu_executor.py", line 158, in _driver_execute_model
ERROR 11-29 18:13:51 engine.py:135]     return self.driver_worker.execute_model(execute_model_req)
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 11-29 18:13:51 engine.py:135]     output = self.model_runner.execute_model(
ERROR 11-29 18:13:51 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 11-29 18:13:51 engine.py:135]     return func(*args, **kwargs)
ERROR 11-29 18:13:51 engine.py:135]   File "/root/workspace/vllm/vllm/worker/model_runner_base.py", line 152, in _wrapper
ERROR 11-29 18:13:51 engine.py:135]     raise type(err)(
ERROR 11-29 18:13:51 engine.py:135] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241129-181351.pkl): shape '[-1, 49, 8, 128]' is invalid for input of size 504832
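For what it's worth, the numbers in the traceback are consistent with a padding mismatch rather than a model bug. This is a speculative reading (all figures taken from the error message; the head layout is inferred from the `view` shape):

```python
# Figures from the traceback: query.view(-1, 49, 8, 128) on 504832 elements.
num_elements = 504832            # flattened query size reported in the error
max_prefill_seq_len = 49         # padding target used by the Ascend backend
num_heads, head_dim = 8, 128     # per-head layout implied by the view shape

# Total tokens in the batch before padding:
total_tokens = num_elements // (num_heads * head_dim)
print(total_tokens)              # 493

# view(-1, 49, 8, 128) requires the element count to be divisible by 49*8*128:
print(num_elements % (max_prefill_seq_len * num_heads * head_dim))  # 3072, not 0
```

Since 493 is not a multiple of 49, the batch's sequences were apparently not all padded to `max_prefill_seq_len` before the reshape, which would line up with the "padding for multi prompts" item still open in the roadmap.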

image

mergify bot commented Nov 30, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wangshuai09.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@ljqiang17

I followed the above steps and reported the following error. What is the reason?

Same error. Have you solved this problem?

@beardog6

beardog6 commented Dec 9, 2024

The v1/chat/completions interface always returns a link to an image in the header. Have others encountered the same issue?

@tghfly

tghfly commented Dec 23, 2024

Using an Atlas 300T Pro training card (model: 9000), I get the following error:

INFO 12-23 06:53:00 model_runner.py:1035] Loading model weights took 0.9277 GB
INFO 12-23 06:53:00 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241223-065300.pkl...
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
INFO 12-23 06:53:06 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241223-065300.pkl.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1608, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 369, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 285, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 210, in forward
[rank0]:     hidden_states = self.self_attn(
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 157, in forward
[rank0]:     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/attention/layer.py", line 98, in forward
[rank0]:     return self.impl.forward(query,
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/attention/backends/ascend.py", line 473, in forward
[rank0]:     output = torch_npu.npu_prompt_flash_attention(
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/_ops.py", line 1061, in __call__
[rank0]:     return self_._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: call aclnnPromptFlashAttentionV3 failed, detail:EZ1001: [PID: 10600] 2024-12-23-06:53:00.641.293 PromptFlashAttention LaunchAicore failed.
[rank0]:         TraceBack (most recent call last):
[rank0]:         Parse dynamic kernel config fail.
[rank0]:         AclOpKernelInit failed opType
[rank0]:         PromptFlashAttention LaunchAicore failed.
[rank0]: [ERROR] 2024-12-23-06:53:00 (PID:10600, Device:0, RankID:-1) ERR01100 OPS call acl api failed

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/vllm/examples/offline_inference.py", line 14, in <module>
[rank0]:     llm = LLM(model="/mnt/models/Qwen2.5-0.5B-Instruct")
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 214, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 585, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 349, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/worker/npu_worker.py", line 148, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/worker/npu_model_runner.py", line 271, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/python3.9/lib/python3.9/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
[rank0]:     raise type(err)(
[rank0]: RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241223-065300.pkl): call aclnnPromptFlashAttentionV3 failed, detail:EZ1001: [PID: 10600] 2024-12-23-06:53:00.641.293 PromptFlashAttention LaunchAicore failed.
[rank0]:         TraceBack (most recent call last):
[rank0]:         Parse dynamic kernel config fail.
[rank0]:         AclOpKernelInit failed opType
[rank0]:         PromptFlashAttention LaunchAicore failed.

[rank0]: [ERROR] 2024-12-23-06:53:00 (PID:10600, Device:0, RankID:-1) ERR01100 OPS call acl api failed

My environment information is as follows:

系统:Ubuntu 22.04.5 LTS
Driver version: 24.1.rc2
CANN version: 8.0.RC3
# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 1     910B                | OK            | 63.1        34                0    / 0             |
| 0                         | 0000:81:00.0  | 0           2384 / 15038      1    / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
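As an aside, the environment details requested in reports like this (driver version, CANN version, `npu-smi info` output) can be collected with a small defensive helper that tolerates a missing tool. This is only an illustrative sketch, not part of the PR; the helper name is hypothetical:

```python
# Illustrative helper: run a diagnostic command (e.g. `npu-smi info`) and
# return its output, or a short placeholder when the tool is missing or fails.
import shutil
import subprocess


def tool_output(cmd: list[str]) -> str:
    """Return stdout of `cmd`, or a placeholder string on failure."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found"
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=10)
        return result.stdout.strip()
    except (subprocess.SubprocessError, OSError) as exc:
        return f"{cmd[0]}: failed ({exc})"


print(tool_output(["npu-smi", "info"]))
```

On a machine without the Ascend toolchain this prints `npu-smi: not found` instead of raising, which makes it safe to include in a bug-report script.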

@MengqingCao
Contributor

MengqingCao commented Dec 23, 2024

@tghfly Your device doesn't support op aclnnPromptFlashAttentionV3 currently.
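Failures like the traceback above could be caught earlier by probing whether an NPU-specific operator binding is present before dispatching to it. The sketch below is only illustrative and not part of this PR; the helper name and the fallback pattern in the comment are hypothetical:

```python
# Hypothetical probe: check whether a module exposes a given operator before
# dispatching to it, so an unsupported setup can fall back to a native
# implementation instead of failing later inside the kernel launch.
import importlib.util


def op_available(module_name: str, op_name: str) -> bool:
    """Return True if `module_name` is importable and exposes `op_name`."""
    if importlib.util.find_spec(module_name) is None:
        return False
    module = importlib.import_module(module_name)
    return hasattr(module, op_name)


# Illustrative dispatch pattern:
# if op_available("torch_npu", "npu_prompt_flash_attention"):
#     output = torch_npu.npu_prompt_flash_attention(...)
# else:
#     output = native_attention(...)  # hypothetical fallback path
```

Note that in this particular report the Python binding exists and the failure happens at kernel-launch time on the device, so in practice a `try`/`except` around the first call would also be needed; the attribute probe only rules out the simplest case.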


mergify bot commented Dec 24, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wangshuai09.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gournd

gournd commented Jan 7, 2025

[screenshot] I followed the above steps and got the error shown in the screenshot. What is the cause, and how can it be solved?

@gournd

gournd commented Jan 7, 2025

[screenshot] What is going on here?

@gournd

gournd commented Jan 7, 2025

[screenshot] Where can I modify the batch_size?

@gournd

gournd commented Jan 9, 2025

> I finished testing and it works, but the performance is still somewhat worse than MindIE.
>
> (quoted email reply from 2024-10-23: "Nice! Can 910B/310P be used now?")

@ccly1996 How did you run this on 310P? The results I get from inference still look wrong.

Is inference on 310P working properly now? I am still getting garbled output.

@MengqingCao
Contributor

@gournd 310P is not supported now

@inseptember

Does this support loading quantized models?

Co-authored-by: wangshuai09 <[email protected]>

Signed-off-by: MengqingCao <[email protected]>
@MengqingCao
Contributor

> Does this support loading quantized models?

Not supported now :-(

Signed-off-by: MengqingCao <[email protected]>