Feature Request: Access to Attention Matrices and/or KV-Cache during Inference
I'm wondering whether there is a way to obtain attention matrices or access the KV cache during inference with vLLM, similar to how the transformers package exposes them via the `output_attentions=True` parameter or the `past_key_values` attribute.
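For reference, this is roughly what I mean in transformers. A minimal sketch (the `gpt2` checkpoint is just an illustrative choice; any causal LM from the Hub behaves the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True, use_cache=True)

# Tuple of per-layer attention tensors, each of shape
# (batch, num_heads, seq_len, seq_len)
attentions = outputs.attentions

# Per-layer key/value tensors that can be passed back in as
# past_key_values on the next forward call
past_key_values = outputs.past_key_values

print(len(attentions), attentions[0].shape)
```

I'm looking for an equivalent hook in vLLM.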