
Add cuDNN flash attention sequence packing #25812

Open: wants to merge 1 commit into base: main
Conversation

@Cjkkkk (Contributor) commented Jan 9, 2025

  • Add cuDNN flash attention sequence packing support, where multiple batches (segments) can be packed into one batch.
  • Add two extra arguments, q_offsets and kv_offsets. The offsets tensors specify the starting position of each segment within a batch. q_seqlen and kv_seqlen are also required, to specify the actual sequence length of each segment in case of padding.
  • cuDNN accepts q_offsets of shape [S], where S is the number of segments. Since S can change at runtime from batch to batch, we design it to have shape [B, M], where B is the number of batches and M is the maximum number of segments. This allows static allocation; all non-zero entries of q_offsets are shifted left to form a shape-[S] tensor before being passed to cuDNN.
  • The unit test compares cuDNN against a reference that uses segment_id to generate a segment_mask.
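The reference path in the last bullet can be sketched in NumPy. This is illustrative only; the function name `segment_mask` and its exact signature are assumptions, not the PR's actual test code:

```python
import numpy as np

def segment_mask(q_segment_ids, kv_segment_ids):
    """Build a boolean attention mask from per-token segment ids.

    A query token may only attend to key/value tokens that carry the
    same segment id, which is how packed segments are kept isolated.
    """
    return q_segment_ids[:, None] == kv_segment_ids[None, :]

# Two segments packed into one sequence of 5 tokens: lengths 3 and 2.
ids = np.array([0, 0, 0, 1, 1])
mask = segment_mask(ids, ids)
# mask is block-diagonal: a 3x3 True block and a 2x2 True block.
```

The cuDNN path with offsets should produce the same outputs as applying this mask in a plain attention reference.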

Related XLA PR: openxla/xla#20861
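The [B, M] padding scheme described above can be sketched as follows. This is a minimal illustration, not the actual JAX/XLA implementation: the helper name `compact_offsets` and the global-position layout of the example offsets are assumptions:

```python
import numpy as np

def compact_offsets(padded_offsets):
    """Left-shift the non-zero entries of a statically shaped [B, M]
    offsets tensor into a flat [S] tensor, as described in the PR,
    before it is handed to cuDNN."""
    flat = padded_offsets.reshape(-1)
    return flat[flat != 0]

# Hypothetical example: batch 0 holds segments of length 3 and 5
# (segment starts at positions 0 and 3); batch 1 starts at global
# position 8 and holds a single segment. Unused slots are zero-padded
# so the tensor keeps a static [B, M] = [2, 3] shape.
q_offsets = np.array([[0, 3, 0],
                      [8, 0, 0]], dtype=np.int32)

compact_offsets(q_offsets)  # array([3, 8], dtype=int32)
```

Because only the non-zero entries survive compaction, the leading zero offset of the first segment would have to be implicit in this layout; the authoritative handling lives in the linked XLA change.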

@Cjkkkk (Contributor, Author) commented Jan 9, 2025

@superbobry Hi Sergei, could you help review this PR? Thanks!
