
Add cuDNN flash attention sequence packing #25812

Open: wants to merge 1 commit into base: main
Conversation

@Cjkkkk (Contributor) commented Jan 9, 2025

  • Add cuDNN flash attention sequence packing support, where multiple batches (segments) can be packed into one batch.
  • Add two extra arguments, q_offsets and kv_offsets. The offsets tensors specify the starting position of each segment within a batch. q_seqlen and kv_seqlen are also required, to specify the actual sequence length of each segment in case of padding.
  • cuDNN accepts q_offsets of shape [S], where S is the number of segments. Since S can change at runtime from batch to batch, we design it to have shape [B, M], where B is the number of batches and M is the maximum number of segments. This allows static allocation; all non-zero entries of q_offsets are shifted left to form a shape-[S] tensor before being passed to cuDNN.
  • The unit test compares cuDNN against a reference that uses segment_id to generate a segment_mask.
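The reference path in the last bullet can be sketched in NumPy. This is illustrative only; the function name `segment_mask` and its exact signature are assumptions, not the PR's actual test code:

```python
import numpy as np

def segment_mask(q_segment_ids, kv_segment_ids):
    """Build a boolean attention mask from per-token segment ids.

    A query token may only attend to key/value tokens that carry the
    same segment id, which is how packed segments are kept isolated.
    """
    return q_segment_ids[:, None] == kv_segment_ids[None, :]

# Two segments packed into one sequence of 5 tokens: lengths 3 and 2.
ids = np.array([0, 0, 0, 1, 1])
mask = segment_mask(ids, ids)
# mask is block-diagonal: a 3x3 True block and a 2x2 True block.
```

The cuDNN path with offsets should produce the same outputs as applying this mask in a plain attention reference.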

Related XLA PR: openxla/xla#20861
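The [B, M] padding scheme described above can be sketched as follows. This is a minimal illustration, not the actual JAX/XLA implementation: the helper name `compact_offsets` and the global-position layout of the example offsets are assumptions:

```python
import numpy as np

def compact_offsets(padded_offsets):
    """Left-shift the non-zero entries of a statically shaped [B, M]
    offsets tensor into a flat [S] tensor, as described in the PR,
    before it is handed to cuDNN."""
    flat = padded_offsets.reshape(-1)
    return flat[flat != 0]

# Hypothetical example: batch 0 holds segments of length 3 and 5
# (segment starts at positions 0 and 3); batch 1 starts at global
# position 8 and holds a single segment. Unused slots are zero-padded
# so the tensor keeps a static [B, M] = [2, 3] shape.
q_offsets = np.array([[0, 3, 0],
                      [8, 0, 0]], dtype=np.int32)

compact_offsets(q_offsets)  # array([3, 8], dtype=int32)
```

Because only the non-zero entries survive compaction, the leading zero offset of the first segment would have to be implicit in this layout; the authoritative handling lives in the linked XLA change.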

@Cjkkkk (Contributor, Author) commented Jan 9, 2025

@superbobry Hi Sergei, could you help review this PR? Thanks!
