
attention kernel does not work for sequence length = 1 #6

Open
adamomainz opened this issue Aug 15, 2024 · 0 comments
adamomainz commented Aug 15, 2024

Currently the attention kernel does not work in certain special cases, such as a query sequence length of 1. An example of this is the following shapes:
q.shape = torch.Size([4, 32, 1, 128])
k.shape = torch.Size([4, 32, 20, 128])
v.shape = torch.Size([4, 32, 20, 128])

https://github.com/triton-lang/kernels/blob/main/kernels/flash_attention.py#L23

Repro steps:

  1. add the Triton attention kernel here: https://github.com/triton-lang/kernels/blob/main/models/llama/llama/math_ops.py#L64
  2. run:
    CUDA_LAUNCH_BLOCKING=1 python3.9 -m main llama_chat_completion --profile=False --benchmark=False --ckpt_dir="models/llama/meta-llama/Meta-Llama-3-8B-Instruct/original" --tokenizer_path="models/llama/meta-llama/Meta-Llama-3-8B-Instruct/original/tokenizer.model" --use_triton=True
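For reference, a minimal standalone sketch of the failing case that skips the llama model entirely. This assumes the kernel is exposed as `attention(q, k, v, causal, sm_scale)` as in the Triton fused-attention tutorial it is based on (the actual import path and signature in `kernels/flash_attention.py` may differ), and compares against PyTorch's `scaled_dot_product_attention`:

```python
import torch

# Assumed import path and call signature; adjust to match kernels/flash_attention.py.
from kernels.flash_attention import attention

head_dim = 128
sm_scale = head_dim ** -0.5

# Shapes from the failing case: query sequence length is 1, key/value length is 20.
q = torch.randn(4, 32, 1, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(4, 32, 20, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(4, 32, 20, head_dim, device="cuda", dtype=torch.float16)

# Reference output from PyTorch's built-in scaled dot-product attention.
ref = torch.nn.functional.scaled_dot_product_attention(q, k, v, scale=sm_scale)

# Expected to fail (or produce wrong results) when the query sequence length is 1.
out = attention(q, k, v, False, sm_scale)
torch.testing.assert_close(out, ref, atol=2e-2, rtol=2e-2)
```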