(Enhancement) Applying mask to attention in one operation (3.5 Hiding future words with causal attention) #282

labdmitriy · 2024-07-22T19:13:05Z

Bug description

Hi Sebastian,

I don't think this is a bug, but rather a possible enhancement. To apply the mask, we currently have two steps:

Creating a lower triangular matrix of ones and zeros:

mask_simple = torch.tril(torch.ones(context_length, context_length))

Multiplying the attention matrix by the triangular matrix:

masked_simple = attn_weights * mask_simple

However, this function (torch.tril) can be applied directly to the attention matrix to get the same result:

torch.tril(attn_weights)
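A quick way to check this equivalence is sketched below. The random `attn_weights` tensor and the small `context_length` are stand-ins chosen for illustration, not values from the book. The comparison assumes finite weights; with `inf` or `nan` entries, multiplying by zero would behave differently from `torch.tril`.

```python
import torch

torch.manual_seed(123)

context_length = 6  # illustrative size, not taken from the book

# Stand-in attention weights: each row is a softmax over random scores
attn_weights = torch.softmax(torch.randn(context_length, context_length), dim=-1)

# Two-step version: build a 0/1 lower-triangular mask, then multiply
mask_simple = torch.tril(torch.ones(context_length, context_length))
masked_two_step = attn_weights * mask_simple

# One-step version: zero out everything above the diagonal directly
masked_one_step = torch.tril(attn_weights)

same = torch.equal(masked_two_step, masked_one_step)
print(same)
```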

Thank you.

What operating system are you using?

None

Where do you run your code?

None

Environment

labdmitriy (Author) · 2024-07-22T19:21:13Z

Probably the same applies here:

mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)

We can use only attn_scores to get the same result:

attn_scores.masked_fill(torch.triu(attn_scores, diagonal=1).bool(), -torch.inf)

But maybe your approach is more intuitive for demonstration purposes.
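One caveat with deriving the mask from `attn_scores` itself: `.bool()` maps an exact `0.0` to `False`, so a score of exactly zero above the diagonal would not be masked. With random floating-point scores this should be very unlikely, but the hand-picked tensor below (an illustrative assumption, not data from the book) shows the difference. Building the mask from `torch.ones_like(attn_scores)` is one possible way to keep it a single expression without depending on the score values.

```python
import torch

# Hand-picked scores with exact zeros above the diagonal (illustrative only)
attn_scores = torch.tensor([[1.0, 0.0, 2.0],
                            [3.0, 4.0, 0.0],
                            [5.0, 6.0, 7.0]])

# Two-step version: the mask depends only on the shape
mask = torch.triu(torch.ones(3, 3), diagonal=1)
two_step = attn_scores.masked_fill(mask.bool(), -torch.inf)

# Shortcut: the mask is derived from the score values
shortcut = attn_scores.masked_fill(
    torch.triu(attn_scores, diagonal=1).bool(), -torch.inf
)

# Value-independent single expression: the mask is built from ones
safe = attn_scores.masked_fill(
    torch.ones_like(attn_scores).triu(diagonal=1).bool(), -torch.inf
)

shortcut_matches = torch.equal(shortcut, two_step)  # zeros at (0, 1) and (1, 2) stay unmasked
safe_matches = torch.equal(safe, two_step)
print(shortcut_matches, safe_matches)
```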

rasbt (Maintainer) · 2024-07-23T02:10:59Z

Thanks for sharing. I agree, it could be applied directly. But as you said at the end of your comment, I did it step by step in the book to (hopefully) make it a bit more intuitive.

The other reason is that triu is quite expensive compared to just applying the mask once we have it. In the grand scheme of things, maybe not that much, but for realistic input sizes the difference here is ~30%:

import torch
import torch.nn as nn



class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x, triu=False):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
        
        if triu:
            attn_scores.masked_fill_(torch.triu(attn_scores, diagonal=1).bool(), -torch.inf)
        
        else:
            # Original mask truncated to the number of tokens and converted to boolean
            mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

            # Use the mask to fill attention scores
            attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2) 
        
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

torch.manual_seed(123)

context_length = 1024
d_in = 256
d_out = 256

mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

batch = torch.randn([8, 4, 256])

torch.equal(mha.forward(batch, triu=True), mha.forward(batch, triu=False))
# Returns True

%timeit mha.forward(batch, triu=False)
# 172 µs ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit mha.forward(batch, triu=True)
# 128 µs ± 6.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
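Timings like these may depend on the number of tokens and on where the boolean conversion of the mask happens, so it can help to time the masking step in isolation at more than one sequence length. The following is only a sketch: the helper names `mask_with_buffer` and `mask_with_triu`, the tensor shapes, and the repetition count are assumptions made for illustration, and no particular result is implied.

```python
import timeit

import torch


def mask_with_buffer(attn_scores, mask):
    # Mirrors the triu=False branch: convert the full mask, then slice it
    num_tokens = attn_scores.shape[-1]
    mask_bool = mask.bool()[:num_tokens, :num_tokens]
    return attn_scores.masked_fill(mask_bool, -torch.inf)


def mask_with_triu(attn_scores):
    # Mirrors the triu=True branch: derive the mask from the scores
    return attn_scores.masked_fill(
        torch.triu(attn_scores, diagonal=1).bool(), -torch.inf
    )


context_length = 1024
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)

timings = {}
for num_tokens in (4, 256):
    # Shape: (batch, num_heads, num_tokens, num_tokens)
    attn_scores = torch.randn(8, 2, num_tokens, num_tokens)
    t_buffer = timeit.timeit(lambda: mask_with_buffer(attn_scores, mask), number=20)
    t_triu = timeit.timeit(lambda: mask_with_triu(attn_scores), number=20)
    timings[num_tokens] = (t_buffer, t_triu)
    print(f"num_tokens={num_tokens}: buffer={t_buffer:.6f}s, triu={t_triu:.6f}s")
```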

I do like your simplification, though. I have a sheet where I collect ideas for interesting bonus content, and I think this would be a cool one for a "simplified" or "minimal" attention implementation.
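For illustration only, a "minimal" single-head causal attention along these lines might look like the sketch below. The function name `minimal_causal_attention` and the tensor shapes are hypothetical, and this is not the bonus material mentioned above. It builds the mask from `torch.ones_like` rather than from the scores, so that masking does not depend on the score values.

```python
import torch


def minimal_causal_attention(queries, keys, values):
    # queries, keys, values: (batch, num_tokens, head_dim)
    attn_scores = queries @ keys.transpose(-2, -1)

    # True above the diagonal, i.e. at positions that look at future tokens
    future = torch.ones_like(attn_scores).triu(diagonal=1).bool()
    attn_scores = attn_scores.masked_fill(future, -torch.inf)

    # Scaled softmax over the key dimension, then weight the values
    attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
    return attn_weights @ values


torch.manual_seed(123)
q = torch.randn(2, 5, 8)
k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8)

out = minimal_causal_attention(q, k, v)
print(out.shape)
```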

Thanks a lot for sharing!

labdmitriy Jul 23, 2024
Author

Thanks a lot for such a detailed answer!