
About Position Encoding #84

Open
pisiguiii opened this issue Aug 14, 2024 · 13 comments
Comments

@pisiguiii

pisiguiii commented Aug 14, 2024

Hi!

I want to ask: did you try using a position embedding with a learnable layer instead of the sine position encoder? If so, how did it behave?

I'm also curious about the final version of the visual prompt processing. As I understand from the paper, you concatenate the encoded boxes and the content embeddings, so if the encoded box embedding is 256-d and the content embedding is 256-d, the final CAT(B, C) dimension will be 512. Did you try summing these embeddings instead? Like:
Q = Linear(B + C) with d = 256?

@Mountchicken
Collaborator

Hi @pisiguiii
We didn't try a learnable position embedding, but I would expect it to perform similarly to the sincos position embedding; this has been verified in the DETR series. Specifically, DINO uses a sincos position embedding and we directly follow its implementation.

The content embedding and position embedding are added rather than concatenated.
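In code, that addition is roughly the following (a minimal sketch, not the exact repo code):

import torch

# Sketch only: the box embedding B and content embedding C are summed,
# so the query stays 256-d instead of becoming a 512-d concatenation.
K, D = 4, 256              # e.g. 4 visual prompts, 256-d embeddings
C = torch.randn(K, D)      # content embeddings
B = torch.randn(K, D)      # sincos box position embeddings

Q = C + B                  # [K, 256], not [K, 512]
print(Q.shape)             # torch.Size([4, 256])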

@yu-xi-wang

Hi @Mountchicken, thanks for the great work! I also have some questions related to this issue.

1. Regarding the prompt encoder, the paper says:

These content embeddings are concatenated with position embeddings along the channel dimension, and a linear layer is applied for projection, thereby constructing the input query embedding Q

But in the implementation, does it actually first compute [C;C'] + [B;B'], giving a tensor of shape [K + 1, 256], and then feed it into a linear layer that keeps the shape, so the output is Q = linear([C;C'] + [B;B']) with shape [K + 1, 256]?

2. In the next step, Q is used to extract and aggregate target regions by performing MSDeformAttn with the encoded features: MSDeformAttn(Q_j, b_j, {f_i}). In this formula, b_j is the j-th box, which should serve as the reference points, and Q_j is the j-th prompt query embedding. If my understanding in question 1 is correct, this Q_j already contains position embedding information because of the '+'. But in the DINO code, I frequently see
src2 = MSDeformAttn(self.with_pos_embed(src, pos), reference_points, src, ...)
src = src + self.dropout1(src2)
src = self.norm1(src)

This with_pos_embed simply returns src + pos, so I'm confused about which of the following implementations is correct:
1.

Q = linear( [C;C'] + [B;B'] )
src2 = MSDeformAttn(self.with_pos_embed(Q, B), B, f, ...)
Q = Q + self.dropout1(src2)
Q = self.norm1(Q)

2.

Q = linear( [C;C'] + [B;B'] )
src2 = MSDeformAttn(Q, B, f, ...)
[C;C'] = [C;C'] + self.dropout1(src2)
[C;C'] = self.norm1([C;C'])

Option 1 seems to add the position embedding twice, and it doesn't seem right that the position embedding would remain in the final V. Could you help me understand this? Thank you!

@Mountchicken
Collaborator

Mountchicken commented Aug 30, 2024

Hi @yu-xi-wang
Sorry for the late reply. I checked the code again and found an issue with the description of the prompt encoder in our paper. The correct calculation should be:

Q = [C; C']
Position = [B; B']
src2 = MSDeformAttn(self.with_pos_embed(Q, Position), box_or_point_coordinates, f, ...)
Q = Q + self.dropout1(src2)
Q = self.norm1(Q)
Visual_prompt_embedding = Q[:, -1]

The content embedding and position embedding are not concatenated but added during attention.
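For anyone reproducing this, below is a minimal runnable sketch of the update above. It uses standard multi-head cross-attention as a stand-in for MSDeformAttn, so it only illustrates the add-position-then-attend-then-residual structure (not the deformable sampling), and the class and variable names are illustrative rather than the actual repo code.

import torch
import torch.nn as nn

class VisualPromptEncoderLayerSketch(nn.Module):
    # Sketch only: nn.MultiheadAttention stands in for MSDeformAttn here,
    # so reference points / deformable sampling are not modeled.
    def __init__(self, d_model=256, n_heads=8, dropout=0.1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                dropout=dropout, batch_first=True)
        self.dropout1 = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(d_model)

    @staticmethod
    def with_pos_embed(tensor, pos):
        # Position embedding is added to the content, never concatenated.
        return tensor if pos is None else tensor + pos

    def forward(self, content, box_pos, image_features):
        # content:        [bs, K+1, 256] -> Q = [C; C']
        # box_pos:        [bs, K+1, 256] -> Position = [B; B']
        # image_features: [bs, N, 256]   -> flattened encoder features f
        q = self.with_pos_embed(content, box_pos)
        src2, _ = self.cross_attn(q, image_features, image_features)
        q = content + self.dropout1(src2)   # residual is on the content
        q = self.norm1(q)
        return q[:, -1]                     # visual prompt embedding (last/global token)

bs, K, N, D = 2, 3, 100, 256
layer = VisualPromptEncoderLayerSketch()
out = layer(torch.randn(bs, K + 1, D), torch.randn(bs, K + 1, D),
            torch.randn(bs, N, D))
print(out.shape)  # torch.Size([2, 256])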

@yu-xi-wang

Hi @Mountchicken, thank you so much for the reply! Yes, it makes sense to me now!

@VilisovEvgeny

VilisovEvgeny commented Sep 16, 2024

Hi @Mountchicken, hi @yu-xi-wang!

I tried to implement the positional encoding code as I understood it from the paper, but I ran into a problem: all of my encoded boxes have almost identical embeddings. Maybe you can help me understand what I'm missing?

This is my code:


def _boxes_embed(self, x):
    bs, K, D = x.size()  # K is the number of bounding boxes, D should be 4
    pe = torch.zeros(bs, K, D, self.num_pos_feats, device=x.device)

    # Create the scaling factor for positional encoding
    dim_t = torch.arange(self.num_pos_feats * 4, dtype=torch.float32, device=x.device)
    dim_t = 10000 ** (2 * (dim_t // 2) / (self.num_pos_feats * 4))

    for i in range(D):
        pos = x[:, :, i].unsqueeze(-1)  # Shape: [bs, K, 1]
        scaled_pos = pos / dim_t[self.num_pos_feats * i:self.num_pos_feats * (i + 1)]  # Shape: [bs, K, num_pos_feats]
        pe[:, :, i, 0::2] = torch.sin(scaled_pos[:, :, 0::2])  # Apply sine to even indices
        pe[:, :, i, 1::2] = torch.cos(scaled_pos[:, :, 1::2])  # Apply cosine to odd indices

    pe = pe.view(bs, K, -1)  # Concatenate the embeddings to get a shape of [bs, K, 256]
    return pe

@Mountchicken
Collaborator

@VilisovEvgeny
Here is the code that I use to get the position embedding:

import math
import torch

def gen_sineembed_for_position(pos_tensor):
    # n_query, bs, _ = pos_tensor.size()
    # sineembed_tensor = torch.zeros(n_query, bs, 256)
    scale = 2 * math.pi
    dim_t = torch.arange(128, dtype=torch.float32, device=pos_tensor.device)
    dim_t = 10000**(2 * (dim_t // 2) / 128)
    x_embed = pos_tensor[:, :, 0] * scale
    y_embed = pos_tensor[:, :, 1] * scale
    pos_x = x_embed[:, :, None] / dim_t
    pos_y = y_embed[:, :, None] / dim_t
    pos_x = torch.stack((pos_x[:, :, 0::2].sin(), pos_x[:, :, 1::2].cos()),
                        dim=3).flatten(2)
    pos_y = torch.stack((pos_y[:, :, 0::2].sin(), pos_y[:, :, 1::2].cos()),
                        dim=3).flatten(2)
    if pos_tensor.size(-1) == 2:
        pos = torch.cat((pos_y, pos_x), dim=2)
    elif pos_tensor.size(-1) == 4:
        w_embed = pos_tensor[:, :, 2] * scale
        pos_w = w_embed[:, :, None] / dim_t
        pos_w = torch.stack((pos_w[:, :, 0::2].sin(), pos_w[:, :, 1::2].cos()),
                            dim=3).flatten(2)

        h_embed = pos_tensor[:, :, 3] * scale
        pos_h = h_embed[:, :, None] / dim_t
        pos_h = torch.stack((pos_h[:, :, 0::2].sin(), pos_h[:, :, 1::2].cos()),
                            dim=3).flatten(2)

        pos = torch.cat((pos_y, pos_x, pos_w, pos_h), dim=2)
    else:
        raise ValueError("Unknown pos_tensor shape(-1):{}".format(
            pos_tensor.size(-1)))
    return pos
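For reference, a quick shape check when calling this function (assuming the boxes are already normalized to 0-1 in cxcywh order):

# Each coordinate gets a 128-d sincos embedding, so a 4-d box maps to 512 dims
# and a 2-d point maps to 256 dims.
n_query, bs = 5, 2
boxes = torch.rand(n_query, bs, 4)               # normalized cxcywh boxes
print(gen_sineembed_for_position(boxes).shape)   # torch.Size([5, 2, 512])

points = torch.rand(n_query, bs, 2)              # normalized cxcy points
print(gen_sineembed_for_position(points).shape)  # torch.Size([5, 2, 256])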

@VilisovEvgeny

VilisovEvgeny commented Sep 17, 2024

@Mountchicken thanks for the provided solution!

But I'm a little confused: why does the cosine similarity between the obtained position embeddings never drop below 0.7? Is that common behavior?

[image: cosine similarity matrix of the position embeddings]
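For reference, a self-contained way to reproduce this kind of check (random normalized cxcywh boxes through the gen_sineembed_for_position above, then pairwise cosine similarity; not necessarily exactly what I ran):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
boxes = torch.rand(32, 1, 4)                        # [n_query=32, bs=1, 4], normalized cxcywh
emb = gen_sineembed_for_position(boxes)[:, 0, :]    # [32, 512]
sim = F.cosine_similarity(emb.unsqueeze(1), emb.unsqueeze(0), dim=-1)  # [32, 32] pairwise
print(sim.min().item(), sim.max().item())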

@Mountchicken
Collaborator

I'm not sure if this is normal. Did you normalize the box coordinates to 0-1 before you got the position embedding?
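For example, the conversion would look something like this (a sketch with a hypothetical helper name, going from pixel-space xyxy boxes to the normalized cxcywh layout used by the code above):

import torch

def xyxy_to_norm_cxcywh(boxes_xyxy, img_w, img_h):
    # Pixel-space xyxy -> cxcywh normalized to 0-1.
    x1, y1, x2, y2 = boxes_xyxy.unbind(-1)
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return torch.stack([cx, cy, w, h], dim=-1)

print(xyxy_to_norm_cxcywh(torch.tensor([[100., 200., 300., 400.]]), 640, 480))
# tensor([[0.3125, 0.6250, 0.3125, 0.4167]])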

@VilisovEvgeny

VilisovEvgeny commented Sep 18, 2024

Yes, I did. I also checked that the boxes are in cxcywh format.
I've seen similar behavior in all of my sine encoding implementations.

@VilisovEvgeny

VilisovEvgeny commented Sep 19, 2024

I ran some tests on a small part of the LVIS dataset, checking all the embeddings (not only the last embedding [:, -1]), and this is what I get.
The labels are read as: (index of the dataset image sample)(class name)(index of the unique box "visual prompt" / the global visual prompt with box [0.5, 0.5, 1.0, 1.0])

Here I compared the final visual prompt embeddings:

[image: similarity matrix of the final visual prompt embeddings]

Here I compared the position-encoded box embeddings obtained from gen_sineembed_for_position:

[image: similarity matrix of the position-encoded box embeddings]

Compared with the DINOv repo, the prompt encoding part looks like MSDeformAttnTransformerEncoderLayer, and I literally copy-pasted that class. The matrices show that a global embedding (class token) is similar to the global embeddings from different images, yet has the lowest similarity with the visual prompt embeddings of its own sample. How is this possible? Did you run into problems like this?

@VilisovEvgeny

VilisovEvgeny commented Sep 24, 2024

@Mountchicken, could you help me with the issue I described previously? The final global embeddings are much more similar to other global embeddings than to the final embeddings of their own classes. I'm following the paper and using the GroundingDINO DeformAttnDecoderLayer as a base.

@Mountchicken
Collaborator

Hi @VilisovEvgeny
Sorry for the late reply. Based on your visualization, if I understand correctly, you visualized the similarity between global content embeddings of different categories across different images and found that they are quite similar. Since the global content embeddings are the same before entering the visual prompt encoder, the final outputs might also be quite similar. This is indeed an issue, and we haven't discussed this problem before.

@VilisovEvgeny

VilisovEvgeny commented Sep 24, 2024

Thanks for your reply, @Mountchicken!

I visualized not only the global embedding but all of the embeddings from the final output (so there is one embedding for each unique sample per class per image, plus one global embedding). The main problem is that the global embeddings from one image for different classes are too similar to each other, so when I try to fit my pipeline it doesn't even pass a sanity check on a small amount of data.

This is how the similarity between global embeddings of different classes, from the same image and from different images, looks:

[image: similarity matrix of the global embeddings across classes and images]

I understand that this is a lot to ask, but I would be very grateful if you could tell me the average similarity between global embeddings obtained from different classes, both from the same image and from different images.
Following both the paper and your advice from the issues, I can't tell whether I'm implementing the visual prompt encoder correctly or not.
