[GroundingDino] Fix grounding dino loss 🚨 #31828
base: main
Conversation
@amyeroberts FYI for some reason, when testing locally, …
c.c. @amyeroberts
Thanks for fixing!
Overall looks good. This is a breaking change, so we should append 🚨 to the PR title, but I think this is acceptable as it aligns us with the recommended loss calculation.
input_ids = torch.tensor([101, 3869, 1012, 11420, 1012, 1012, 102])
input_ids = input_ids.unsqueeze(0).expand(self.batch_size, -1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why switch to hard-coded input ids?
Otherwise, tests that use labels (and therefore compute a loss) would complain, because build_label_maps (here) would return None.
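To make that concrete, here is a minimal, hedged sketch (not the actual `build_label_maps` implementation) of how phrase positions can be derived from the hard-coded ids used in the test. It assumes the BERT uncased vocabulary, where 101/102 are [CLS]/[SEP] and 1012 is the "." separator; ids with no "." token would yield no phrases to map labels onto.

```python
import torch

def phrase_token_positions(input_ids, dot_id=1012, special_ids=(101, 102)):
    """Group token positions into "."-delimited phrases (one phrase per class label)."""
    phrases, current = [], []
    for pos, token_id in enumerate(input_ids.tolist()):
        if token_id == dot_id or token_id in special_ids:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(pos)
    if current:
        phrases.append(current)
    return phrases

ids = torch.tensor([101, 3869, 1012, 11420, 1012, 1012, 102])  # the hard-coded test ids
print(phrase_token_positions(ids))  # [[1], [3]] -> two phrases, so label maps can be built
```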
Perhaps you can add a comment to explain this
@EduardoPach Thanks for working on this loss. Just sharing some more well-developed code for fine-tuning GroundingDINO: https://github.com/open-mmlab/mmdetection/blob/main/configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py
c.c. @amyeroberts
Maybe @NielsRogge could have a look?
Cough cough, c.c. @amyeroberts
Thanks for the continued work on this!
Main comments are about the precision and clarity of the language in the docstrings and comments. It's important to write messages such that someone new to the code can understand them.
Overall it looks good. Let me also check PR #32483 for stable training convergence.
-def sigmoid_focal_loss(inputs, targets, num_boxes, alpha: float = 0.25, gamma: float = 2):
+# Similar to the one used in `DeformableDetr` but we pass `num_queries`, as `logits` are flattened
+# due to masked selection, and support different `reduction` modes.
+def sigmoid_focal_loss(
https://github.com/longzw1997/Open-GroundingDino/blob/main/models/GroundingDINO/utils.py#L138
Even though there are customizations in that code, I like the current version of sigmoid_focal_loss 👍🏼
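For readers following along, here is a hedged sketch of the kind of sigmoid focal loss with a selectable reduction that the thread is discussing; it is not the exact code added in the PR, and the final normalization by `num_boxes` is an assumption.

```python
import torch
import torch.nn.functional as F

def focal_loss_sketch(logits, targets, num_boxes, alpha: float = 0.25,
                      gamma: float = 2.0, reduction: str = "sum"):
    """Binary sigmoid focal loss with a configurable reduction mode."""
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    loss = ce * (1 - p_t) ** gamma                      # down-weight easy examples
    if alpha >= 0:
        loss = (alpha * targets + (1 - alpha) * (1 - targets)) * loss
    loss = loss.mean() if reduction == "mean" else loss.sum()
    return loss / num_boxes                             # per-box normalization (assumption)
```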
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
c.c. @amyeroberts
@@ -741,3 +780,53 @@ def test_cross_attention_mask(self):
        self.assertTrue(torch.allclose(outputs1.logits, outputs_batched.logits[:1], atol=1e-3))
        # For some reason 12 elements are > 1e-3, but the rest are fine
        self.assertTrue(torch.allclose(outputs2.logits, outputs_batched.logits[1:], atol=1.8e-3))

    def test_grounding_dino_loss(self):
        ds = load_dataset("EduardoPacheco/aquarium-sample", split="train")
Maybe we should move this to huggingface?
Maybe, wdyt @amyeroberts
Thanks for working on this, it is great to have it tested!
Do we actually need to load the whole dataset here? Can't we copy annotations and upload one image somewhere?
@qubvel by "upload one image somewhere" do you mean to an HF dataset or to the test fixtures?
yes, no, maybe?
@ydshieh can you please help? Do we have any place for test assets?
@EduardoPach @qubvel I think we can do it like this:
# We will verify our results on an image of cute cats
Hi. IIRC, that dataset contains only 2 examples, right? The easiest way is to have it as hf-internal-testing/aquarium-sample.
@qubvel I added you to hf-internal-testing (you have to accept the invitation), then you can create a copy of the above dataset.
Thanks for fixing and all the work iterating on this!
Just two things before we're ready to merge:
- This convo
- Slow model tests: could you push an empty commit with the message [run_slow] grounding_dino?
Slow tests are failing 🥲 I suppose most of them are in the same state on main, but the loss test is also failing, can you have a look? At the very least we need a better message in the assert so we can see the diff.
@qubvel Sorry, the problem was not synchronizing the device type of … The strange thing is that the local CLI returns the following, but the CI passes on this:
Can you also confirm this? -> turns out that CI was having some delay (local was correct).
I have these ones also failing locally; making a fix for cross_attention + batch equivalence.
Two tests failed in CI.
For the cpu-gpu test I have pretty large diffs locally, and even:
{
    "logits": tensor(0.0),
    "pred_boxes": tensor(0.8244),
    "last_hidden_state": tensor(0.9422),
    "init_reference_points": tensor(0.7221),
    "intermediate_hidden_states": tensor(2.5513),
    "intermediate_reference_points": tensor(0.8300),
    "encoder_last_hidden_state_vision": tensor(0.0001),
    "encoder_last_hidden_state_text": tensor(9.5665e-06),
    "enc_outputs_class": tensor(1.8090e-05),
    "enc_outputs_coord_logits": tensor(nan),
}
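For context, a hedged sketch of how per-output max-abs CPU/GPU differences like the ones above can be collected; `model` and `inputs` are placeholders (not the actual test code) and a dict-like model output is assumed.

```python
import torch

@torch.no_grad()
def max_abs_diffs(model, inputs, device="cuda"):
    """Run the same inputs on CPU and GPU and report max |cpu - gpu| per output tensor."""
    model.eval()
    cpu_out = model.to("cpu")(**{k: v.to("cpu") for k, v in inputs.items()})
    gpu_out = model.to(device)(**{k: v.to(device) for k, v in inputs.items()})
    diffs = {}
    for name, cpu_tensor in cpu_out.items():
        gpu_tensor = gpu_out[name]
        if isinstance(cpu_tensor, torch.Tensor) and isinstance(gpu_tensor, torch.Tensor):
            diffs[name] = (cpu_tensor - gpu_tensor.to("cpu")).abs().max()
    return diffs
```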
… is failing on our CI (which can be found in …)
… has been failing for quite a long time 😅 but if you are willing to check it, we'd appreciate it a lot; it's not necessary for merging this PR, though. (but …
@ydshieh
@SangbumChoi do you mean the loss test is random-seed dependent? In that case, we can either slightly increase the tolerance or mark it as …
@qubvel Still not sure about it. I will debug deeper tomorrow, but the fact is that sometimes the CI passes and sometimes it fails 😭 (Or there might be a computation issue with torch.backends.cudnn.deterministic = True)
Ok, sure!
Hi @SangbumChoi, from the … Or does the …
Yeah, otherwise can you explain in more detail the reason why …
with the current … Maybe a better question is why we are not using …
@stevenwudi Yeah, good point. Even though the loss scale is large, it eventually trains well in my experiment, but as you suggested it could be better if we set …
(Actually, I don't have any experiment stats to support this, and I am happy to know that it trained well in your experiment.)
@SangbumChoi I did this before but switched back to the default.
@stevenwudi first of all, thanks for spending your time looking into the code! I had the same thought, but then after some experiments (as mentioned above) I set …
This is not used in their code. See here. Thus the expected value, since we're basing ourselves on … Btw, if you actually look at the …
@EduardoPach thanks for the detailed explanation, code link, and the training loss, this is super helpful. Nit: curious about the extremely large value of …
Hence, I cannot help but wonder: does the high …
Hmm, good to know. It seems the Open-GroundingDino code is really quite messy; glad we will have this HF implementation soon 👍
@ydshieh @qubvel @EduardoPach The reason the CI test was failing was the difference between CPU and GPU. There is a slight difference right from the beginning (e.g. …
@EduardoPach I think test_grounding_dino_loss has the same issue as I stated above. So as a solution we can recalculate the expected values on the GPU, enlarge the tolerance, or remove this test function.
Additional analysis for the cpu/gpu difference:
Difference: 0 … The backbone also has some large differences at the second and last layers. Also, SwinEmbedding has a 1e-6 difference even though it is at the very beginning of the architecture (a per-module comparison sketch follows).
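A hedged sketch of how such a layer-by-layer comparison can be done with forward hooks; the helper and the idea of comparing captured outputs across devices are illustrative, not the exact analysis code used above.

```python
import torch

def capture_module_outputs(model, inputs, module_names):
    """Capture the outputs of selected submodules during one forward pass."""
    captured, handles = {}, []
    for name, module in model.named_modules():
        if name in module_names:
            handles.append(module.register_forward_hook(
                lambda mod, inp, out, name=name: captured.__setitem__(name, out)
            ))
    with torch.no_grad():
        model(**inputs)
    for handle in handles:
        handle.remove()
    return captured

# Usage idea: capture once on CPU and once on GPU, then inspect
# (cpu_captured[name] - gpu_captured[name].cpu()).abs().max() per module.
```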
@qubvel requesting a review
We are running CI on a T4 GPU: we can update the expected values accordingly on our side.
updated the values for …
Hi! I suppose we can merge this, as the test failures are unrelated! One thing it would be nice to do before merging: I see @ArthurZucker started the initiative of moving losses to a separate module; let's update this branch and move this loss there too. Thanks!
Basically, just moving the loss implementation to a new file in the …
It would be great if you could handle it this weekend. @EduardoPach
What does this PR do?
Fixes #31434
As the original repo doesn't provide the loss implementation, I'm using the one implemented here as a baseline, since it was mentioned by the original repo (in this issue: IDEA-Research/GroundingDINO#241) as a reliable source if one wants to train a GroundingDino model.

TODO:
- Make sure GroundingDinoMatcher and GroundingDinoLoss are working properly

Explanation of the Issue and Solution
So the issue was that GroundingDinoLoss and GroundingDinoHungarianMatcher were just a copy from DeformableDetr, which is used for closed-set object detection (i.e. a fixed set of categories), whereas in GroundingDino there's no limited set of categories and the output logits are d_model-dimensional, where the first seq_len elements have a specified value and the subsequent ones are nan. The main differences are:
- class_labels are associated with the text prompt used

For instance, for an image with bounding boxes for fishes and jellyfishes and the prompt "fish. jellyfish.", fish should have class_label 0 assigned to it and jellyfish should have 1. If the positions of jellyfish and fish in the prompt were swapped, the class_labels would swap as well. Moreover, jellyfish is represented by two tokens ([20919, 7529]) and fish by one token ([3869]), therefore we need to select the appropriate logits for each class, as illustrated in the sketch below.
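A hedged worked example of the mapping just described (an illustration, not the PR's build_label_maps code), using the token ids quoted in the paragraph above:

```python
import torch

# Prompt "fish. jellyfish." tokenized (per the description): [CLS]=101, fish=3869,
# "."=1012, jellyfish=[20919, 7529], "."=1012, [SEP]=102
input_ids = torch.tensor([101, 3869, 1012, 20919, 7529, 1012, 102])

# One row per class label, one column per token position:
# class_label 0 ("fish") -> position 1; class_label 1 ("jellyfish") -> positions 3 and 4
label_map = torch.zeros(2, input_ids.numel())
label_map[0, 1] = 1.0
label_map[1, 3] = 1.0
label_map[1, 4] = 1.0

# Swapping the phrases in the prompt would swap the rows, which is why class_labels
# depend on the prompt rather than on a fixed category vocabulary.
```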
As the original implementation doesn't provide the training loop or the loss implementation, but does recommend other implementations for training GroundingDino (in this issue: IDEA-Research/GroundingDINO#241), I took as a baseline the implementation from Open-GroundingDino, as it supports both visual grounding and object detection and they've trained their own GroundingDino using their code base, achieving good performance.

Things added in this PR are:
- build_label_maps, which generates a list of torch.Tensor with length batch_size, mapping each category to its corresponding tokens based on the input_ids
- build_text_mask, which just expands the attention_mask to select the appropriate tokens when computing GroundingDino.loss_labels (see the sketch after this list)
- Added enc_topk_proposals, encoder_logits and encoder_pred_boxes to GroundingDinoModelOutput and GroundingDinoObjectDetectionOutput to compute the first-stage loss
- Added class_loss_coefficient (with a correct default value) and class_loss_reduction to GroundingDinoConfig. class_loss_reduction was added because in sigmoid_focal_loss from the baseline implementation they reduce loss_ce with a simple sum, but that makes the losses imbalanced most of the time; the original implementation does have a sigmoid_focal_loss implemented, but using mean reduction, therefore I decided to make it configurable and use the sum one for testing reasons
- Updated GroundingDinoLoss and GroundingDinoHungarianMatcher
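As referenced in the build_text_mask item above, here is a hedged sketch of that kind of mask expansion (not the exact helper from the PR):

```python
import torch

def expand_text_mask(attention_mask: torch.Tensor, num_queries: int) -> torch.Tensor:
    """Expand a (batch, seq_len) attention mask to (batch, num_queries, seq_len)
    so the valid text logits can be selected for every query."""
    return attention_mask.bool().unsqueeze(1).expand(-1, num_queries, -1)

mask = torch.tensor([[1, 1, 1, 0, 0]])
print(expand_text_mask(mask, num_queries=3).shape)  # torch.Size([1, 3, 5])
```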
Also added a new integration test called test_grounding_dino_loss, where I compare the loss obtained from 2 sample images with the baseline implementation from Open-GroundingDino.
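A hedged sketch of the general shape such a loss-comparison test can take; the processor/label format follows the DETR-style convention in transformers, and the expected value below is a placeholder, not the number used in the PR.

```python
import torch

def check_loss_against_baseline(model, processor, images, annotations, text="fish. jellyfish."):
    inputs = processor(images=images, text=[text] * len(images), return_tensors="pt")
    labels = [
        {"class_labels": torch.tensor(ann["class_labels"]), "boxes": torch.tensor(ann["boxes"])}
        for ann in annotations
    ]
    with torch.no_grad():
        outputs = model(**inputs, labels=labels)
    expected_loss = torch.tensor(0.0)  # placeholder for the value from Open-GroundingDino
    assert torch.allclose(outputs.loss, expected_loss, atol=1e-3), f"loss was {outputs.loss}"
```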
c.c. @amyeroberts