Implementation details of the loss scaling algorithm #53

Open
jiajingk opened this issue Mar 14, 2024 · 1 comment

Comments


jiajingk commented Mar 14, 2024

Thank you for sharing such great work!

I ran into the following problem while reproducing your paper and was wondering whether you could offer some guidance or clarification.

I trained the 2M VIMAPolicy (with all weights except T5 initialized from their default initial distributions) on a small subset of the VIMA-Bench dataset (32 samples per task, 13 tasks in total) and tried to make it overfit.

The imitation loss (computed as `cross_entropy_loss(dist_dict._logits, discrete_target_action)`) behaves very differently across action attributes (such as pose0_rotation and pose1_position) during training, as shown in the plot below. In this experiment, the final loss is the equally weighted sum of the losses of all action attributes, normalized by the number of time steps.


The plot shows how the per-step losses of the different action attributes converge.
For example, `pose0_rotation_0` is the loss associated with the first dimension of `pose0_rotation` at a single time step.
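
To make the aggregation described above concrete (an equal-weight sum of per-dimension cross-entropy terms, divided by the number of time steps), here is a minimal pure-Python sketch. It is not the VIMA training code: the helper names `cross_entropy` and `total_imitation_loss` and the nested-dict data layout are assumptions made for illustration only.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one categorical prediction: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def total_imitation_loss(logits_per_step, targets_per_step):
    """Equal-weight sum over action-attribute dimensions, normalized by time steps.

    logits_per_step:  list with one dict per time step, mapping a key such as
                      "pose0_rotation_0" to the logits over that dimension's bins
                      (hypothetical layout, assumed for this sketch).
    targets_per_step: list of dicts mapping the same keys to the target bin index.
    Returns (total_loss, per_key_loss), both divided by the number of time steps.
    """
    n_steps = len(logits_per_step)
    per_key = {}
    for step_logits, step_targets in zip(logits_per_step, targets_per_step):
        for key, logits in step_logits.items():
            loss = cross_entropy(logits, step_targets[key])
            per_key[key] = per_key.get(key, 0.0) + loss
    per_key = {key: value / n_steps for key, value in per_key.items()}
    return sum(per_key.values()), per_key
```

Returning the per-key losses alongside the total makes it possible to log each curve (for example `pose0_rotation_0`) separately, which is how the per-attribute comparison in this issue can be reproduced.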

Zooming in on the first and last 100 epochs of the experiment shows that all dimensions of pose0_rotation and the first two dimensions of pose1_rotation converge to zero very quickly, while the other losses converge relatively slowly. The relative scale between them changes dynamically during training.


First 100 epochs


Last 100 epochs

In the same experiment, I also measured the ratio of the average loss between tasks and got the following table. For example, 16.745474 means that the average loss of rearrange_then_restore samples is about 16.7 times that of novel_noun samples.

| Task | Average loss relative to novel_noun |
| --- | --- |
| novel_noun | 1.000000 |
| sweep_without_exceeding | 1.602642 |
| rotate | 1.857377 |
| visual_manipulation | 1.998764 |
| twist | 3.802508 |
| manipulate_old_neighbor | 4.956325 |
| scene_understanding | 5.033336 |
| follow_order | 5.132609 |
| rearrange | 5.827855 |
| pick_in_order_then_restore | 11.248917 |
| rearrange_then_restore | 16.745474 |
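
For reference, ratios like these can be computed by averaging the per-sample loss within each task and dividing by the average of a reference task. The sketch below is an assumed reconstruction of that bookkeeping, not the exact script used here; the function name `task_loss_ratios` and the `(task_name, loss)` input format are hypothetical.

```python
from collections import defaultdict


def task_loss_ratios(sample_losses, reference="novel_noun"):
    """Mean loss per task, divided by the mean loss of the reference task.

    sample_losses: iterable of (task_name, loss) pairs, one per sample
                   (hypothetical input format, assumed for this sketch).
    Returns a dict sorted by ascending ratio.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for task, loss in sample_losses:
        sums[task] += loss
        counts[task] += 1
    means = {task: sums[task] / counts[task] for task in sums}
    baseline = means[reference]
    ratios = {task: mean / baseline for task, mean in means.items()}
    return dict(sorted(ratios.items(), key=lambda item: item[1]))
```

Passing a different `reference` only rescales the column; the ordering of tasks is unchanged.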

I would like to know how these losses (per action attribute and per task) are balanced during training. Thank you.

@amitkparekh

I've run into similar questions. I released everything I did here: https://github.com/amitkparekh/CoGeLoT. Maybe it has some answers to your questions?
