## Summary
Thanks to @ByronHsu, who identified that the implementation in #378 lacked a reference model for DPO, effectively making it CPO (Contrastive Preference Optimization) instead. To address this issue, I have:

- incorporated the reference-model log-probabilities `ref_chosen_logps` and `ref_rejected_logps` into the DPO loss computation.

These changes ensure that DPO tests and benchmarks now function correctly.
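To make the distinction concrete (a sketch of the relationship, not text from either PR): CPO optimizes the preference term

$$-\log \sigma\big(\beta\,(\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x))\big),$$

which is what DPO degenerates to when the reference log-probabilities are dropped; restoring $\pi_{\mathrm{ref}}$ recovers the true DPO objective below.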
## DPO Loss Formulation
As mentioned in the previous PR #378, the loss was previously computed from the policy log-probabilities alone. In a reference setting, we get the formula:

$$r_\theta(x, y) = \beta \left(\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right)$$

For the loss:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) = -\log \sigma\Big(\beta \big[(\log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x)) - (\log \pi_\theta(y_l \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x))\big]\Big)$$

where $y_w$ and $y_l$ are the chosen and rejected responses. This corresponds to the code:
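A minimal PyTorch sketch of that computation is below. `ref_chosen_logps` and `ref_rejected_logps` are the inputs added in this PR; the other names and the standalone-function form are assumed for illustration, and the actual in-repo implementation is fused rather than written out like this.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    chosen_logps: torch.Tensor,        # (B,) policy log-probs of chosen responses
    rejected_logps: torch.Tensor,      # (B,) policy log-probs of rejected responses
    ref_chosen_logps: torch.Tensor,    # (B,) reference log-probs of chosen responses
    ref_rejected_logps: torch.Tensor,  # (B,) reference log-probs of rejected responses
    beta: float = 0.1,
) -> torch.Tensor:
    # Log-ratios under the policy and under the reference model.
    pi_logratios = chosen_logps - rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]
    logits = beta * (pi_logratios - ref_logratios)
    # L_DPO = -log sigma(logits), averaged over the batch.
    return -F.logsigmoid(logits).mean()

# Stand-in usage with random log-probabilities:
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```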
## Testing Done
Updated benchmarks:

- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence