
Fix DPO with Reference Model #387

Closed
wants to merge 4 commits

Conversation

@austin362667 (Collaborator) commented Nov 15, 2024

Summary

Thanks to @ByronHsu, who identified that the implementation in #378 lacked a reference model for DPO, effectively making it CPO (Contrastive Preference Optimization) instead. To address this, I have:

  1. Added a reference model
  2. Implemented ref_chosen_logps and ref_rejected_logps
  3. Incorporated a partial function in the forward pass (see the sketch below)

These changes ensure that DPO tests and benchmarks now function correctly.
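For item 3, here is a minimal sketch of how the reference log-probs could be pre-bound into the loss function with functools.partial; the names (dpo_loss, beta) and shapes are illustrative, not the exact Liger-Kernel API:

from functools import partial

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Policy advantage over the reference model for each completion
    chosen_advantages = policy_chosen_logps - ref_chosen_logps
    rejected_advantages = policy_rejected_logps - ref_rejected_logps
    # DPO objective: -log sigmoid(beta * margin), averaged over the batch
    return -F.logsigmoid((chosen_advantages - rejected_advantages) * beta).mean()

# Dummy per-sequence log-probs for illustration (batch of 4 preference pairs)
policy_chosen_logps = torch.randn(4)
policy_rejected_logps = torch.randn(4)
ref_chosen_logps = torch.randn(4)
ref_rejected_logps = torch.randn(4)

# Pre-bind the fixed reference log-probs so the forward pass
# only needs to supply the policy log-probs.
loss_fn = partial(dpo_loss,
                  ref_chosen_logps=ref_chosen_logps,
                  ref_rejected_logps=ref_rejected_logps)
loss = loss_fn(policy_chosen_logps, policy_rejected_logps)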

DPO Loss Formulation

As mentioned in the previous PR #378, without a reference model the implicit reward difference reduces to:

$$r_\theta(x,y_c) - r_\theta(x,y_r) = \log\pi_\theta(y_c|x) - \log\pi_\theta(y_r|x)$$

With a reference model, the DPO loss becomes:

$$-\log\sigma\Big(\beta\Big[\big(\log\pi_\theta(y_c|x) - \log\pi_{\theta_{\text{ref}}}(y_c|x)\big) - \big(\log\pi_\theta(y_r|x) - \log\pi_{\theta_{\text{ref}}}(y_r|x)\big)\Big]\Big)$$

This corresponds to the code:

import torch.nn.functional as F

# log_probs(...) denotes the summed log-probabilities of the target tokens
# under the given logits (one value per sequence).

# Policy model log probabilities
policy_chosen_logps = log_probs(policy_chosen_logits)
policy_rejected_logps = log_probs(policy_rejected_logits)

# Reference model log probabilities
ref_chosen_logps = log_probs(ref_chosen_logits)
ref_rejected_logps = log_probs(ref_rejected_logits)

# Compute advantages of the policy over the reference model
chosen_advantages = policy_chosen_logps - ref_chosen_logps
rejected_advantages = policy_rejected_logps - ref_rejected_logps

# policy_chosen_logps - ref_chosen_logps - policy_rejected_logps + ref_rejected_logps
logits_diff = (chosen_advantages - rejected_advantages) * beta

# DPO loss
losses = -F.logsigmoid(logits_diff)
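The log_probs helper above is shorthand; a rough sketch of what it might do, under the assumption that it gathers and sums the target-token log-probabilities (prompt/padding masking omitted for brevity):

import torch

def log_probs(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)
    logp = torch.log_softmax(logits, dim=-1)
    # Log-probability assigned to each target token ...
    token_logps = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # ... summed over the sequence: one log-prob per completion
    return token_logps.sum(dim=-1)

# Example: batch of 2 completions, 5 tokens each, vocab size 10
logits = torch.randn(2, 5, 10)
targets = torch.randint(0, 10, (2, 5))
print(log_probs(logits, targets))  # shape (2,)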

Testing Done

Updated benchmarks:

[Benchmark plots: dpo_loss_speed]

  • Hardware Type: NVIDIA A100 (40GB)
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

Signed-off-by: Austin Liu <[email protected]>
@austin362667 (Collaborator, Author) commented:

This is wrong!! The correct implementation is in #405.

What I did wrong:

# This is incorrect:
ref_chosen_logps = torch.randn(B // 2, device="cuda", dtype=dtype)
ref_rejected_logps = torch.randn(B // 2, device="cuda", dtype=dtype)

Why I'm wrong:
I should not create random tensors for ref_chosen_logps and ref_rejected_logps. Here's why:

In DPO, the reference log probabilities MUST come from evaluating the reference model on the same inputs.
My random tensors break the crucial relationships between:

  • The input sequences
  • The policy model's predictions
  • The reference model's predictions

How to fix:
I need to add a reference-model flag to toggle reference model usage, and compute proper reference log probabilities when it is enabled, as sketched below.
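A minimal sketch of that fix, assuming an illustrative use_ref_model flag and a frozen reference model evaluated under torch.no_grad() on the same inputs (the toy model and the omitted next-token shift are for illustration only; the actual interface is the one adopted in #405):

import torch
import torch.nn as nn

@torch.no_grad()  # the reference model is frozen; no gradients are needed
def reference_logps(ref_model, input_ids, target_ids):
    # Evaluate the reference model on the SAME inputs the policy model sees.
    logits = ref_model(input_ids)                          # (batch, seq_len, vocab)
    logp = torch.log_softmax(logits, dim=-1)
    token_logps = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_logps.sum(dim=-1)                         # one log-prob per sequence

# Toy stand-in for a frozen reference LM: embedding -> vocab projection
vocab, hidden = 32, 8
ref_model = nn.Sequential(nn.Embedding(vocab, hidden), nn.Linear(hidden, vocab)).eval()

use_ref_model = True  # illustrative flag; when False the loss reduces to the reference-free (CPO-like) case
chosen_ids = torch.randint(0, vocab, (2, 6))
rejected_ids = torch.randint(0, vocab, (2, 6))
if use_ref_model:
    ref_chosen_logps = reference_logps(ref_model, chosen_ids, chosen_ids)
    ref_rejected_logps = reference_logps(ref_model, rejected_ids, rejected_ids)
else:
    ref_chosen_logps = ref_rejected_logps = torch.zeros(2)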

Thanks to @shivam15s, the correct implementation is already in PR #405. I'll close this PR to align with that approach.
