[RFC] Liger FlexChunkLoss: Alignment and Distillation loss #371

shivam15s · 2024-11-08T22:18:55Z

austin362667 · 2024-11-13T02:30:55Z

take DPO

hongpeng-guo · 2024-11-13T10:04:05Z

I can take fused linear kl div. BTW, really nice illustration on the chunk linear op fusion from the paper. Very clear to new contributors 😄

pramodith · 2024-11-13T11:46:34Z

@shivam15s @ByronHsu I think we should also consider including some of the loss functions commonly used for training embedding models, especially the popular ones supported in Sentence transformers.

It's quite common for embedding models to require large batch sizes to be trained well. Coupled with the fact that their batch/input structure is kind of similar to RLHF where we have positive and negative pairs, I believe that this can prove to be useful. I'd recommend supporting CoSENTLoss, MatryokshaLoss and TripleLoss for starters https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss. Perhaps this can be its own roadmap separate to this one although the idea of chunking and fusing remains the same.

ByronHsu · 2024-11-13T17:59:07Z

@pramodith that is a good idea! do you know if the models in embedding also has large vocab and suffer from memory bottleneck?

pramodith · 2024-11-13T18:48:36Z

@ByronHsu most embedding models have a final Linear layer of shape (hidden_dim, hidden_dim), so vocab size doesn't really come into the picture for them so you're right to point it out, but it is common to have an effective batch size of 65k

ByronHsu · 2024-11-13T22:16:39Z

Then i think chunk loss is still helpful given the large batch size

pramodith · 2024-11-13T22:22:04Z

Then i think chunk loss is still helpful given the large batch size

Yes, I think so too. I can give this a try after we wrap up all the important RLHF and distillation losses. I'll also get Tom Aarsen's perspective since he's the lead of Sentence Transformers.

@shivam15s

## Summary Add support for a fused, torch-compiled, and chunked DPO ([Direct Preference Optimization](https://arxiv.org/html/2305.18290v3)) loss kernel, as requested in #371. This implementation is largely based on the excellent work done on ORPO (#362) by @shivam15s. ### DPO Loss Formulation In a reference setting (not reference free): $$r_\theta(x,y_c) - r_\theta(x,y_r) = \log(\pi_\theta(y_c|x)) - \log(\pi_\theta(y_r|x))$$ $$-\log(\sigma((\log(\pi_\theta(y_c|x)) - \log(\pi_\theta(y_r|x)) - \log(\pi_{\theta_{\text{ref}}}(y_c|x)) + \log(\pi_{\theta_{\text{ref}}}(y_r|x)))/\beta))$$ Corresponds to: ```python # Policy model log probabilities policy_chosen_logps = log_probs(policy_chosen_logits) policy_rejected_logps = log_probs(policy_rejected_logits) # Reference model log probabilities ref_chosen_logps = log_probs(ref_chosen_logits) ref_rejected_logps = log_probs(ref_rejected_logits) # Compute advantages chosen_advantages = policy_chosen_logps - ref_chosen_logps rejected_advantages = policy_rejected_logps - ref_rejected_logps # DPO loss logits_diff = (chosen_advantages - rejected_advantages) / beta losses = -F.logsigmoid(logits_diff) ``` In this PR: 1. The above mathematical equation shows that to maximize the reward difference, we get formula: $$r_θ(x_c) - r_θ(x_r)$$ 2. This can be further optimized using just: $$-log(σ((π_θ(x_c) - π_θ(x_r))/β))$$ 3. So, the code implements: ```python logits_diff = (chosen_logps - rejected_logps) / beta # (π_θ(x_c) - π_θ(x_r))/β losses = -F.logsigmoid(logits_diff) # -log(σ(logits_diff)) ``` 4. Sum up DPO and NLL: $$L_{DPO+NLL} = L_{DPO}+αL_{NLL}$$ ## Testing Done ![dpo_loss_memory](https://github.com/user-attachments/assets/d48965a2-bab7-4a81-9872-a43826106731) ![dpo_loss_speed](https://github.com/user-attachments/assets/10ab33c3-a905-435f-886b-67c911b8fff6) - Hardware Type: **NVIDIA L40S (48G)** - [X] run `make test` to ensure correctness - [X] run `make checkstyle` to ensure code style - [X] run `make test-convergence` to ensure convergence --------- Signed-off-by: Austin Liu <[email protected]> Co-authored-by: shivam15s <[email protected]>

pramodith · 2024-11-15T11:46:03Z

#take Simpo and Irpo since they are just extensions of CPO.

vulkomilev · 2024-11-19T18:02:53Z

I will #take KTO as the next

vulkomilev · 2024-11-20T22:09:18Z

A little update on kto I am working now on the tests

ByronHsu · 2024-11-22T17:31:11Z

@Chillee FYI We are working on a set of post-training losses based on your compiled chunked loss implementation for CE. Thanks for the reference!

vulkomilev · 2024-11-23T15:13:21Z

Update on KTO loss I am done with the loss but I have problem with assertions.I am working on it.

hongpeng-guo · 2024-11-25T06:05:46Z

I was following this thread and working on a chunked, fused linear KL-divergence implementation for distillation use cases. Since distillation losses differ from preference losses, introducing a LigerFusedLinearDistillationBase parent class could be helpful.

In general, the distillation pipeline involves three key inputs: teacher_logits, student_logits, and ground_truth_label. The first two inputs are used to calculate the soft loss (KL divergence), while the latter two are used to compute the hard loss (cross-entropy). The final distillation loss is typically a weighted sum of these two components.

To leverage chunked, linear-fused optimizations, we could design the solution to accept inputs as teacher_tensor (BT, hidden_dim_teacher), student_tensor (BT, hidden_dim_student), and true_label (BT,). Using these inputs, we can apply the chunked, linear-fused approach to efficiently compute both the KL-divergence loss and the cross-entropy loss.

cc @ByronHsu, @shivam15s, @pramodith: What are your thoughts on this? Do you think it makes sense to include the cross-entropy loss as part of the DistillationBase class? Thanks for your feedback!

pramodith · 2024-11-25T10:19:51Z

@hongpeng-guo yes! I like your approach it's cleaner to create a new Base class for distillation losses, we're kind of doing the same for the Alignment losses to by computing the nll (cross-entropy loss of the accepted responses inside the Base class.)

Liger-Kernel/src/liger_kernel/chunked_loss/fused_linear_preference.py

Line 37 in 7e3683e

if compute_nll_loss:

.

ByronHsu · 2024-11-25T17:40:11Z

+1 on @hongpeng-guo proposal. @shivam15s can help polish the base class

shivam15s · 2024-11-25T19:40:19Z

Sounds good @hongpeng-guo, a separate base class for distillation is absolutely needed!

vulkomilev · 2024-11-28T20:59:29Z

please review and comment my PR on KTO here #410

vulkomilev · 2024-12-04T22:31:41Z

there is an update about #410

ccdv-ai · 2024-12-08T01:25:59Z

Is CPO-SimPO planned? This can be implemented in SimPO.

Reference: https://github.com/fe1ixxu/CPO_SIMPO

Quote

CPO and SimPO share similar objectives but have different goals. CPO adds a BC-regularizer to prevent the model from deviating too much from the preferred data distribution.

$L_{CPO}(\pi_\theta;U) = -E_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \pi_{\theta}(y_w | x) - \beta \log \pi_{\theta}(y_l | x) \Big) + \log \pi_\theta(y_w| x)\Big]$

SimPO incorporates length normalization and target reward margin to improve model performance and prevent the generation of long but low-quality sequences:

$L_{SimPO}(\pi_\theta;U) = -E_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \frac{\beta}{|y_w|} \log \pi_{\theta}(y_w | x) - \frac{\beta}{|y_l|} \log \pi_{\theta}(y_l | x) - \gamma \Big) \Big]$

These two objectives can be jointly used, which we call CPO-SimPO:

$L_{CPO-SimPO}(\pi_\theta;U) = -E_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \frac{\beta}{|y_w|} \log \pi_{\theta}(y_w | x) - \frac{\beta}{|y_l|} \log \pi_{\theta}(y_l | x) - \gamma \Big)+ \alpha \log \pi_\theta(y_w| x)\Big]$

pramodith · 2024-12-08T15:25:46Z

@ccdv-ai I think this can be done via the existing set of hyperparams of setting compute_nll_loss=True and alpha for the BC regularizer and right now all our alignment loss functions do assume length normalization

Liger-Kernel/src/liger_kernel/chunked_loss/fused_linear_preference.py

Line 51 in bd65c47

average_log_prob = (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)

@Tcc0403

## Summary Made #417 from the main repo. Thanks to the nice suggestions from @Tcc0403 and @hongpeng-guo. This PR is the s first split from #408, focusing solely on introducing the Knowledge Distillation base class. As a result, this PR does not include any tests at the moment. #### Code Changes 1. Refactor `beta` into two weights: `weight_hard_loss` and `weight_soft_loss`, as coefficients between `hard_loss` and `soft_loss`. @Tcc0403 also pointed out that we could use `torch.lerp` if applicable. 2. Pass `teacher_logits` and `student_logits` directly to the divergence loss function. This avoids redundant computations of converting logits to log probabilities and then reverting them to raw logits. However note that we are not reusing the `student_log_probs` value calculated during `ce_loss` in distillation base. 1. Remove the unnecessary `get_batch_logps` in `test/utils.py`. 3. Modify `chunking` dimensions from `B` to `B * T`. Thanks to @hongpeng-guo's great advice. 1. Fix the loss calculation to use per-token values instead of averaging across the sequence length dimension. 4. Normalize the `distillation_loss` using `(full_target != ignore_index).sum()`. #### TODO 1. [X] Although a slightly slowdown is reasonable, we need to investigate why this PR's implementation is **significantly slower** compared to the naive approach. Thanks to @Tcc0403 's clarification. The issue arises because we are not properly configuring the `chunk_size` for the `B * T` dimension, which is extremely large (a few thousand). The previous default of 1 results in an excessive number of chunks. In contrast, this problem does not occur with the preference loss, as chunking is performed on the `B` dimension. This produces fewer than 10 chunks, which is efficient and works as expected. In conclusion, I set `chunk_size` to `1024` works pretty well in new benchmark results as shown in #425 2. [ ] #417 (comment) #### Knowledge Distillation Knowledge Distillation (KD; [Hinton et al. 2015](https://arxiv.org/abs/1503.02531), [Gou et al. 2020](https://arxiv.org/abs/2006.05525)) is a straightforward way to build a smaller, cheaper model (“student model”) to speed up inference by transferring skills from a pre-trained expensive model (“teacher model”) into the student. In knowledge distillation, a student model is trained to replicate the outputs of a teacher model using a distillation loss. Neural networks typically include a softmax layer; for instance, a large language model produces a probability distribution over tokens. Let `z_t` and `z_s` represent the logits before the softmax layer for the teacher and student models, respectively. The distillation loss reduces the discrepancy between the two softmax outputs at a high temperature `T`. When ground truth labels `y` are available, this approach can be combined with a supervised learning objective, such as cross-entropy, to compare the student’s outputs with the ground truth. The combined loss function is defined as: ```math \mathcal{L}_{\text{knowledge distillation}} = \mathcal{w}_{\text{soft}} \cdot \mathcal{L}_{\text{distill}}(\mathbf{z_t}, \mathbf{z_s}, T) + \mathcal{w}_{\text{hard}} \cdot \mathcal{L}_{\text{cross entropy}}(\mathbf{y}, \mathbf{z_s}), ``` Here, we directly pass in `logits` rather than `logpbs`. @Tcc0403 #### Shared `DistillationBase` To support various distillation learning objectives, this PR aims to add a `LigerFusedLinearDistillationBase` which is basically same as propose by @hongpeng-guo within this discussion #371 (comment). Thank you @hongpeng-guo for thinking through this. ## Testing Done I'll post JSD tests and benchmarks results in next PR: #425 - Hardware Type: L40S - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence --------- Signed-off-by: Austin Liu <[email protected]> Co-authored-by: shivam15s <[email protected]>

vulkomilev · 2024-12-11T21:24:30Z

there is an update about KTO on #410

ByronHsu assigned shivam15s Nov 8, 2024

ByronHsu mentioned this issue Nov 13, 2024

[Liger Kernel] Collab on efficient alignemnt kernels turbo-llm/turbo-alignment#54

Open

austin362667 mentioned this issue Nov 13, 2024

Support Chunked DPO Loss Kernel #378

Merged

3 tasks

ByronHsu pinned this issue Nov 15, 2024

ByronHsu changed the title ~~[RFC] Liger FlexChunkLoss: Supporting various alignment and distillation loss functions~~ [RFC] Liger FlexChunkLoss: Alignment and Distillation loss Nov 15, 2024

ByronHsu mentioned this issue Nov 15, 2024

2024 Q4 Roadmap #285

Open

ByronHsu mentioned this issue Nov 21, 2024

add nn.module support for chunked loss function #402

Merged

3 tasks

austin362667 mentioned this issue Nov 27, 2024

Introduce Distillation with a Chunked, Fused Linear JS-divergence Loss #408

Closed

5 tasks

austin362667 mentioned this issue Dec 2, 2024

Introduce Knowledge Distillation Base #417

Closed

4 tasks

ccdv-ai mentioned this issue Dec 6, 2024

Support ORPO/DPO Liger losses (and LigerORPOTrainer) axolotl-ai-cloud/axolotl#2141

Open

5 tasks

austin362667 mentioned this issue Dec 7, 2024

Introduce Knowledge Distillation Base #432

Merged

5 tasks

hebiao064 mentioned this issue Dec 13, 2024

Add KTO Loss #475

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Liger FlexChunkLoss: Alignment and Distillation loss #371

[RFC] Liger FlexChunkLoss: Alignment and Distillation loss #371

shivam15s commented Nov 8, 2024 •

edited

Loading

austin362667 commented Nov 13, 2024

hongpeng-guo commented Nov 13, 2024

pramodith commented Nov 13, 2024 •

edited

Loading

ByronHsu commented Nov 13, 2024

pramodith commented Nov 13, 2024

ByronHsu commented Nov 13, 2024

pramodith commented Nov 13, 2024

pramodith commented Nov 15, 2024

vulkomilev commented Nov 19, 2024

vulkomilev commented Nov 20, 2024

ByronHsu commented Nov 22, 2024 •

edited

Loading

vulkomilev commented Nov 23, 2024

hongpeng-guo commented Nov 25, 2024 •

edited

Loading

pramodith commented Nov 25, 2024

ByronHsu commented Nov 25, 2024

shivam15s commented Nov 25, 2024

vulkomilev commented Nov 28, 2024

vulkomilev commented Dec 4, 2024

ccdv-ai commented Dec 8, 2024

pramodith commented Dec 8, 2024

vulkomilev commented Dec 11, 2024

[RFC] Liger FlexChunkLoss: Alignment and Distillation loss #371

[RFC] Liger FlexChunkLoss: Alignment and Distillation loss #371

Comments

shivam15s commented Nov 8, 2024 • edited Loading

🚀 The feature, motivation and pitch

Progress

Alignment

Distillation

Design

Approach Overview:

Key Findings

Interface

Alternatives

Additional context

austin362667 commented Nov 13, 2024

hongpeng-guo commented Nov 13, 2024

pramodith commented Nov 13, 2024 • edited Loading

ByronHsu commented Nov 13, 2024

pramodith commented Nov 13, 2024

ByronHsu commented Nov 13, 2024

pramodith commented Nov 13, 2024

pramodith commented Nov 15, 2024

vulkomilev commented Nov 19, 2024

vulkomilev commented Nov 20, 2024

ByronHsu commented Nov 22, 2024 • edited Loading

vulkomilev commented Nov 23, 2024

hongpeng-guo commented Nov 25, 2024 • edited Loading

pramodith commented Nov 25, 2024

ByronHsu commented Nov 25, 2024

shivam15s commented Nov 25, 2024

vulkomilev commented Nov 28, 2024

vulkomilev commented Dec 4, 2024

ccdv-ai commented Dec 8, 2024

Quote

pramodith commented Dec 8, 2024

vulkomilev commented Dec 11, 2024

shivam15s commented Nov 8, 2024 •

edited

Loading

pramodith commented Nov 13, 2024 •

edited

Loading

ByronHsu commented Nov 22, 2024 •

edited

Loading

hongpeng-guo commented Nov 25, 2024 •

edited

Loading