🏎️ vllm for Online DPO #2558

qgallouedec · 2025-01-10T17:06:54Z

What does this PR do?

Use vLLM for generation. 2.2x faster 🚀🚀

Demo:

from datasets import load_dataset
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO-vllm", logging_steps=10, use_vllm=True, gradient_accumulation_steps=8)
trainer = OnlineDPOTrainer(
    model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)

trainer.train()

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2025-01-10T18:35:01Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2025-01-11T00:11:01Z

We get different results with vLLM, probably linked to the sampling parameters. Investigating.

qgallouedec · 2025-01-11T22:08:57Z

CI fails because the latest transformers version, released yesterday, uses a Python 3.10+ syntax (timeout: float | None = None). I'm not sure why it fails only for the CLI test, but I think we can safely ignore it.

qgallouedec · 2025-01-12T14:25:36Z

trl/trainer/online_dpo_config.py

+        max_length (`int`, *optional*, defaults to `256`):
+            Maximum total length of the sequence (prompt + completion) used to compute log probabilities. If the
+            sequence exceeds this limit, the leftmost tokens will be truncated to preserve as much of the completion as
+            possible.


To avoid OOM for long prompts
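The truncation rule described in the docstring above can be sketched in a few lines. This is only an illustration of "drop the leftmost tokens first": `truncate_left` is a hypothetical helper, not a function from TRL, and it works on plain lists of token ids rather than padded tensors.

```python
def truncate_left(prompt_ids, completion_ids, max_length):
    """Concatenate prompt and completion, then keep only the last `max_length` tokens.

    Hypothetical helper for illustration; dropping from the left removes prompt
    tokens first, so as much of the completion as possible is preserved.
    """
    if max_length <= 0:
        raise ValueError("max_length must be a positive integer")
    sequence = list(prompt_ids) + list(completion_ids)
    if len(sequence) <= max_length:
        return sequence  # nothing to truncate
    return sequence[-max_length:]  # keep the rightmost tokens


# A 4-token prompt and a 2-token completion truncated to 4 tokens in total:
# the two leftmost prompt tokens are dropped, the completion is kept whole.
print(truncate_left([1, 2, 3, 4], [5, 6], max_length=4))
```

Capping the total length this way bounds the size of the forward pass used to compute log probabilities, which is what keeps long prompts from exhausting memory.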

qgallouedec · 2025-01-12T14:27:16Z

trl/trainer/online_dpo_trainer.py

+            # However, at this stage, the optimizer's weights are not yet loaded onto the GPU; they will be loaded
+            # after the first optimizer step and remain in GPU memory throughout training. So we must reserve enough
+            # space for them. Setting gpu_memory_utilization to 0.6 seems to work well in practice.
+            self.llm = LLM(model=model.name_or_path, gpu_memory_utilization=0.55, dtype=torch.float32)


dtype and gpu_memory_utilization are hardcoded for now, but we can expose them as arguments in the future.
See here why using torch.float32 is important.
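One way these two values could later be exposed as arguments is sketched below. Everything here is an assumption made for illustration: `VLLMSettings` and `to_llm_kwargs` are hypothetical names that do not exist in TRL, and the defaults simply mirror the values hardcoded in the diff above.

```python
from dataclasses import dataclass


@dataclass
class VLLMSettings:
    """Hypothetical container for the two values hardcoded in this PR."""

    gpu_memory_utilization: float = 0.55  # fraction of GPU memory vLLM may reserve
    dtype: str = "float32"                # precision of the generation model

    def __post_init__(self):
        # vLLM expects a fraction of the device memory, so reject anything else early.
        if not 0.0 < self.gpu_memory_utilization <= 1.0:
            raise ValueError("gpu_memory_utilization must be in (0, 1]")

    def to_llm_kwargs(self, model_name):
        """Build the keyword arguments that would be passed to vllm.LLM."""
        return {
            "model": model_name,
            "gpu_memory_utilization": self.gpu_memory_utilization,
            "dtype": self.dtype,
        }


settings = VLLMSettings()
kwargs = settings.to_llm_kwargs("Qwen/Qwen2-0.5B-Instruct")
print(kwargs)
# With vLLM installed and a GPU available, the engine would then be created with:
# from vllm import LLM
# llm = LLM(**kwargs)
```

Keeping the validation in one place means a bad memory fraction fails before the engine tries to allocate GPU memory.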

qgallouedec · 2025-01-12T14:34:47Z

Surprisingly, the precision of the generator model seems to have a pretty high impact on the results:

When you keep the default precision (bfloat16), the results seem to be significantly worse. It may be related to higher noise in generation. Varying the temperature might help confirm this intuition. In the meantime, I've hard-coded float32 as the model precision. It slows things down a bit (you have less space for the KV cache), but it's still much faster than without vLLM.
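To give a feel for how coarse bfloat16 is, the sketch below rounds a few logits to bfloat16 using only the standard library and compares the resulting softmax distributions. It does not confirm the cause of the degraded results discussed above; the logit values are made up, and `to_bfloat16` is an illustrative helper that ignores NaN and infinity handling.

```python
import math
import struct


def to_bfloat16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 is the upper 16 bits of an IEEE float32, so we round away
    the lower 16 bits. Illustrative only: NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # round half to even
    bits = ((bits + bias) & 0xFFFFFFFF) >> 16 << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.013, 2.011, 1.5]  # made-up values for illustration
rounded = [to_bfloat16(v) for v in logits]

print("float logits:   ", logits)
print("bfloat16 logits:", rounded)
print("softmax (float):   ", softmax(logits))
print("softmax (bfloat16):", softmax(rounded))
```

Near 2.0 the spacing between adjacent bfloat16 values is 2^-6, so the first two logits collapse to the same value and their ordering is lost. Whether effects of this size explain the gap observed in training is exactly what the temperature experiment suggested above would need to test.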

edbeeching · 2025-01-16T10:06:41Z

Awesome addition. If I understand correctly there are now two instances of the model, so this will only work with DDP and small models where memory capacity is not an issue?

qgallouedec · 2025-01-16T10:45:35Z

> there are now two instances of the model

Correct: one trained model and one generation model. Note that the weights of these two models are always the same. I wonder if we could share them instead of duplicating them? @hmellor any idea?

> small models where memory capacity is not an issue

Yes, parallelism (not supported yet) should make it possible to relax this constraint.

hmellor · 2025-01-16T11:28:51Z

vllm-project/vllm#10353 is probably the closest we can currently get to directly accessing vLLM's model, but this PR is expected to be superseded by work done in the V1 engine re-architecture that is ongoing (planned beta-release any day now, on by default by end of January).

vllm-project/vllm#12084 introduces some RLHF features to vLLM, but it follows the OpenRLHF model where training and inference processes live on different GPUs.

qgallouedec added 7 commits January 10, 2025 15:11

vllm online dpo

56f7865

new arg and add back generation config [skip ci]

c667990

import utils

cbe8083

optional import and comment

d4c16d5

is_vllm_available

cc9bd44

support conv and not conv [ci skip]

e0840ae

add old code back

99982e1

qgallouedec added 4 commits January 10, 2025 19:01

use func [skip ci]

04d595a

fix _generate call

c4af44f

fix and dedicated func

481006d

top k 50

444f29d

qgallouedec added 2 commits January 11, 2025 17:02

style

e88bcab

add import error

3081ec0

qgallouedec mentioned this pull request Jan 11, 2025

🏛️ Improve DPO configuration documentation structure #2561

Merged

5 tasks

qgallouedec added 5 commits January 12, 2025 13:53

new testing model

bde6fc3

Update OnlineDPOTrainer class with new features

385f81d

test vllm

3586d68

fix generate tiny script

dc6a024

max len arg

c4dee90

qgallouedec commented Jan 12, 2025

qgallouedec and others added 2 commits January 12, 2025 14:36

fix comment [ci skip]

637e653

Merge branch 'main' into vllm-onlinedpo

5da5d0b

qgallouedec marked this pull request as ready for review January 12, 2025 14:37

qgallouedec requested review from lewtun and removed request for lewtun January 12, 2025 14:37

qgallouedec requested review from kashif, lewtun and August-murr January 12, 2025 14:37

kashif approved these changes Jan 12, 2025

qgallouedec and others added 11 commits January 12, 2025 14:52

revert num_return_sequences

5c5fd5d

Merge branch 'vllm-onlinedpo' of https://github.com/huggingface/trl into vllm-onlinedpo

6169984

vllm dep

4c191ff

Add require_torch_accelerator import and skip test if vllm is not available

6af5340

proper require_torch_accelerator

b39f54a

Merge branch 'main' into vllm-onlinedpo

ac5e31f

add vllm section

e2fc58f

Merge branch 'vllm-onlinedpo' of https://github.com/huggingface/trl into vllm-onlinedpo

b7112fb

Add hfoption sections to speeding_up_training.md

aeed6c5

no, an id

ed2fd05

Update vllm dependency to exclude Windows platform

4cc94d6

Note on future release

28af8af
