We Need a Better Training Pipeline: GRPO Trainer Struggles with Long Completion Lengths on H100x8 #65
When running Zero3 with Qwen-7B on an H100x8 setup with `max_completion_length=2048`, training cannot proceed. Given these limitations, it seems that a custom training pipeline may be necessary to address these challenges. You can reproduce with the following command:
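The exact launch command is not captured above. As a rough Python-level sketch of the configuration that hits this limit (assuming the `trl` `GRPOTrainer` API; the model id, dataset, and reward function below are placeholders, not taken from the actual run):

```python
# Sketch of a GRPO run that hits the memory wall; details are illustrative placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def dummy_reward(completions, **kwargs):
    # Placeholder reward: longer completions score higher.
    return [len(c) / 2048 for c in completions]


args = GRPOConfig(
    output_dir="qwen7b-grpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    gradient_checkpointing=True,
    num_generations=8,                  # GRPO samples several completions per prompt
    max_prompt_length=512,
    max_completion_length=2048,         # the setting that does not fit on H100x8
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder model id
    reward_funcs=dummy_reward,
    args=args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder dataset
)
trainer.train()  # launched under accelerate + DeepSpeed Zero3 across 8 GPUs
```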
Main Question:
How can we fit at least 12k context (as referenced in DeepSeek Fig. 3) into a large-scale online RL pipeline without sacrificing simplicity and convenience, given that this project is open-sourced and aims to remain easy to use?

Suggestions:
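For scale, here is a back-of-the-envelope estimate of the KV-cache footprint alone at the 12k-token target from the main question (a rough sketch assuming a Qwen2-7B-like architecture with 28 layers, 4 KV heads, and head dimension 128 in bf16; the numbers are illustrative):

```python
# Rough KV-cache sizing for one prompt group at ~12k tokens (illustrative numbers).
num_layers = 28        # Qwen2-7B-like
num_kv_heads = 4       # grouped-query attention
head_dim = 128
bytes_per_elem = 2     # bf16
seq_len = 12_288       # ~12k tokens of prompt + completion
num_generations = 8    # completions sampled per prompt in GRPO

kv_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
per_sequence = kv_per_token * seq_len
per_prompt_group = per_sequence * num_generations

print(f"KV cache per token:    {kv_per_token / 1024:.0f} KiB")    # ~56 KiB
print(f"KV cache per sequence: {per_sequence / 2**30:.2f} GiB")   # ~0.66 GiB
print(f"KV cache per prompt group (x{num_generations}): {per_prompt_group / 2**30:.2f} GiB")  # ~5.3 GiB
```

And this is only the generation-time cache; computing log-probabilities for the policy and reference model over the same 12k tokens adds activation memory on top, which Zero3 alone does not shard away.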
Issue Description:
The current implementation of `GRPOTrainer` in the open-r1 repository encounters significant scalability issues when generating long completions (e.g., `max_completion_length=2048`) on an H100x8 setup, even with DeepSpeed Zero3 enabled. Reducing `max_completion_length` to 512 only barely allows training to proceed, which is insufficient for open-r1's goal of incentivizing long Chain-of-Thought (CoT) reasoning.

This limitation is not resolved by increasing the number of nodes, as DeepSpeed Zero3 behaves more like data parallelism (DP) in this context, leading to inefficiencies. To address this, alternative approaches such as enabling tensor parallelism for `GRPOTrainer` (a rough sketch follows below) or leveraging a different infrastructure may be necessary.
For example, Kimi uses a pod-level large-scale RL infrastructure for similar reasons, showcasing the need for specialized setups to support such workloads.
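As one possible direction for the tensor-parallelism suggestion, the rollout phase could be served by a generation engine that already supports tensor parallelism, e.g. vLLM. A minimal sketch (the model id and sampling settings are assumptions; integrating this with the GRPO update step is the non-trivial part):

```python
# Sketch: tensor-parallel generation for the rollout phase via vLLM (TP across 8 H100s).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model id
    tensor_parallel_size=8,            # shard the model across the 8 GPUs
    gpu_memory_utilization=0.85,
)

sampling = SamplingParams(
    n=8,                 # completions per prompt, matching GRPO's group size
    temperature=1.0,
    max_tokens=2048,     # the long-CoT completion length that the current setup cannot fit
)

prompts = ["Solve: what is 17 * 23? Think step by step."]  # placeholder prompt
outputs = llm.generate(prompts, sampling)
for request_output in outputs:
    for completion in request_output.outputs:
        print(completion.text)
```

The policy update would still need its own memory strategy (Zero3 or tensor/pipeline parallelism), and the trainer's weights must be synced to the generation engine after each update, which is where most of the pipeline complexity lives.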