
We Need a Better Training Pipeline: GRPO Trainer Struggles with Long Completion Lengths on H100x8 #65

Open

SeungyounShin opened this issue Jan 27, 2025 · 2 comments
Issue Description:

The current implementation of GRPOTrainer in the open-r1 repository hits hard memory limits when generating long completions (e.g., max_completion_length=2048) on an H100x8 setup, even with DeepSpeed ZeRO-3 enabled. Reducing max_completion_length to 512 only barely allows training to proceed, which is insufficient for open-r1's goal of incentivizing long Chain-of-Thought (CoT) reasoning.

This limitation is not resolved by adding more nodes: DeepSpeed ZeRO-3 behaves like data parallelism (DP) in this context, sharding parameters, gradients, and optimizer states across ranks while each rank still materializes its own full-sequence activations and logits. Per-GPU memory for long completions therefore does not shrink as the world size grows. To address this, alternative approaches such as enabling tensor parallelism for GRPOTrainer or moving to a different infrastructure may be necessary.
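To make the bottleneck concrete, here is a rough back-of-envelope sketch of the logits memory a single rank needs for the GRPO loss. All the numbers are assumptions, not measurements from open-r1: a Qwen2.5-7B-style ~152k vocabulary, bf16 activations, 8 generations per prompt, and logits materialized for the full sequence in one pass.

```python
# Back-of-envelope logits memory for one rank in the GRPO loss.
# Assumptions (illustrative, not measured from open-r1): ~152k vocab,
# bf16 activations, 8 generations per prompt, full-sequence logits.
BYTES_BF16 = 2
vocab_size = 152_064              # assumed Qwen2.5-style vocab size
prompt_len, completion_len = 256, 2_048
num_generations = 8               # completions sampled per prompt
seq_len = prompt_len + completion_len

# Per-token log-probs require logits of shape
# (num_generations, seq_len, vocab_size) -- once for the policy and
# once more for the frozen reference model.
logits_bytes = num_generations * seq_len * vocab_size * BYTES_BF16
print(f"logits per model: {logits_bytes / 2**30:.1f} GiB")        # ~5.2 GiB
print(f"policy + reference: {2 * logits_bytes / 2**30:.1f} GiB")  # ~10.4 GiB

# ZeRO-3 shards parameters and optimizer state across ranks, but these
# activation tensors are replicated on every rank, so adding nodes does
# not reduce them.
```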

For example, Kimi uses a pod-level, large-scale RL infrastructure for similar reasons, which underscores the need for specialized setups to support such workloads.

@SeungyounShin commented Jan 27, 2025

When running ZeRO-3 with Qwen-7B on an H100x8 setup and max_completion_length set to 512, I observed the following issues:

  1. Accuracy Reward vs. Format Reward:
    The format reward, which requires <think>, </think>, <answer>, and </answer> in the generated text, remains at 0 even though the accuracy reward is non-zero. This suggests completions are being truncated at the length limit before the closing tags can be generated, so the format check never passes (see the sketch after this list).

  2. Lack of Tensor Parallelism Support in the Trainer:
    While transformers supports tensor parallelism and accelerate has a related PR, the current trainer setup does not support tensor-parallel training.

Given these limitations, a custom training pipeline might be necessary to address these challenges.
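For reference, here is a minimal sketch of the kind of format reward described in point 1, showing why truncation forces it to 0. This is illustrative only; the actual open-r1 reward function may differ.

```python
import re

# Illustrative format reward: 1.0 only if the completion contains the
# full <think>...</think><answer>...</answer> structure. This is a
# sketch of the check described above, not the open-r1 implementation.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if FORMAT_RE.search(completion) else 0.0

# A completion truncated at max_completion_length loses its closing
# tags, so the reward stays 0 even if the reasoning was on track:
truncated = "<think> Let x be the number of apples. Then 3x + 2 ="
print(format_reward(truncated))  # 0.0
print(format_reward("<think>reasoning</think> <answer>42</answer>"))  # 1.0
```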


You can reproduce with the following command:

accelerate launch \
  --config_file configs/zero3.yaml \
  src/open_r1/grpo.py \
  --output_dir DeepSeek-R1-Distill-Qwen-7B-GRPO \
  --model_name_or_path Seungyoun/Qwen2.5-7B-Open-R1-Distill \
  --dataset_name AI-MO/NuminaMath-TIR \
  --max_prompt_length 256 \
  --max_completion_length 512 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --logging_steps 2 \
  --bf16

@SeungyounShin commented Jan 27, 2025

Main Question

How can we fit at least a 12k context (as referenced in DeepSeek Fig. 3) into a large-scale online RL pipeline without sacrificing simplicity and convenience, given that this project is open source and aims to remain easy to use?

Suggestions

  1. Tensor Parallelism (TP) for Intra-Node (e.g., H100x8) and Data Parallelism (DP) for Multi-Node

    • Is this feasible with Hugging Face-related repositories? If anyone has insights, please share. (A device-mesh sketch of this layout follows this list.)
  2. Pipeline Parallelism (PP) for Sub-Groups (e.g., H100x4) and Data Parallelism (DP) for Multi-Node and Groups

    • For example, two H100x8 machines (16 GPUs) split into pipeline groups of 4 GPUs each would give 4 PP groups, i.e., 4 DP replicas.
    • Is this approach supported by Hugging Face-related repositories? If anyone knows, please reply.
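As a rough illustration of suggestion 1, here is a device-mesh layout sketch. It assumes PyTorch's torch.distributed.device_mesh API and a 2-node x H100x8 launch via torchrun; it only shows the process-group layout, not a working GRPO integration.

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Sketch of TP within each H100x8 node and DP across nodes.
# Assumes 16 ranks (2 nodes x 8 GPUs) launched with torchrun.
dist.init_process_group("nccl")
tp_size = 8                                  # one TP group per node
dp_size = dist.get_world_size() // tp_size   # one DP replica per node

mesh = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

# Each rank belongs to one DP group (gradient all-reduce / ZeRO sharding)
# and one TP group (sharding attention/MLP weights inside its node).
dp_group = mesh["dp"].get_group()
tp_group = mesh["tp"].get_group()
```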
