We Need a Better Training Pipeline: GRPO Trainer Struggles with Long Completion Lengths on H100x8 #65

SeungyounShin · 2025-01-27T03:12:32Z

Issue Description:

The current implementation of GRPOTrainer in the open-r1 repository encounters significant scalability issues when attempting to generate long completions (e.g., max_completion_length=2048) on an H100x8 setup, even with DeepSpeed Zero3 enabled. Reducing the max_completion_length to 512 only barely allows training to proceed, which is insufficient for open-r1's goal of incentivizing long Chain-of-Thought (CoT) reasoning.

This limitation is not resolved by increasing the number of nodes, as DeepSpeed Zero3 behaves more like data parallelism (DP) in this context, leading to inefficiencies. To address this, alternative approaches such as enabling tensor parallelism for GRPOTrainer or leveraging a different infrastructure may be necessary.

For example, Kimi uses a pod-level large-scale RL infrastructure for similar reasons, showcasing the need for specialized setups to support such workloads.

The text was updated successfully, but these errors were encountered:

SeungyounShin · 2025-01-27T05:47:30Z

When running zero3 with Qwen-7B on an H100x8 setup and max_completion_length set to 512, I observed the following issues:

Accuracy Reward vs. Format Reward:
The format reward, which requires <think>, </think>, <answer>, and </answer> in the generated text, remains at 0 even though the accuracy reward is non-zero. This suggests the model is constrained by the completion length limit.
Lack of Tensor Parallelism Support in the Trainer:
While transformers supports tensor parallelism and accelerate has a related PR, the current trainer setup does not support tensor parallel training.

Given these limitations, it seems that a custom training pipeline might be necessary to address these challenges.

You can reproduce with following comand :

accelerate launch \
  --config_file configs/zero3.yaml \
  src/open_r1/grpo.py \
  --output_dir DeepSeek-R1-Distill-Qwen-7B-GRPO \
  --model_name_or_path Seungyoun/Qwen2.5-7B-Open-R1-Distill \
  --dataset_name AI-MO/NuminaMath-TIR \
  --max_prompt_length 256 \
  --max_completion_length 512 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --logging_steps 2 \
  --bf16

SeungyounShin · 2025-01-27T06:05:16Z

Main Question

How can we fit at least 12k context (as referenced in DeepSeek Fig. 3) into a large-scale online RL pipeline without sacrificing simplicity and convenience, given that this project is open-sourced and aims to remain easy to use?

Suggestions

Tensor Parallelism (TP) for Intra-Node (e.g., H100x8) and Data Parallelism (DP) for Multi-Node
- Is this feasible with Hugging Face-related repositories? If anyone has insights, please share.
Pipeline Parallelism (PP) for Sub-Groups (e.g., H100x4) and Data Parallelism (DP) for Multi-Node and Groups
- For example, using two H100x8 machines with 4 PP groups would result in 4 DP groups.
- Is this approach supported by Hugging Face-related repositories? If anyone knows, please reply.

SeungyounShin changed the title ~~GRPOTrainer Fails to Scale for Long Completion Lengths on H100x8 Nodes with DeepSpeed Zero3~~ GRPO Trainer Struggles with Long Completion Lengths on H100x8 with DeepSpeed Zero3 Jan 27, 2025

SeungyounShin changed the title ~~GRPO Trainer Struggles with Long Completion Lengths on H100x8 with DeepSpeed Zero3~~ We Need a Better Training Pipeline: GRPO Trainer Struggles with Long Completion Lengths on H100x8 Jan 27, 2025

SeungyounShin mentioned this issue Jan 27, 2025

Add Optional ZeRO-3 Weight Gathering for GRPO in Sequence Generation huggingface/trl#2667

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

We Need a Better Training Pipeline: GRPO Trainer Struggles with Long Completion Lengths on H100x8 #65

We Need a Better Training Pipeline: GRPO Trainer Struggles with Long Completion Lengths on H100x8 #65

SeungyounShin commented Jan 27, 2025

SeungyounShin commented Jan 27, 2025 •

edited

Loading

SeungyounShin commented Jan 27, 2025 •

edited

Loading

We Need a Better Training Pipeline: GRPO Trainer Struggles with Long Completion Lengths on H100x8 #65

We Need a Better Training Pipeline: GRPO Trainer Struggles with Long Completion Lengths on H100x8 #65

Comments

SeungyounShin commented Jan 27, 2025

Issue Description:

SeungyounShin commented Jan 27, 2025 • edited Loading

SeungyounShin commented Jan 27, 2025 • edited Loading

Main Question

Suggestions

SeungyounShin commented Jan 27, 2025 •

edited

Loading

SeungyounShin commented Jan 27, 2025 •

edited

Loading