LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!

🚀 Updates

📆[2024-12-03] 🎈 We scaled the training data to 8.4B tokens and released the new MLP-MoE (8top2) model. The new model achieves 59.6 on GSM8K and 57.1 on HumanEval.

🎉 Introduction

LLaMA-MoE-v2 is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA3. We build LLaMA-MoE-v2 in two steps:

Partition LLaMA's FFN layers or attention layers into sparse experts and insert a top-K gate for each layer of experts.
Fine-tune the constructed MoE models on open-source data with two-stage supervised fine-tuning (SFT).
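
To make the top-K gate in the first step concrete, here is a minimal pure-Python sketch of routing a single token. It is illustrative only: the function names `top_k_gate` and `moe_forward` are hypothetical, and renormalizing the softmax over the selected experts is an assumed convention, not something this README confirms for LLaMA-MoE-v2.

```python
import math


def top_k_gate(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (indices, weights). The softmax is taken over the selected
    logits only, so the weights sum to 1 (an assumed convention).
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    indices, weights = top_k_gate(router_logits, k)
    out = [0.0] * len(x)
    for idx, w in zip(indices, weights):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Toy setup: 8 "experts", each scaling its input by a different factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 1.0]

indices, weights = top_k_gate(logits, k=2)
print(indices, weights)
print(moe_forward([1.0, 2.0], experts, logits, k=2))
```

Only 2 of the 8 toy experts run per token, which is where the compute saving of a sparse model comes from.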

🔥 Features

Support building Attention MoE and MLP MoE:
1. build Attention MoE models with attention layers
2. build MLP MoE models with MLP layers
Multiple Expert Construction Methods:
1. random MLP MoE construction (vanilla)
2. residual MLP MoE construction (residual)
Packed Padding Training
Support training with megablocks
Two-stage & Open-source data for SFT:
First stage
Second stage
- Infinity-Instruct
- MetaMathQA
Support building MoE for different models:
- Llama3-8B
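
This README does not describe how packed padding training is implemented. One common reading is that several short examples are packed into one fixed-length row so that fewer padding tokens are wasted. The sketch below shows that idea with a greedy first-fit-decreasing packer; `pack_sequences` and `padding_fraction` are hypothetical helpers, not functions from this repository.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing.

    lengths: token counts of individual training examples.
    Returns a list of bins; each bin holds example indices whose total
    length is at most max_len.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, free_space = [], []
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"example {i} has {n} tokens, more than max_len={max_len}")
        for b, free in enumerate(free_space):
            if n <= free:  # first existing row with enough room
                bins[b].append(i)
                free_space[b] -= n
                break
        else:  # no row had room: open a new one
            bins.append([i])
            free_space.append(max_len - n)
    return bins


def padding_fraction(bins, lengths, max_len):
    """Share of token slots that would be filled with padding."""
    used = sum(lengths[i] for row in bins for i in row)
    return 1.0 - used / (len(bins) * max_len)


lengths = [1800, 300, 1200, 700, 2048, 96]
bins = pack_sequences(lengths, max_len=2048)
print(bins)
print(padding_fraction(bins, lengths, max_len=2048))
```

A real implementation also has to keep packed examples from attending to each other, for instance with per-example attention masks. That part is outside this sketch.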

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "LLaMA-MoE-v2/LLaMA-MoE-v2-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous for?"

input_text = f"<|start_header_id|>user<|end_header_id|>\n\n{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

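# Greedy decoding: use do_sample=False, since temperature=0.0 is rejected when sampling is enabled.
pred = model.generate(**inputs, max_length=50, do_sample=False)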
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
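
The prompt string above follows the Llama 3 header format written out by hand. For multi-turn use, a small helper can build the same string from a list of messages. This is a sketch: `build_llama3_prompt` is a hypothetical helper, not part of this repository, and it only reproduces the format shown in the QuickStart. If the released tokenizer ships a chat template, `tokenizer.apply_chat_template` may be the better choice.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Format chat messages the way the QuickStart prompt is written.

    messages: list of {"role": ..., "content": ...} dicts.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = build_llama3_prompt([{"role": "user", "content": "Suzhou is famous for?"}])
print(prompt)
```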

⚙️ Installation

Prepare the conda environment: conda create -n smoe python=3.11 (If your environment name is not smoe, you may need to change the environment name in the launch scripts.)

Add the correct environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:

export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH

Apply the variables: source ~/.bashrc
Install PyTorch (CUDA-11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Install dependencies: pip install -r requirements.txt
Install flash-attn: pip install flash-attn==2.6.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid some errors.
Install the latest Git: conda install git
Clone the repo: git clone git@github.com:LLaMA-MoE/LLaMA-MoE-v2.git (If you have not set up an SSH key for GitHub, you may not be able to clone through SSH. Check the docs about it.)
Change current directory: cd LLaMA-MoE-v2
Install smoe in editable mode: pip install -e .[dev]
Set up pre-commit hooks: pre-commit install
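
After these steps, a quick check can confirm that the main packages are importable before launching any job. This is a minimal sketch using only the standard library. The module list is an assumption based on the steps above (`flash_attn` is the import name of flash-attn, and `smoe` is assumed to be the import name of the editable install).

```python
import importlib.util
import sys

# Assumed import names for the packages installed above.
REQUIRED = ["torch", "transformers", "flash_attn", "smoe"]


def missing_packages(names):
    """Return the top-level modules that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment(min_python=(3, 10)):
    """Collect human-readable problems; an empty list means everything was found."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ expected, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    for name in missing_packages(REQUIRED):
        problems.append(f"package not importable: {name}")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```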

📊 Model Performance

Model	#Activated Experts	#Experts	#Activated Params	SFT Model
LLaMA-MLP-MoE (2/8)	2	8	3.8B	🤗 SFT
LLaMA-MLP-MoE (1+1/7)	2	8	3.8B	🤗 SFT
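
The 3.8B figure can be roughly reproduced with a back-of-the-envelope count. The sketch below uses the published Llama-3-8B shape (hidden size 4096, intermediate size 14336, 32 layers, 32 attention heads, 8 key/value heads, vocabulary 128256, untied embeddings) and assumes that the FFN is split into 8 equal experts with 2 active per token, and that each layer has a bias-free linear gate. The gate shape and the equal split are assumptions, not details confirmed by this README.

```python
# Published Llama-3-8B shape.
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
HEADS = 32
KV_HEADS = 8
VOCAB = 128256

# MoE setting from the table above: 2 of 8 experts active per token.
NUM_EXPERTS = 8
ACTIVE_EXPERTS = 2

head_dim = HIDDEN // HEADS

# Attention: Q and O are HIDDEN x HIDDEN; K and V are smaller (grouped-query attention).
attention = LAYERS * (2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * head_dim)
# FFN: gate, up, and down projections.
ffn = LAYERS * 3 * HIDDEN * INTERMEDIATE
# Input embeddings plus an untied output head.
embeddings = 2 * VOCAB * HIDDEN
# Two RMSNorms per layer plus the final norm.
norms = (2 * LAYERS + 1) * HIDDEN
# Assumed router: one HIDDEN x NUM_EXPERTS linear map per layer.
gates = LAYERS * HIDDEN * NUM_EXPERTS

dense_total = attention + ffn + embeddings + norms
activated = attention + ffn * ACTIVE_EXPERTS // NUM_EXPERTS + embeddings + norms + gates

print(f"dense:     {dense_total / 1e9:.2f}B")   # 8.03B
print(f"activated: {activated / 1e9:.2f}B")     # 3.80B
```

Under the same assumptions, the residual variant (one shared expert plus one routed expert) activates the same number of expert slices, which is consistent with both rows reporting 3.8B.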

Model	#Training Tokens	MMLU(5)	GSM8k(8)	HumanEval(pass@10)	IFEval	BoolQ(32)	SciQ	PIQA	ARC-c(25)	TruthfulQA	HellaSwag(10)
LLaMA3-8B	15T	67.2	76.5	71.4	76.5	83.0	93.2	78.5	61.9	51.7	78.8
INCITE-3B	1T	25.1	2.1	6.92	30.1	66.5	94.7	74.4	40.2	36.4	65.6
Sheared-LLaMA-2.7B	50B	28.2	1.9	3.2	28.8	67.6	75.8	41.1	47.6	71.2	39.0
Gemma-2-2b	2T	53.0	26.3	46.1	34.9	72.3	75.8	67.5	52.6	50.8	69.0
Salamandra-2b	7.8T	25.1	1.90	5.82	27.7	68.0	89.8	74.7	46.3	43.4	62.3
SmolLM2-1.7B	11T	50.4	38.5	39.1	29.0	68.2	84.3	76.0	53.2	39.9	72.6
OpenMoE-3B-9B	1T	26.5	1.36	1.01	31.2	61.7	68.4	65.7	33.3	40.5	56.5
LLaMA-MoE-3B-7B	200B	28.2	4.62	12.0	28.1	68.1	88.8	77.9	44.0	33.3	73.2
OLMoE-1B-7B	1T	53.8	40.9	40.5	35.5	80.9	94.9	80.1	55.6	43.3	79.6
MLP-MoE (8top2)	7B	40.6	53.1	53.5	32.7	74.6	90.6	69.3	42.8	45.6	59.0
MLP-MoE (8top2)	8.4B	41.0	59.6	57.1	31.7	74.5	90.2	69.5	43.3	46.9	58.1
MLP-MoE (1+7top1)	7B	42.7	55.0	51.2	36.0	76.9	88.8	67.9	40.2	46.9	53.7

🚧 Expert Construction for MLP MoE

Vanilla LLaMA-MoE-v2: sbatch scripts/expert_construction/convert/convert_mixtral_v2.sh
Residual LLaMA-MoE-v2: sbatch scripts/expert_construction/convert/convert_mixtral_residual_v2.sh

For more information, please refer to Expert Construction docs.
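
As a rough picture of what the two construction modes do, the sketch below partitions the indices of an FFN's intermediate neurons. It is a simplified illustration, not the conversion script: `random_partition` and `residual_partition` are hypothetical names, and how the residual (shared) expert's neurons are actually chosen is not described here, so the sketch simply takes the first random group.

```python
import random


def random_partition(intermediate_size, num_experts, seed=0):
    """Vanilla: shuffle neuron indices and split them into equal groups."""
    if intermediate_size % num_experts != 0:
        raise ValueError("intermediate_size must be divisible by num_experts")
    rng = random.Random(seed)
    indices = list(range(intermediate_size))
    rng.shuffle(indices)
    size = intermediate_size // num_experts
    return [sorted(indices[e * size:(e + 1) * size]) for e in range(num_experts)]


def residual_partition(intermediate_size, num_experts, seed=0):
    """Residual: one always-active shared expert plus routed experts.

    Assumption: the shared expert is just the first random group.
    """
    groups = random_partition(intermediate_size, num_experts, seed)
    return groups[0], groups[1:]


# Llama3-8B has 14336 intermediate neurons per FFN layer.
groups = random_partition(14336, num_experts=8)
shared, routed = residual_partition(14336, num_experts=8)
print(len(groups), len(groups[0]))   # experts, neurons per expert
print(len(shared), len(routed))      # shared-expert size, routed experts
```

Each expert then keeps only the rows and columns of the original gate, up, and down projection weights that belong to its neuron indices, so the experts together still cover the full dense FFN.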

💬 Supervised Fine-Tuning (SFT)

NOTICE: Please create the logs/ folder manually: mkdir -p logs

We provide simple examples of SFT to build chatbots. Please refer to SFT docs for more details.

📑 Citation

@misc{llama-moe-v2,
  title={LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training},
  author={Xiaoye Qu and Daize Dong and Xuyang Hu and Tong Zhu and Weigao Sun and Yu Cheng},
  year={2024},
  month={Nov},
  url={https://arxiv.org/abs/2411.15708}
}

LLaMA-MoE Team w/ ❤️
