LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training

πŸ“’ A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!

πŸš€ Updates

πŸ“†[2024-12-03] 🎈 We scale the training data to 8.4B tokens and release the new MLP-MoE (8top2) model, which reaches 59.6 on GSM8K and 57.1 on HumanEval.

πŸŽ‰ Introduction

LLaMA-MoE-v2 is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA3. We build LLaMA-MoE-v2 in two steps:

  1. Partition LLaMA's FFN (MLP) layers or attention layers into sparse experts and insert a top-K gate for each layer of experts (a minimal sketch follows the framework figure below).
  2. Supervised fine-tune the constructed MoE models on open-source data with two-stage training.

(Figure: overall framework of LLaMA-MoE-v2)
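
Step 1 can be pictured with a minimal, self-contained sketch (hypothetical class and parameter names, not the repository's actual modules): a dense SwiGLU FFN is cut into equally sized experts, and a learned top-K gate selects and mixes a few experts per token.

# Minimal sketch of step 1 (hypothetical names, not the repo's actual classes):
# split a SwiGLU FFN into `num_experts` equal slices and route each token
# through the top-k experts chosen by a learned linear gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    def __init__(self, hidden_size=64, intermediate_size=128, num_experts=8, top_k=2):
        super().__init__()
        assert intermediate_size % num_experts == 0
        expert_size = intermediate_size // num_experts
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # each expert is an equally sized slice of a SwiGLU FFN
        self.experts = nn.ModuleList(
            nn.ModuleDict({
                "gate_proj": nn.Linear(hidden_size, expert_size, bias=False),
                "up_proj": nn.Linear(hidden_size, expert_size, bias=False),
                "down_proj": nn.Linear(expert_size, hidden_size, bias=False),
            })
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)               # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                                   # (tokens, top_k) routing mask
            token_ids = hit.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_ids.numel() == 0:
                continue
            xe = x[token_ids]
            h = expert["down_proj"](F.silu(expert["gate_proj"](xe)) * expert["up_proj"](xe))
            w = (weights * hit)[token_ids].sum(dim=-1, keepdim=True)
            out[token_ids] += w * h
        return out

print(TopKMoEFFN()(torch.randn(4, 64)).shape)  # torch.Size([4, 64])

Routing each token through only the top-K experts is what keeps the number of activated parameters small even though the total parameter count matches the original dense FFN.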

πŸ”₯ Features

  1. Support for building both Attention MoE and MLP MoE:
    1. build Attention MoE models from attention layers
    2. build MLP MoE models from MLP layers
  2. Multiple expert construction methods (see the sketch after this list):
    1. random MLP MoE construction (vanilla)
    2. residual MLP MoE construction (residual)
  3. Packed padding training
  4. Training with megablocks
  5. Two-stage SFT on open-source data (first-stage and two-stage recipes)
  6. Support for building MoE from different base models
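
As a rough illustration of the two construction methods above (a sketch with a hypothetical helper, not the repository's conversion scripts): the vanilla variant randomly partitions the dense FFN's intermediate neurons into equal groups, one per expert; per the (1+7top1) naming, the residual variant additionally keeps one always-active shared expert and routes tokens only among the rest.

# Sketch of "vanilla" (random) MLP expert construction: randomly partition the
# intermediate neurons of a dense FFN into equal groups, one group per expert.
# Hypothetical helper, not the repository's convert scripts.
import torch

def random_split_ffn(gate_proj, up_proj, down_proj, num_experts=8, seed=0):
    """gate_proj/up_proj: (intermediate, hidden); down_proj: (hidden, intermediate)."""
    intermediate = gate_proj.shape[0]
    assert intermediate % num_experts == 0
    # shuffle the intermediate-neuron indices, then cut them into equal groups
    perm = torch.randperm(intermediate, generator=torch.Generator().manual_seed(seed))
    experts = []
    for group in perm.chunk(num_experts):
        experts.append({
            "gate_proj": gate_proj[group],     # rows of the gate/up projections
            "up_proj": up_proj[group],
            "down_proj": down_proj[:, group],  # matching columns of the down projection
        })
    return experts

experts = random_split_ffn(torch.randn(128, 64), torch.randn(128, 64), torch.randn(64, 128))
print(len(experts), experts[0]["gate_proj"].shape)  # 8 torch.Size([16, 64])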

πŸš€ QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# load the released MoE checkpoint (remote code is needed for the custom MoE modules)
model_dir = "LLaMA-MoE-v2/LLaMA-MoE-v2-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous for?"

# wrap the query in the LLaMA-3 chat format expected by the SFT model
input_text = f"<|start_header_id|>user<|end_header_id|>\n\n{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, do_sample=False)  # greedy decoding
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
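
The prompt string above hand-writes the LLaMA-3 chat format. If the released tokenizer bundles a chat template (an assumption; check its tokenizer_config.json), the same prompt can be produced with tokenizer.apply_chat_template:

# Equivalent prompt construction via the tokenizer's chat template
# (assumes the checkpoint ships a LLaMA-3-style chat template).
messages = [{"role": "user", "content": "Suzhou is famous for?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant header so the model replies
    return_tensors="pt",
).to("cuda:0")

pred = model.generate(inputs, max_length=50, do_sample=False)
print(tokenizer.decode(pred[0], skip_special_tokens=True))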

βš™οΈ Installation

  1. Prepare the conda environment: conda create -n smoe python=3.11 (if your environment name is not smoe, you may need to change it in the launching scripts)
  2. Add the correct environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:
    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  3. Apply the variables: source ~/.bashrc
  4. Install PyTorch (CUDA 11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  5. Install dependencies: pip install -r requirements.txt
  6. Install flash-attn: pip install flash-attn==2.6.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid errors.
  7. Install the latest Git: conda install git
  8. Clone the repo: git clone git@github.com:LLaMA-MoE/LLaMA-MoE-v2.git (if you haven't set up an SSH key for GitHub, you may not be able to clone over SSH; check the GitHub docs on SSH keys)
  9. Change to the repo directory: cd LLaMA-MoE-v2
  10. Install smoe in editable mode: pip install -e .[dev]
  11. Set up pre-commit hooks: pre-commit install
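
A quick sanity check after installation (a hedged convenience, not part of the official setup) is to confirm that the CUDA build of PyTorch and flash-attn import cleanly:

# Post-install sanity check: verify the CUDA-enabled PyTorch build and flash-attn import.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("built for CUDA:", torch.version.cuda)  # expect 11.8 per the install step above

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as e:
    print("flash-attn not importable:", e)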

πŸ“Š Model Performance

| Model | #Activated Experts | #Experts | #Activated Params | SFT Model |
| :-- | :-: | :-: | :-: | :-: |
| LLaMA-MLP-MoE (2/8) | 2 | 8 | 3.8B | πŸ€— SFT |
| LLaMA-MLP-MoE (1+1/7) | 2 | 8 | 3.8B | πŸ€— SFT |

| Model | #Training Tokens | MMLU(5) | GSM8k(8) | HumanEval(pass@10) | IFEval | BoolQ(32) | SciQ | PIQA | ARC-c(25) | TruthfulQA | HellaSwag(10) |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| LLaMA3-8B | 15T | 67.2 | 76.5 | 71.4 | 76.5 | 83.0 | 93.2 | 78.5 | 61.9 | 51.7 | 78.8 |
| INCITE-3B | 1T | 25.1 | 2.1 | 6.92 | 30.1 | 66.5 | 94.7 | 74.4 | 40.2 | 36.4 | 65.6 |
| Sheared-LLaMA-2.7B | 50B | 28.2 | 1.9 | 3.2 | 28.8 | 67.6 | 75.8 | 41.1 | 47.6 | 71.2 | 39.0 |
| Gemma-2-2b | 2T | 53.0 | 26.3 | 46.1 | 34.9 | 72.3 | 75.8 | 67.5 | 52.6 | 50.8 | 69.0 |
| Salamandra-2b | 7.8T | 25.1 | 1.90 | 5.82 | 27.7 | 68.0 | 89.8 | 74.7 | 46.3 | 43.4 | 62.3 |
| SmolLM2-1.7B | 11T | 50.4 | 38.5 | 39.1 | 29.0 | 68.2 | 84.3 | 76.0 | 53.2 | 39.9 | 72.6 |
| OpenMoE-3B-9B | 1T | 26.5 | 1.36 | 1.01 | 31.2 | 61.7 | 68.4 | 65.7 | 33.3 | 40.5 | 56.5 |
| LLaMA-MoE-3B-7B | 200B | 28.2 | 4.62 | 12.0 | 28.1 | 68.1 | 88.8 | 77.9 | 44.0 | 33.3 | 73.2 |
| OLMoE-1B-7B | 1T | 53.8 | 40.9 | 40.5 | 35.5 | 80.9 | 94.9 | 80.1 | 55.6 | 43.3 | 79.6 |
| MLP-MoE (8top2) | 7B | 40.6 | 53.1 | 53.5 | 32.7 | 74.6 | 90.6 | 69.3 | 42.8 | 45.6 | 59.0 |
| MLP-MoE (8top2) | 8.4B | 41.0 | 59.6 | 57.1 | 31.7 | 74.5 | 90.2 | 69.5 | 43.3 | 46.9 | 58.1 |
| MLP-MoE (1+7top1) | 7B | 42.7 | 55.0 | 51.2 | 36.0 | 76.9 | 88.8 | 67.9 | 40.2 | 46.9 | 53.7 |

🚧 Expert Construction for MLP MoE

  • Vanilla LLaMA-MoE-v2: sbatch scripts/expert_construction/convert/convert_mixtral_v2.sh
  • Residual LLaMA-MoE-v2: sbatch scripts/expert_construction/convert/convert_mixtral_residual_v2.sh

For more information, please refer to Expert Construction docs.

πŸ’¬ Supervised Fine-Tuning (SFT)

  • NOTICE: please create the logs/ folder manually first: mkdir -p logs

    We provide simple SFT examples for building chatbots. Please refer to the SFT docs for more details.

πŸ“‘ Citation

@misc{llama-moe-v2,
  title={LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training},
  author={Xiaoye Qu and Daize Dong and Xuyang Hu and Tong Zhu and Weigao Sun and Yu Cheng},
  year={2024},
  month={Nov},
  url={https://arxiv.org/abs/2411.15708}
}

LLaMA-MoE Team w/ ❀️
