We propose Rodimus*, comprising Rodimus and Rodimus+, which aims to break the accuracy-efficiency trade-off of vanilla Transformers by introducing several innovative features.
Rodimus:
- A purely recurrent model based on linear attention.
- Incorporates Data-Dependent Tempered Selection (DDTS) for semantic compression (see the sketch below).
- Reduced memory usage.

Rodimus+:
- A hybrid model combining Rodimus with Sliding Window Shared-Key Attention (SW-SKA).
- Enhances semantic, token, and head compression.
- Constant memory footprint with better language-modeling performance.
- Better scaling performance than Transformers.
- A truly lightweight model, with no O(T) memory complexity from a KV cache.
The models are further enhanced with code and math datasets.
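For intuition, below is a generic sketch of the kind of gated linear-attention recurrence Rodimus builds on. It is an illustration only, not the exact DDTS formulation from the paper; the shapes and the gate are simplified assumptions.

```python
import torch

def recurrent_step(S, q, k, v, g):
    """One step of a gated linear-attention recurrence (illustration only).

    S: (d_k, d_v) fixed-size recurrent state (replaces the growing KV cache)
    q, k: (d_k,) query / key, v: (d_v,) value
    g: (d_k,) data-dependent forget gate in (0, 1)
    """
    S = g.unsqueeze(-1) * S + torch.outer(k, v)  # decay old memories, write the new one
    y = q @ S                                    # read out from the compressed state
    return S, y
```

Because the state `S` has a fixed size, memory stays constant regardless of sequence length, which is the source of the efficiency gains listed above.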
| Model | Context Length | HuggingFace |
|---|---|---|
| Rodimus+-1.6B-Base | 4096 | |
| Rodimus+-1.6B-Instruct | 4096 | |
- The latest version of `transformers` is recommended (at least 4.37.0).
- We evaluate our models with `python=3.8` and `torch==2.1.2`.
- If you use Rodimus, you need to install `flash-linear-attention` and `triton>=2.2.0`. If you use Rodimus+, you need to further install `flash-attention`.
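A quick way to sanity-check the environment is the snippet below (a minimal sketch; the import names are the usual ones for these packages).

```python
# Check that the installed versions match the requirements above.
import transformers, torch
print(transformers.__version__)  # recommended >= 4.37.0
print(torch.__version__)         # we evaluate with 2.1.2

import triton        # Rodimus requires triton >= 2.2.0
import fla           # import name of the flash-linear-attention package
# import flash_attn  # additionally required for Rodimus+
```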
In `examples/generation_script.py`, we provide a code snippet showing how to use the model for generation:
```python
import os
import torch
from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# load model
ckpt_dir = "model_path"
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda"
).eval()

# inference
input_prompt = "你好!你是谁?"  # "Hello! Who are you?"
model_inputs = tokenizer(input_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=32)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
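If you want to print tokens as they are produced, transformers' `TextStreamer` can be plugged into the same call (a sketch, assuming the model follows the standard `generate` interface):

```python
from transformers import TextStreamer

# Stream the decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_length=32, streamer=streamer)
```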
In `examples/chat_script.py`, we further show how to chat with Rodimus+:
```python
import os
import torch
from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# load model
ckpt_dir = "model_path"
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda"
).eval()

# inference
input_prompt = "简单介绍一下大型语言模型。"  # "Give a brief introduction to large language models."
messages = [
    {"role": "HUMAN", "content": input_prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    system='You are Rodimus+, created by AntGroup. You are a helpful assistant.',
    tokenize=False,
)
print(text)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=2048)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
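To continue the conversation, append the reply and the next user turn to `messages` and re-apply the chat template. The sketch below decodes only the newly generated tokens; the `"ASSISTANT"` role name is an assumption, so check the tokenizer's chat template for the exact role names.

```python
# Keep only the newly generated tokens as the assistant's reply.
new_tokens = outputs[0][model_inputs["input_ids"].shape[-1]:]
reply = tokenizer.decode(new_tokens, skip_special_tokens=True)

# "ASSISTANT" is an assumed role name; consult the chat template if it differs.
messages.append({"role": "ASSISTANT", "content": reply})
messages.append({"role": "HUMAN", "content": "What are their main applications?"})

text = tokenizer.apply_chat_template(
    messages,
    system='You are Rodimus+, created by AntGroup. You are a helpful assistant.',
    tokenize=False,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=2048)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```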
If you find our work helpful, please consider citing it:
```bibtex
@misc{he2024rodimusbreakingaccuracyefficiencytradeoff,
      title={Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions},
      author={Zhihao He and Hang Yu and Zi Gong and Shizhan Liu and Jianguo Li and Weiyao Lin},
      year={2024},
      eprint={2410.06577},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.06577},
}
```