We propose Rodimus*, comprising Rodimus and Rodimus+, which aims to break the accuracy-efficiency trade-off of vanilla Transformers by introducing several innovative features.
Rodimus:
- A purely recurrent model based on linear attention.
- Incorporates Data-Dependent Tempered Selection (DDTS) for semantic compression (see the sketch below).
- Reduced memory usage.

Rodimus+:
- A hybrid model combining Rodimus with Sliding Window Shared-Key Attention (SW-SKA).
- Enhances semantic, token, and head compression.
- Constant memory footprint with better language-modeling performance.
- Better scaling performance than Transformers.
- A truly lightweight model, with no O(T) memory complexity from a KV cache.
The models are further enhanced with code and math datasets.
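For intuition, below is a generic sketch of the kind of gated linear-attention recurrence Rodimus builds on. It is an illustration only, not the exact DDTS formulation from the paper; the shapes and the gate are simplified assumptions.

```python
import torch

def recurrent_step(S, q, k, v, g):
    """One step of a gated linear-attention recurrence (illustration only).

    S: (d_k, d_v) fixed-size recurrent state (replaces the growing KV cache)
    q, k: (d_k,) query / key, v: (d_v,) value
    g: (d_k,) data-dependent forget gate in (0, 1)
    """
    S = g.unsqueeze(-1) * S + torch.outer(k, v)  # decay old memories, write the new one
    y = q @ S                                    # read out from the compressed state
    return S, y
```

Because the state `S` has a fixed size, memory stays constant regardless of sequence length, which is the source of the efficiency gains listed above.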
| Model | Context Length | HuggingFace |
|---|---|---|
| Rodimus+-1.6B-Base | 4096 | |
| Rodimus+-1.6B-Instruct | 4096 | |
- The latest version of `transformers` is recommended (at least 4.37.0).
- We evaluate our models with `python=3.8` and `torch==2.1.2`.
- If you use Rodimus, you need to install `flash-linear-attention` and `triton>=2.2.0`. If you use Rodimus+, you need to further install `flash-attention`.
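A quick way to sanity-check the environment is the snippet below (a minimal sketch; the import names are the usual ones for these packages).

```python
# Check that the installed versions match the requirements above.
import transformers, torch
print(transformers.__version__)  # recommended >= 4.37.0
print(torch.__version__)         # we evaluate with 2.1.2

import triton        # Rodimus requires triton >= 2.2.0
import fla           # import name of the flash-linear-attention package
# import flash_attn  # additionally required for Rodimus+
```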
In `examples/generation_script.py`, we provide a code snippet showing how to use the model for generation:
```python
import os
import torch
from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# load model
ckpt_dir = "model_path"
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda"
).eval()

# inference
input_prompt = "你好!你是谁?"  # "Hello! Who are you?"
model_inputs = tokenizer(input_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=32)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
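If you want to print tokens as they are produced, transformers' `TextStreamer` can be plugged into the same call (a sketch, assuming the model follows the standard `generate` interface):

```python
from transformers import TextStreamer

# Stream the decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_length=32, streamer=streamer)
```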
In `examples/chat_script.py`, we further show how to chat with Rodimus+:
```python
import os
import torch
from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# load model
ckpt_dir = "model_path"
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda"
).eval()

# inference
input_prompt = "简单介绍一下大型语言模型。"  # "Give a brief introduction to large language models."
messages = [
    {"role": "HUMAN", "content": input_prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    system='You are Rodimus+, created by AntGroup. You are a helpful assistant.',
    tokenize=False,
)
print(text)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=2048)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
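To continue the conversation, append the reply and the next user turn to `messages` and re-apply the chat template. The sketch below decodes only the newly generated tokens; the `"ASSISTANT"` role name is an assumption, so check the tokenizer's chat template for the exact role names.

```python
# Keep only the newly generated tokens as the assistant's reply.
new_tokens = outputs[0][model_inputs["input_ids"].shape[-1]:]
reply = tokenizer.decode(new_tokens, skip_special_tokens=True)

# "ASSISTANT" is an assumed role name; consult the chat template if it differs.
messages.append({"role": "ASSISTANT", "content": reply})
messages.append({"role": "HUMAN", "content": "What are their main applications?"})

text = tokenizer.apply_chat_template(
    messages,
    system='You are Rodimus+, created by AntGroup. You are a helpful assistant.',
    tokenize=False,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=2048)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```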
If you find our work helpful, please consider citing it:
```bibtex
@misc{he2024rodimusbreakingaccuracyefficiencytradeoff,
      title={Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions},
      author={Zhihao He and Hang Yu and Zi Gong and Shizhan Liu and Jianguo Li and Weiyao Lin},
      year={2024},
      eprint={2410.06577},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.06577},
}
```