codefuse-ai/rodimus

If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

Overview

We propose Rodimus*, comprising Rodimus and Rodimus+, which aims to break the accuracy-efficiency trade-off of vanilla Transformers by introducing several innovative features.

Rodimus:

  • Linear attention-based, purely recurrent model.
  • Incorporates Data-Dependent Tempered Selection (DDTS) for semantic compression.
  • Reduced memory usage.

Rodimus+:

  • Hybrid model combining Rodimus with Sliding Window Shared-Key Attention (SW-SKA).
  • Enhances semantic, token, and head compression.

Highlights

  • Constant memory footprint, yet better language-modeling performance.
  • Better scaling behavior than the Transformer.
  • A genuinely lightweight model: no O(T) KV-cache memory, because the recurrent state stays constant-size (see the back-of-the-envelope sketch below).
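
To make the last point concrete, here is a back-of-the-envelope comparison of cache memory; every dimension below is an illustrative assumption, not the actual Rodimus+ configuration:

# Back-of-the-envelope cache arithmetic; all dimensions are assumed,
# illustrative values, not the real Rodimus+ configuration.
layers, hidden, bytes_fp16 = 24, 2048, 2
T = 4096  # sequence length

# Vanilla Transformer: K and V are cached for every past token -> grows with T.
kv_cache = 2 * layers * T * hidden * bytes_fp16

# Recurrent model: one fixed-size state per layer (size illustrative) -> constant in T.
state = layers * hidden * hidden * bytes_fp16

print(f"KV cache at T={T}: {kv_cache / 2**20:.0f} MiB")  # grows linearly with T
print(f"Recurrent state:   {state / 2**20:.0f} MiB")     # independent of T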

Pretrained Checkpoints

The models are enhanced with code and math datasets and are available on Hugging Face:

Model                   Context Length
Rodimus+-1.6B-Base      4096
Rodimus+-1.6B-Instruct  4096
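
If you prefer pulling the weights straight from the Hugging Face Hub, the standard Auto classes should work, assuming the checkpoints ship the custom modeling code used in the examples below; the repo id here is hypothetical:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical Hub repo id; substitute the actual checkpoint name.
repo_id = "codefuse-ai/Rodimus-Plus-1.6B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True,  # required for custom architectures
).eval()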

Quick Start

Installation

  1. The latest version of transformers is recommended (at least 4.37.0).
  2. We evaluated our models with python==3.8 and torch==2.1.2.
  3. For Rodimus, install flash-linear-attention and triton>=2.2.0; for Rodimus+, additionally install flash-attention. A pip sketch follows this list.
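
A minimal installation sketch; the PyPI distribution names (flash-linear-attention, flash-attn) are assumptions based on the projects' usual packaging, and the CUDA kernels have their own build requirements:

# Assumed PyPI package names; check each project's own install docs.
pip install "transformers>=4.37.0" torch==2.1.2
pip install "triton>=2.2.0" flash-linear-attention  # for Rodimus
pip install flash-attn                              # additionally, for Rodimus+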

Examples

In examples/generation_script.py, we provide a code snippet showing how to use the model for generation:

import torch
from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# load model
ckpt_dir = "model_path"
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda"
).eval()

# inference
input_prompt = "你好!你是谁?"
model_inputs = tokenizer(input_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=32)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print(response)
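
Sampling-based decoding should also work through the standard Hugging Face generate API; a minimal sketch with illustrative, untuned parameter values:

# Sampling-based decoding; parameter values are illustrative, not tuned.
outputs = model.generate(
    **model_inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])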

In examples/chat_script.py, we further show how to chat with Rodimus+:

import torch
from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# load model
ckpt_dir = "model_path"
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda"
).eval()

# inference
input_prompt = "简单介绍一下大型语言模型。"
messages = [
    {"role": "HUMAN", "content": input_prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    system='You are Rodimus+, created by AntGroup. You are a helpful assistant.',
    tokenize=False,
)
print(text)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=2048)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print(response)
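
For multi-turn conversations, strip the prompt from the output, append the reply to messages, and re-apply the template. A sketch: the "ASSISTANT" role tag is an assumption mirroring the "HUMAN" tag above, so verify the actual role names in the tokenizer's chat template.

# Multi-turn sketch. "ASSISTANT" is an assumed role tag; check the real role
# names in the tokenizer's chat template before relying on it.
reply_ids = outputs[0][model_inputs["input_ids"].shape[1]:]  # drop the prompt tokens
reply = tokenizer.decode(reply_ids, skip_special_tokens=True)

messages.append({"role": "ASSISTANT", "content": reply})
messages.append({"role": "HUMAN", "content": "What are their main limitations?"})

text = tokenizer.apply_chat_template(
    messages,
    system='You are Rodimus+, created by AntGroup. You are a helpful assistant.',
    tokenize=False,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=2048)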

Citation

If you find our work helpful, please consider citing it:

@misc{he2024rodimusbreakingaccuracyefficiencytradeoff,
      title={Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions}, 
      author={Zhihao He and Hang Yu and Zi Gong and Shizhan Liu and Jianguo Li and Weiyao Lin},
      year={2024},
      eprint={2410.06577},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.06577}, 
}
