mamba-ssm

Optimized inference-only implementation of Mamba [1] written in Rust.

Description

The primary goal of this project is to provide an inference backend that can run Mamba on an Apple Silicon MacBook without CUDA dependencies entangled in the code. Initial development treats CPU-only execution as a first-class target, with linear algebra routines backed by Accelerate or Intel MKL.

The main dependency of this project is Candle, so the set of supported platforms is largely determined by what that framework supports.

Supported Platforms

CPU
Accelerate framework (via --features accelerate)
Intel MKL (via --features mkl)
- It probably works, but I haven't tested it yet
Metal
- Still relatively unoptimized
CUDA (via --features cuda)
- It works, but no CUDA-specific optimization has been done yet.
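
The feature flags in the list above map directly onto build commands. The following is a minimal sketch of that mapping, not code from this repo: the `Backend` enum and both helper functions are hypothetical names introduced for illustration. Metal is left out because the list does not name a feature flag for it.

```rust
/// Backends from the Supported Platforms list that have a documented
/// build configuration. (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    PlainCpu,
    Accelerate,
    Mkl,
    Cuda,
}

/// Cargo feature name for a backend, or None for the default CPU build.
fn feature_flag(backend: Backend) -> Option<&'static str> {
    match backend {
        Backend::PlainCpu => None,
        Backend::Accelerate => Some("accelerate"),
        Backend::Mkl => Some("mkl"),
        Backend::Cuda => Some("cuda"),
    }
}

/// Assemble the release build command for a backend.
fn build_command(backend: Backend) -> String {
    match feature_flag(backend) {
        Some(flag) => format!("cargo build --release --features {}", flag),
        None => "cargo build --release".to_string(),
    }
}

fn main() {
    let backends = [
        Backend::PlainCpu,
        Backend::Accelerate,
        Backend::Mkl,
        Backend::Cuda,
    ];
    for backend in backends {
        println!("{:?}: {}", backend, build_command(backend));
    }
}
```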

Supported Features

Getting Started

Prepare a Mamba safetensors model, config.json, and tokenizer.json, and move them to the ./models directory.
- Run ./download.sh to download mamba-2.8b-slimpj and the tokenizer from gpt-neox-20b
Install Rust, then run:

cargo build --release
target/release/mamba-cli --prompt "Mamba is the"

You can also specify the model and config.json files by passing flags:

target/release/mamba-cli -m models/mamba-2.8b-slimpj/model.safetensors -c models/mamba-2.8b-slimpj/config.json --prompt "Mamba is the"

For other usage options, such as passing the prompt from a file, see the help output:

target/release/mamba-cli --help
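
If the CLI fails at startup, a quick check that the model files are where the commands above expect them can save some time. This is a small standalone sketch, not part of mamba-cli: `missing_files` is a hypothetical helper, and the directory layout is assumed from the -m and -c paths shown above. The location of tokenizer.json is not stated here, so it is not checked.

```rust
use std::path::{Path, PathBuf};

/// Return the paths of the expected files that are absent from `dir`.
/// (Hypothetical helper, for illustration only.)
fn missing_files(dir: &Path, expected: &[&str]) -> Vec<PathBuf> {
    expected
        .iter()
        .map(|name| dir.join(name))
        .filter(|path| !path.is_file())
        .collect()
}

fn main() {
    // Assumed layout, matching the -m/-c example paths; adjust as needed.
    let dir = Path::new("models/mamba-2.8b-slimpj");
    let expected = ["model.safetensors", "config.json"];

    let missing = missing_files(dir, &expected);
    if missing.is_empty() {
        println!("all expected files found in {}", dir.display());
    } else {
        for path in &missing {
            println!("missing: {}", path.display());
        }
    }
}
```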

Building with Apple Accelerate Framework support

cargo build --release --features accelerate

Building with Intel MKL framework support

cargo build --release --features mkl

Generation speed with CPU

Currently, with the Mamba 2.8b model, generation runs at about 6.5 tokens/s in FP32 on CPU only, on an M3 Max MacBook Pro.

$ target/release/mamba-cli --temperature 0 -n 50 -f prompt.txt
avx: false, neon: true, simd128: false, f16c: false, num_threads: 16, cuda: false, metal: false, accelerate: true, mkl: false
temp: 0.00 repeat-penalty: 1.10 repeat-last-n: 64
loaded the model in 1.605674125s
generating 50 tokens with seed 16889006583945703583

Prompt processing time (98 tokens at 24.68 token/s)
I am that merry wanderer of the night.
I jest to Oberon and make him smile
When I a fat and bean-fed horse beguile,
Neighing in likeness of a filly foal:
And sometime lurk I in a gossip’s bowl,
In very likeness of a roasted crab,
And when she drinks, against her lips I bob
And on her wither’d dewlap pour the ale.
I am that merry jester of the night;
When he is sick and sad, I make him smile:
If his wife be angry with him, then I
Make him laugh, as if a fool were free.
But when she
50 tokens generated (6.50 token/s)
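
The token/s figures in the log above are presumably token counts divided by wall-clock seconds. The sketch below works through that arithmetic with the logged numbers. It is an illustration under that assumption, not the timing code used by mamba-cli, and both function names are hypothetical.

```rust
use std::time::Duration;

/// Throughput in tokens per second for a measured duration.
/// (Hypothetical helper, for illustration only.)
fn tokens_per_sec(tokens: usize, elapsed: Duration) -> f64 {
    tokens as f64 / elapsed.as_secs_f64()
}

/// Inverse relation: wall-clock seconds implied by a token count and a rate.
fn implied_seconds(tokens: usize, rate: f64) -> f64 {
    tokens as f64 / rate
}

fn main() {
    // Figures taken from the log above.
    let prompt_secs = implied_seconds(98, 24.68);
    let gen_secs = implied_seconds(50, 6.50);
    println!("prompt: {:.2} s, generation: {:.2} s", prompt_secs, gen_secs);

    // Round trip: recover the generation rate from the implied duration.
    let rate = tokens_per_sec(50, Duration::from_secs_f64(gen_secs));
    println!("{:.2} token/s", rate);
}
```

With the logged values, prompt processing took roughly 3.97 s and generating 50 tokens took roughly 7.69 s.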

References

[1] "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" Albert Gu and Tri Dao https://arxiv.org/abs/2312.00752

[2] "The Annotated S4" Sasha Rush and Sidd Karamcheti https://srush.github.io/annotated-s4

[3] "Error Analysis and Improving the Accuracy of Winograd Convolution for Deep Neural Networks" Barbara Barabasz, Andrew Anderson, Kirk M. Soodhalter, David Gregg https://arxiv.org/abs/1803.10986

[4] "Winograd Convolution for Deep Neural Networks: Efficient Point Selection" Syed Asad Alam, Andrew Anderson, Barbara Barabasz, David Gregg https://arxiv.org/pdf/2201.10369.pdf

Code references

Original implementation: https://github.com/state-spaces/mamba
This repo was initially adapted from the mamba-minimal Candle example: https://github.com/huggingface/candle/tree/main/candle-examples/examples/mamba-minimal
Instructive minimal implementation: https://github.com/johnma2006/mamba-minimal
