Quantize llama3 70B on 3x 4090 #1772

Unanswered

teis-e asked this question in Q&A

teis-e
Jun 12, 2024

Right now I run inference through Transformers with on-the-fly 4-bit quantization.

Can I build an engine with 4-bit quantization without having to fit the whole unquantized model on the GPUs?

I already have the 8B model running with Triton and a working environment! I just need some help with the commands for 70B.
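
For context, here is a rough back-of-the-envelope sketch of why the unquantized model is the problem. It assumes a nominal 70e9 parameters, 24 GB per RTX 4090, and counts weights only (no activations, KV cache, or quantization scale overhead), so the real numbers will be somewhat higher:

```python
# Rough weight-memory estimate; every figure here is a nominal assumption.
PARAMS = 70e9        # nominal parameter count of Llama 3 70B
GPU_MEM_GB = 24      # nominal memory of one RTX 4090
NUM_GPUS = 3

def weight_gb(params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

fp16_gb = weight_gb(PARAMS, 16)        # unquantized half-precision weights
int4_gb = weight_gb(PARAMS, 4)         # 4-bit quantized weights
total_gpu_gb = GPU_MEM_GB * NUM_GPUS   # combined memory across the three cards

print(f"FP16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {int4_gb:.0f} GB")
print(f"GPU memory:    {total_gpu_gb} GB")
print("FP16 fits: ", fp16_gb <= total_gpu_gb)
print("4-bit fits:", int4_gb <= total_gpu_gb)
```

By this estimate the 4-bit weights should fit across the three cards, but the FP16 checkpoint would not, which is why I am asking whether the quantization step itself can avoid loading the full model onto the GPUs.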

Replies: 0 comments
