I am deploying Qwen/QwQ-32B-Preview in a 4 x L4 (24 GB per card) environment. I have an fp8-quantised model (quantised via the LLM API) that fits in roughly 32 GB, leaving the remaining memory for the KV cache. I see that ReDrafter supports fp8 in the support matrix.
I have an fp32 ReDrafter drafter that was trained against the bf16 version of the base model, and I would like to convert the fp8-quantised base model (modelopt format) and the fp32 drafter together. However, the convert script only accepts a base model in fp16/fp32/bf16 (link). A bf16 base model would allocate too much memory and leave little for the KV cache.
I am wondering whether it would be possible to make the conversion work with a base model that is already quantised to fp8 (as below); I am happy to modify the conversion script as needed. A rough sketch of how I produced the fp8 checkpoint follows.
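For context, this is roughly how I quantised the base model with the LLM API (a minimal sketch, assuming a recent TensorRT-LLM where `QuantConfig`/`QuantAlgo` are exposed under `tensorrt_llm.llmapi`; paths and the output directory name are placeholders, not my exact command):

```python
# Illustrative sketch of the LLM-API-based fp8 quantisation (placeholder paths).
from tensorrt_llm.llmapi import LLM, QuantConfig, QuantAlgo

quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,           # fp8 weights/activations via modelopt
    kv_cache_quant_algo=QuantAlgo.FP8,  # fp8 KV cache to save memory
)

llm = LLM(
    model="Qwen/QwQ-32B-Preview",
    quant_config=quant_config,
    tensor_parallel_size=4,             # 4 x L4 (24 GB each)
)
llm.save("qwq-32b-preview-fp8-tp4")     # placeholder output directory
```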
Quantisation params of base model
fp8 base model config
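(For reference, the quantisation section of the modelopt-exported checkpoint config looks roughly like the following; this is an illustrative snippet, not my exact config:)

```json
{
  "quantization": {
    "quant_algo": "FP8",
    "kv_cache_quant_algo": "FP8"
  }
}
```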