Correct me if I'm wrong, but quantizing would require loading the models in their unquantized form (as per the `torch.load` call at line 126 of https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv/convert_pytorch_to_ggml.py). Not to mention how much heavier the unquantized models are on bandwidth.
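For reference, that line boils down to a single full-checkpoint `torch.load`; a rough sketch of why the whole unquantized model ends up in RAM (the path is a placeholder, and this is an illustration rather than the script's exact code):

```python
import torch

# torch.load reads the entire .pth checkpoint into host RAM at once
# (the path here is just a placeholder).
state_dict = torch.load('RWKV-4-Raven-7B-v11x.pth', map_location='cpu')

# Every parameter tensor is now resident in memory before any conversion
# or quantization starts, so peak RAM is roughly the unquantized model size.
total_bytes = sum(t.numel() * t.element_size() for t in state_dict.values())
print(f'{total_bytes / 1024 ** 3:.1f} GiB loaded')
```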
Only PyTorch -> rwkv.cpp conversion requires loading the whole model into RAM; quantization is done tensor-by-tensor. You are right about the bandwidth, though.
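To illustrate what tensor-by-tensor means for memory, here is a minimal sketch; it is not rwkv.cpp's actual quantization code, and the file layout and Q8-style scheme are made up for the example:

```python
import numpy as np

# Toy streaming quantizer: each tensor is read, quantized, and written out
# before the next one is touched, so peak memory is one tensor, not the model.
# Assumed (made-up) layout: uint32 element count followed by float32 values.

def quantize_q8_like(values: np.ndarray) -> tuple[np.float32, np.ndarray]:
    # Symmetric 8-bit quantization of a single tensor: one scale + int8 payload.
    scale = np.float32(max(float(np.abs(values).max()), 1e-8) / 127.0)
    return scale, np.round(values / scale).astype(np.int8)

def quantize_stream(src_path: str, dst_path: str) -> None:
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while count_bytes := src.read(4):
            count = int(np.frombuffer(count_bytes, dtype=np.uint32)[0])
            tensor = np.frombuffer(src.read(count * 4), dtype=np.float32)
            scale, quantized = quantize_q8_like(tensor)
            dst.write(np.uint32(count).tobytes())
            dst.write(scale.tobytes())
            dst.write(quantized.tobytes())
```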
I have uploaded some quantized RWKV-4-Raven models to HuggingFace at LoganDark/rwkv-4-raven-ggml. Conversion took about 2 hours, and upload took about 24 hours and 500GB of disk space.
At the time of writing, the available models are:
| Name | f32 | f16 | Q4_0 | Q4_1 | Q4_2 | Q5_1 | Q8_0 |
|------|-----|-----|------|------|------|------|------|
| RWKV-4-Raven-1B5-v11-Eng99-20230425-ctx4096 | Yes | Yes | Yes | No | Yes | Yes | Yes |
| RWKV-4-Raven-3B-v11-Eng99-20230425-ctx4096 | Yes | Yes | Yes | No | Yes | Yes | Yes |
| RWKV-4-Raven-7B-v11x-Eng99-20230429-ctx8192 | Yes | Yes | Yes | No | Yes | Yes | Yes |
| RWKV-4-Raven-14B-v11x-Eng99-20230501-ctx8192 | Split | Yes | Yes | No | Yes | Yes | Yes |
Feel free to create a discussion if you have a request.
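In case it saves someone a full clone, a single quantized file can be fetched with `huggingface_hub`; the filename below is only a guess at the naming scheme, so check the repo's file listing for the real names:

```python
from huggingface_hub import hf_hub_download

# Download one quantized model from LoganDark/rwkv-4-raven-ggml.
# The filename is an assumption -- consult the repo for the actual file names.
path = hf_hub_download(
    repo_id='LoganDark/rwkv-4-raven-ggml',
    filename='RWKV-4-Raven-1B5-v11-Eng99-20230425-ctx4096-Q5_1.bin',
)
print(path)  # local cache path to the downloaded .bin
```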