llama : support RWKV v6 models (#8980)
* convert_hf_to_gguf: Add support for RWKV v6
Signed-off-by: Molly Sophia <[email protected]>
* Add RWKV tokenization
* Fix build
Signed-off-by: Molly Sophia <[email protected]>
* Do not use special tokens when matching in RWKV tokenizer
* Fix model loading
* Add (broken) placeholder graph builder for RWKV
* Add workaround for kv cache
* Add logits conversion to rwkv5
* Add rwkv5 layer norms
* Add time mix KVRG & correct merge mistake
* Add remaining time mix parameters
* Add time mix output loading
* Add placeholder llm_build_time_mix
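For reference, these time-mix tensors (receptance/key/value/gate plus the decay ``w`` and bonus ``u``) feed a per-head linear-attention recurrence. Roughly, following the published RWKV-6 formulation and glossing over exact indexing conventions, with a per-head state ``S`` of size head_size × head_size:

$$
\begin{aligned}
S_t   &= \operatorname{diag}(w_t)\, S_{t-1} + k_t^{\top} v_t \\
wkv_t &= S_{t-1} + \operatorname{diag}(u)\, k_t^{\top} v_t \\
o_t   &= r_t \, wkv_t
\end{aligned}
$$

The per-head normalization, the SiLU gate ``g``, and the output projection are applied to ``o_t`` afterwards and are left out of this sketch.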
* Fix build
Signed-off-by: Molly Sophia <[email protected]>
* Load more tensors for rwkv v6
Signed-off-by: Molly Sophia <[email protected]>
* Fix rwkv tokenizer
Signed-off-by: Molly Sophia <[email protected]>
* ggml: Add unary operator Exp
Signed-off-by: Molly Sophia <[email protected]>
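RWKV-6 needs an elementwise exponential in its graph (e.g. for the decay terms), which is what this new unary op provides. A semantics-only sketch with illustrative names, not the actual ggml kernel:

```cpp
#include <cmath>
#include <cstddef>

// Semantics-only sketch of an elementwise exp forward pass over a plain
// float buffer; the real ggml op operates on ggml tensors and backends.
static void exp_forward(const float * src, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = std::exp(src[i]);
    }
}
```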
* RWKV v6 graph building
Signed-off-by: Molly Sophia <[email protected]>
* Add ``rescale_every_n_layers`` parameter
Signed-off-by: Molly Sophia <[email protected]>
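This mirrors the ``rescale_every`` convention of the reference RWKV implementations: to keep fp16 activations in range, the residual stream is halved every n layers. A minimal sketch of that convention, assuming a simple per-layer loop (illustrative names, not the llama.cpp graph code):

```cpp
#include <vector>

// Halve the residual stream every rescale_every_n_layers layers
// (0 disables rescaling). Names here are illustrative.
static void maybe_rescale(std::vector<float> & hidden, int layer_idx, int rescale_every_n_layers) {
    if (rescale_every_n_layers > 0 && (layer_idx + 1) % rescale_every_n_layers == 0) {
        for (float & x : hidden) {
            x *= 0.5f;
        }
    }
}
```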
* Add ``wkv.head_size`` key for RWKV
so that it does not reuse the Mamba SSM parameters
Signed-off-by: Molly Sophia <[email protected]>
* Fix offloading layers to CUDA
Signed-off-by: Molly Sophia <[email protected]>
* Fix parallel inference for RWKV
Signed-off-by: Molly Sophia <[email protected]>
* Remove trailing whitespaces
Signed-off-by: Molly Sophia <[email protected]>
* build_rwkv: Avoid using inplace operations
Signed-off-by: Molly Sophia <[email protected]>
* convert_hf_to_gguf: rwkv: Avoid using ``eval``
Signed-off-by: Molly Sophia <[email protected]>
* convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually
Signed-off-by: Molly Sophia <[email protected]>
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <[email protected]>
* ggml: Add backward computation for unary op ``exp``
Signed-off-by: Molly Sophia <[email protected]>
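Since d/dx exp(x) = exp(x), the backward pass only needs to scale the incoming gradient by the forward result. A matching sketch (again with illustrative names, not the ggml implementation):

```cpp
#include <cstddef>

// With y = exp(x): dL/dx = dL/dy * exp(x) = dL/dy * y, so the forward
// output y can be reused directly.
static void exp_backward(const float * grad_out, const float * y, float * grad_in, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        grad_in[i] = grad_out[i] * y[i];
    }
}
```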
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <[email protected]>
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <[email protected]>
* Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV
Signed-off-by: Molly Sophia <[email protected]>
* build_rwkv6: Simplify graph
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Detect model.type
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Fix tensor loading for 7B/14B models
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Fix group_norm assertion failure with Metal
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Clean up
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Add quantization tensor exclusion
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Use the new advanced batch splits
Signed-off-by: Molly Sophia <[email protected]>
* Update src/llama.cpp
Co-authored-by: compilade <[email protected]>
* llama: rwkv6: Use ``ggml_norm`` instead of ``ggml_group_norm``
Co-authored-by: compilade <[email protected]>
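A group norm with one group per head over the attention output is the same as a (non-affine) layer norm applied to each head's head_size-sized slice, which is why ``ggml_norm`` after a per-head reshape can stand in for ``ggml_group_norm``. A plain sketch of that equivalence (not the graph code itself; learned scale/bias would be applied separately):

```cpp
#include <cmath>
#include <cstddef>

// Normalize each head's head_size-sized slice of x independently
// (mean 0, variance 1). Learned scale/bias would be applied afterwards.
static void per_head_norm(float * x, size_t n_head, size_t head_size, float eps) {
    for (size_t h = 0; h < n_head; ++h) {
        float * v = x + h * head_size;
        float mean = 0.0f;
        for (size_t i = 0; i < head_size; ++i) mean += v[i];
        mean /= (float) head_size;
        float var = 0.0f;
        for (size_t i = 0; i < head_size; ++i) var += (v[i] - mean) * (v[i] - mean);
        var /= (float) head_size;
        const float inv_std = 1.0f / std::sqrt(var + eps);
        for (size_t i = 0; i < head_size; ++i) v[i] = (v[i] - mean) * inv_std;
    }
}
```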
* llama: rwkv6: Apply code style and misc changes
Signed-off-by: Molly Sophia <[email protected]>
* converter: Use class name ``Rwkv6Model``
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Make use of key ``feed_forward_length``
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Add kv ``time_mix_extra_dim`` and ``time_decay_extra_dim``
Signed-off-by: Molly Sophia <[email protected]>
* converter: Match ``new_name`` instead of ``name`` for float32 explicit tensors
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Keep ``time_mix_w1/w2`` as F32
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Remove unused nodes
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Apply code format changes
Signed-off-by: Molly Sophia <[email protected]>
* llama: rwkv6: Add lora for some supported tensors
Currently: att.key/receptance/value/gate/output, ffn.receptance/key/value, and head.weight
Signed-off-by: Molly Sophia <[email protected]>
* rwkv : speed-up tokenization using trie
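The RWKV World tokenizer does greedy longest-match over raw byte sequences, so a byte trie lets the matcher walk each position once instead of rescanning the vocabulary. A self-contained sketch of that idea (illustrative structure, not the exact implementation added here):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Byte trie for greedy longest-match tokenization (sketch).
// Nodes live in a flat vector; children map a byte to a node index.
struct byte_trie {
    struct node {
        std::map<uint8_t, int32_t> next;
        int32_t token_id = -1; // -1: no vocab entry ends at this node
    };
    std::vector<node> nodes;

    byte_trie() { nodes.emplace_back(); } // node 0 is the root

    void insert(const std::string & bytes, int32_t id) {
        int32_t cur = 0;
        for (unsigned char c : bytes) {
            auto it = nodes[cur].next.find(c);
            if (it == nodes[cur].next.end()) {
                nodes[cur].next[c] = (int32_t) nodes.size();
                cur = (int32_t) nodes.size();
                nodes.emplace_back();
            } else {
                cur = it->second;
            }
        }
        nodes[cur].token_id = id;
    }

    // Longest vocab entry matching text at pos; returns -1 if none matches.
    int32_t longest_match(const std::string & text, size_t pos, size_t & match_len) const {
        int32_t cur = 0, best_id = -1;
        size_t best_len = 0;
        for (size_t i = pos; i < text.size(); ++i) {
            auto it = nodes[cur].next.find((uint8_t) (unsigned char) text[i]);
            if (it == nodes[cur].next.end()) break;
            cur = it->second;
            if (nodes[cur].token_id >= 0) {
                best_id  = nodes[cur].token_id;
                best_len = i - pos + 1;
            }
        }
        match_len = best_len;
        return best_id;
    }
};

// Greedy tokenization: always take the longest match at the current position.
static std::vector<int32_t> tokenize_greedy(const byte_trie & trie, const std::string & text) {
    std::vector<int32_t> out;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t len = 0;
        const int32_t id = trie.longest_match(text, pos, len);
        if (id < 0) {
            ++pos; // no match: a real tokenizer falls back to single-byte tokens here
            continue;
        }
        out.push_back(id);
        pos += len;
    }
    return out;
}
```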
* minor : style + indentation
* llama: rwkv6: Avoid division by zero
Co-authored-by: compilade <[email protected]>
* ggml: rwkv_wkv: Avoid copying the state
Signed-off-by: Molly Sophia <[email protected]>
---------
Signed-off-by: Molly Sophia <[email protected]>
Co-authored-by: Layl Bongers <[email protected]>
Co-authored-by: compilade <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>