A question about your llama.cpp RWKV adaptation #7

guoguo1314 · 2024-12-12T08:54:09Z

There is a piece of code I don't fully understand. Here is my rough understanding; please correct me if I'm wrong.
Since the ggml_rwkv_wkv operator computes token by token, I think that in

```cpp
struct ggml_tensor * r = ggml_reshape_4d(ctx, llm_build_lora_mm(lctx, ctx, layer->time_mix_receptance, xr), head_size, 1,         head_count, n_tokens);
struct ggml_tensor * k = ggml_reshape_4d(ctx, llm_build_lora_mm(lctx, ctx, layer->time_mix_key,        xk), 1,         head_size, head_count, n_tokens);
struct ggml_tensor * v = ggml_reshape_4d(ctx, llm_build_lora_mm(lctx, ctx, layer->time_mix_value,      xv), head_size, 1,         head_count, n_tokens);
```

the `1` dimension serves no purpose, because the ggml_rwkv_wkv operator code iterates over n_tokens, head_count, head_size. The r, k and v above could even all be reshaped to (head_size, head_count, n_tokens).
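The claim above can be checked at the level of memory offsets. The following is a minimal sketch, not ggml code: `flat_offset` is a hypothetical helper that models a contiguous tensor whose `ne[0]` is the fastest-varying dimension, and the sizes are made up. It assumes the reshape only relabels dimensions and does not move data.

```python
from itertools import product

def flat_offset(index, ne):
    """Element offset in a contiguous tensor with ggml-style shape `ne`,
    where ne[0] is the fastest-varying (innermost) dimension."""
    offset, stride = 0, 1
    for i, n in zip(index, ne):
        offset += i * stride
        stride *= n
    return offset

head_size, head_count, n_tokens = 4, 2, 3       # small illustrative sizes

ne_rv = (head_size, 1, head_count, n_tokens)    # 4D shape used for r and v
ne_k = (1, head_size, head_count, n_tokens)     # 4D shape used for k
ne_3d = (head_size, head_count, n_tokens)       # proposed 3D shape

def same_layout():
    """True if element (s, h, t) lands at the same offset under all three shapes."""
    for t, h, s in product(range(n_tokens), range(head_count), range(head_size)):
        o3 = flat_offset((s, h, t), ne_3d)
        if flat_offset((s, 0, h, t), ne_rv) != o3:
            return False
        if flat_offset((0, s, h, t), ne_k) != o3:
            return False
    return True

print(same_layout())
```

Under these assumptions every element sits at the same offset in all three shapes, so a kernel that walks the raw buffer by token, head and channel sees identical data either way.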

MollySophia · 2024-12-12T09:02:43Z

Yes. When I submitted the PR I probably wrote it that way out of habit; it can indeed be written the way you describe.

rwkv-qualcomm/rwkv_src/rwkv_v6_modules.py

Line 167 in 3f4cbbe

key = key.view(self.num_heads * seq_length, self.head_size)

In this repo, where the merged custom wkv operator is involved, the tensors are also reshaped to [T, H, S].

guoguo1314 · 2024-12-12T09:08:40Z

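Right, the PyTorch version is written that way so that the last two dimensions can be used for matrix multiplication. Thanks! Please don't close this yet; I may have more questions later.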

MollySophia · 2024-12-13T12:30:14Z

I tried it today: I changed the reshape as described above and removed the ggml_transpose on k/v/r. It does not seem to have much impact on performance.

(py310) molly@lxc-ubuntu:~/workspace/llama.cpp$ ./build/bin/llama-bench -m ../models/rwkv-6-world-7b/rwkv-6-world-7B-F16.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| rwkv6 7B F16                   |  14.45 GiB |     7.64 B | CUDA       |  99 |         pp512 |       1948.48 ± 5.60 |
| rwkv6 7B F16                   |  14.45 GiB |     7.64 B | CUDA       |  99 |         tg128 |         48.20 ± 0.06 |

build: 5555c0c1 (4310)
(py310) molly@lxc-ubuntu:~/workspace/llama.cpp$ ./build-test/bin/llama-bench -m ../models/rwkv-6-world-7b/rwkv-6-world-7B-F16.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| rwkv6 7B F16                   |  14.45 GiB |     7.64 B | CUDA       |  99 |         pp512 |       1949.13 ± 4.63 |
| rwkv6 7B F16                   |  14.45 GiB |     7.64 B | CUDA       |  99 |         tg128 |         48.24 ± 0.07 |

build: 60bbd4eb (4313)
(py310) molly@lxc-ubuntu:~/workspace/llama.cpp$ ./build/bin/llama-bench -m ../models/rwkv-6-world-7b/rwkv-6-world-7B-F16.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| rwkv6 7B F16                   |  14.45 GiB |     7.64 B | CUDA       |  99 |         pp512 |       1945.05 ± 3.87 |
| rwkv6 7B F16                   |  14.45 GiB |     7.64 B | CUDA       |  99 |         tg128 |         48.21 ± 0.07 |

build: 5555c0c1 (4310)
(py310) molly@lxc-ubuntu:~/workspace/llama.cpp$ ./build-test/bin/llama-bench -m ../models/rwkv-6-world-7b/rwkv-6-world-7B-F16.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| rwkv6 7B F16                   |  14.45 GiB |     7.64 B | CUDA       |  99 |         pp512 |       1946.02 ± 1.18 |
| rwkv6 7B F16                   |  14.45 GiB |     7.64 B | CUDA       |  99 |         tg128 |         48.29 ± 0.10 |

build: 60bbd4eb (4313)
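To put a number on "not much impact", the throughput figures from the four runs above can be averaged per build. This is only a rough sketch: it assumes `build-test` (60bbd4eb) is the modified build, which the thread does not state explicitly, and two runs per build is too few for any statistical claim.

```python
# Throughput figures (tokens/s) copied from the llama-bench tables above.
runs = {
    "5555c0c1": {"pp512": [1948.48, 1945.05], "tg128": [48.20, 48.21]},
    "60bbd4eb": {"pp512": [1949.13, 1946.02], "tg128": [48.24, 48.29]},
}

def mean(xs):
    return sum(xs) / len(xs)

def relative_change(test):
    """Relative difference of the mean throughput, 60bbd4eb versus 5555c0c1."""
    base = mean(runs["5555c0c1"][test])
    new = mean(runs["60bbd4eb"][test])
    return (new - base) / base

for test in ("pp512", "tg128"):
    print(f"{test}: {relative_change(test) * 100:+.3f}%")
```

Both differences come out as a small fraction of a percent, which is in line with the run-to-run spread reported in the tables.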

guoguo1314 · 2024-12-25T01:40:41Z

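Good morning! I'm back with a few questions:
1. ggml_mul needs the dimensions to be aligned. For example, if a.shape = [h_d, n_h, n_tokens], then b's shape must also be [h_d, n_h, n_tokens]. If the dimensions are swapped, e.g. b becomes [n_h, h_d, n_tokens], the multiplication no longer pairs corresponding dimensions.
In a ggml_tensor, ne[0] is the innermost dimension, i.e. the one that varies fastest. So the ggml_mul above would end up multiplying values along h_d with values along n_h.
Once the two input tensors have aligned dimensions, the operands of ggml_mul can be swapped, because it is an element-wise product.

2. Do n_embd, n_seqs and seq_tokens in llama.cpp correspond to embed_dim, batch and seqs_len in PyTorch?

3. For the model I am adapting, I wrote the k/v transformation like this:

k = ggml_reshape_3d(ctx, k, k->ne[0] / head_count, head_count, n_tokens);
v = ggml_reshape_3d(ctx, v, v->ne[0] / head_count, head_count, n_tokens);

because I believe wkv iterates in the order ntokens, heads, head_dim.
Also, why does llama.cpp tend to write dimensions in the order head_dim, heads, ntokens? That is the reverse of the usual batch_size, seqs_len, embed_dim in PyTorch.

Does anything above look wrong to you? And could you answer my questions? Thanks.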

MollySophia · 2024-12-25T02:02:10Z

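> Good morning!

Good morning!

> I'm back with a few questions: 1. ggml_mul needs the dimensions to be aligned. For example, if a.shape = [h_d, n_h, n_tokens], then b's shape must also be [h_d, n_h, n_tokens]. If the dimensions are swapped, e.g. b becomes [n_h, h_d, n_tokens], the multiplication no longer pairs corresponding dimensions.

Yes, it multiplies along corresponding dims. If a and b have different shapes, they can still be multiplied as long as they can be broadcast; see the ggml documentation or code for details.

> In a ggml_tensor, ne[0] is the innermost dimension, i.e. the one that varies fastest. So the ggml_mul above would end up multiplying values along h_d with values along n_h.
> Once the two input tensors have aligned dimensions, the operands of ggml_mul can be swapped, because it is an element-wise product.

Exactly.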
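The alignment point can be illustrated on flat buffers. This is a sketch in plain Python rather than ggml: `value` and `elementwise_mul` are hypothetical helpers, and h_d and n_h are set equal on purpose so that the swapped shape is still size-compatible. With unequal sizes, as far as I know, ggml would reject the operands because b could not be broadcast to a.

```python
h_d, n_h = 2, 2  # equal on purpose, so the swapped shape is still size-compatible

def value(d, h):
    """A distinct number per logical (head-dim, head) coordinate."""
    return 10 * h + d + 1

# Flat buffers for one token. Shape [h_d, n_h]: d is ne[0], so d varies fastest.
a = [value(d, h) for h in range(n_h) for d in range(h_d)]
b_aligned = [value(d, h) for h in range(n_h) for d in range(h_d)]
# Same logical data stored as [n_h, h_d]: h now varies fastest.
b_swapped = [value(d, h) for d in range(h_d) for h in range(n_h)]

def elementwise_mul(x, y):
    """Multiply position by position, as an element-wise op does on flat memory."""
    return [p * q for p, q in zip(x, y)]

expected = [v * v for v in a]  # each element times its true counterpart

print(elementwise_mul(a, b_aligned) == expected)                       # aligned layouts
print(elementwise_mul(a, b_swapped) == expected)                       # swapped layout
print(elementwise_mul(a, b_aligned) == elementwise_mul(b_aligned, a))  # operand order
```

The aligned product matches the expected result and does not depend on operand order, while the swapped layout pairs some elements with the wrong partner even though the element count is the same.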

> 2. Do n_embd, n_seqs and seq_tokens in llama.cpp correspond to embed_dim, batch and seqs_len in PyTorch?
>
> 3. For the model I am adapting, I wrote the k/v transformation like this:
> k = ggml_reshape_3d(ctx, k, k->ne[0] / head_count, head_count, n_tokens);
> v = ggml_reshape_3d(ctx, v, v->ne[0] / head_count, head_count, n_tokens);
> because I believe wkv iterates in the order ntokens, heads, head_dim. Also, why does llama.cpp tend to write dimensions in the order head_dim, heads, ntokens? That is the reverse of the usual batch_size, seqs_len, embed_dim in PyTorch.

I think it is because ggml shapes are written in reverse order. And ntokens equals batch_size * seqs_len, so PyTorch has [B, T, C] while ggml has [head_dim, head_count, ntokens] (head_dim * head_count = embed_dim).
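The "reversed shape" remark can be made concrete by comparing offsets. The sketch below uses two hypothetical helpers, one for a row-major [B, T, C] tensor (last dimension fastest) and one for a ggml-style shape (ne[0] fastest). It assumes tokens are flattened batch-major as b * T + t, which may not match how llama.cpp orders tokens in every batch.

```python
from itertools import product

def row_major_offset(index, shape):
    """Offset in a contiguous PyTorch/NumPy-style tensor (last dim fastest)."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset

def ggml_offset(index, ne):
    """Offset in a contiguous ggml-style tensor (ne[0] fastest)."""
    offset, stride = 0, 1
    for i, n in zip(index, ne):
        offset += i * stride
        stride *= n
    return offset

B, T, head_count, head_dim = 2, 3, 2, 4   # illustrative sizes
C = head_count * head_dim                 # embed_dim
n_tokens = B * T                          # assumption: tokens flattened as b * T + t

def layouts_agree():
    """True if [B, T, C] row-major and [head_dim, head_count, n_tokens] ggml-style
    place every element at the same offset."""
    for b, t, h, d in product(range(B), range(T), range(head_count), range(head_dim)):
        torch_off = row_major_offset((b, t, h * head_dim + d), (B, T, C))
        ggml_off = ggml_offset((d, h, b * T + t), (head_dim, head_count, n_tokens))
        if torch_off != ggml_off:
            return False
    return True

print(layouts_agree())
```

Under that assumption the two notations describe the same bytes; only the order in which the dimensions are listed differs.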

guoguo1314 · 2024-12-25T02:08:39Z

OK. I'll dig into it and come back with questions once I have some. Thanks!

guoguo1314 · 2024-12-25T09:44:24Z

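Hello. I have written the compute graph for my adapted model in llama.cpp. It compiles and runs inference, but the output is wrong. Running ./build/bin/llama-cli -m model/xxxx.gguf -p "you are" --no-context-shift keeps producing
you are are are are are are are are are are are are are are are are are are are are are are are
Also, when llama.cpp executes the forward pass of the compute graph, how can I print the output and shape of a specific layer? I need that to align with PyTorch and find out where I went wrong.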

MollySophia · 2024-12-26T03:53:39Z

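Sorry for the late reply.

> Also, when llama.cpp executes the forward pass of the compute graph, how can I print the output and shape of a specific layer? I need that to align with PyTorch and find out where I went wrong.

You can refer to this: ggerganov/ggml#655 (comment)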
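Once per-layer values have been dumped from both sides, the comparison itself can be scripted. The sketch below does not reproduce the linked comment. It assumes each tensor has already been exported as a flat list of floats in the same element order, and `compare` and `first_divergence` are hypothetical helper names.

```python
import math

def compare(name, ref, got, atol=1e-3, rtol=1e-3):
    """Print the max absolute error between two flat float lists.
    Returns True if every element is close within the tolerances."""
    if len(ref) != len(got):
        print(f"{name}: size mismatch {len(ref)} vs {len(got)}")
        return False
    worst = max((abs(r - g) for r, g in zip(ref, got)), default=0.0)
    ok = all(math.isclose(r, g, rel_tol=rtol, abs_tol=atol) for r, g in zip(ref, got))
    print(f"{name}: max_abs_err={worst:.6g} {'OK' if ok else 'MISMATCH'}")
    return ok

def first_divergence(layers):
    """layers: list of (name, ref_values, got_values) in execution order.
    Returns the name of the first layer that differs, or None."""
    for name, ref, got in layers:
        if not compare(name, ref, got):
            return name
    return None

# Made-up values, for illustration only.
dumps = [
    ("layer0.time_mix", [0.10, 0.20], [0.10, 0.20]),
    ("layer0.wkv", [1.00, 2.00], [1.00, 2.50]),
    ("layer0.ffn", [0.30, 0.40], [0.30, 0.40]),
]
print(first_divergence(dumps))
```

Walking the layers in execution order and stopping at the first mismatch narrows the search to one op, which is usually more useful than comparing only the final logits.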

guoguo1314 · 2024-12-26T05:42:14Z

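> Sorry for the late reply.

Hmm, all I can say is that you are not just a good person, you are a really good one.

> You can refer to this: ggerganov/ggml#655 (comment)

OK, I'll give it a try. Thanks!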
