Replies: 1 comment
It seems that llama.cpp actually does a column-parallel approach, which is misleading since the option passed is `-sm row`.
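To make that distinction concrete, here is a minimal host-side sketch (toy code, not llama.cpp's implementation; all names and shapes below are made up) of the two ways `y = W·x` can be sharded across devices, and why only one of them needs an all-reduce:

```cpp
#include <cstdio>

// Sketch only (not llama.cpp code): the two ways to shard y = W*x over
// devices, for W stored row-major with shape [n_out x n_in].

// Shard along the *output* dimension (each device owns whole rows of W):
// every device computes complete entries of y for its slice, so the
// per-device results only need to be concatenated -- no summation.
// (In Megatron-style naming this is the "column-parallel" linear.)
void matvec_output_shard(const float *W_rows, const float *x,
                         float *y_slice, int rows, int n_in) {
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int k = 0; k < n_in; ++k) acc += W_rows[r * n_in + k] * x[k];
        y_slice[r] = acc;                 // final value, just gather
    }
}

// Shard along the *input* (contracted) dimension: every device produces a
// partial sum for *every* entry of y, so the shards must be summed
// (all-reduced) across devices afterwards.
// (In Megatron-style naming this is the "row-parallel" linear.)
void matvec_input_shard(const float *W_cols, const float *x_slice,
                        float *y_partial, int n_out, int cols) {
    for (int r = 0; r < n_out; ++r) {
        float acc = 0.0f;
        for (int k = 0; k < cols; ++k) acc += W_cols[r * cols + k] * x_slice[k];
        y_partial[r] = acc;               // partial value, needs all-reduce
    }
}

int main() {
    // Toy check with 2 "devices": W = [[1,2],[3,4]], x = [1,1], so y = [3,7].
    const float W[4] = {1, 2, 3, 4}, x[2] = {1, 1};

    // Output-dim shard: device 0 gets row 0, device 1 gets row 1.
    float y0[1], y1[1];
    matvec_output_shard(W + 0, x, y0, 1, 2);
    matvec_output_shard(W + 2, x, y1, 1, 2);
    printf("output shard: y = [%g, %g]\n", y0[0], y1[0]);

    // Input-dim shard: device 0 gets column 0, device 1 gets column 1.
    const float Wc0[2] = {1, 3}, Wc1[2] = {2, 4};
    float p0[2], p1[2], y[2];
    matvec_input_shard(Wc0, x + 0, p0, 2, 1);
    matvec_input_shard(Wc1, x + 1, p1, 2, 1);
    for (int i = 0; i < 2; ++i) y[i] = p0[i] + p1[i];   // the missing all-reduce
    printf("input shard:  y = [%g, %g]\n", y[0], y[1]); // same result: [3, 7]
    return 0;
}
```

If the weight is sharded along the output dimension of each matmul, every GPU's slice of `y` is already final, so there is nothing to sum across devices; the per-GPU output slices only need to be gathered, which is what a "column-parallel" layer amounts to.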
Hi, I am struggling to find where the partial results from the matrix multiplies are reduced together. My understanding is that when I use `-sm row`, a row-wise tensor-parallel approach is employed, where the result of every single matrix multiply needs to be all-reduced. However, I only really see a `warp_reduce_sum`, which I assume is used for the tiling that happens across threads in a warp on a single GPU, but not an operation between GPUs that reduces the whole matrix.
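In case it helps, the usual shuffle-based warp reduction that a name like `warp_reduce_sum` suggests looks roughly like the following (a generic sketch, not the actual ggml-cuda code); it only combines values across the 32 lanes of a single warp on one GPU, so it cannot be the place where results from different GPUs are combined:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Generic warp-level sum reduction (sketch, not the actual ggml-cuda code):
// each lane contributes one float, and after the shuffle loop every lane of
// the warp holds the sum of all 32 values. This reduces only across the
// threads of one warp on one GPU, never across devices.
__device__ float warp_reduce_sum_sketch(float x) {
    for (int offset = 16; offset > 0; offset >>= 1)
        x += __shfl_xor_sync(0xffffffff, x, offset);
    return x;
}

__global__ void demo_kernel(float *out) {
    // Each lane contributes its lane id (0..31); the warp sum is 496.
    float v = (float) (threadIdx.x & 31);
    v = warp_reduce_sum_sketch(v);
    if (threadIdx.x == 0) *out = v;
}

int main() {
    float *d_out, h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    demo_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %g\n", h_out);  // expected 496
    cudaFree(d_out);
    return 0;
}
```

A reduction between GPUs would instead have to show up as copies of the partial output buffers between devices followed by a summation, i.e. something orchestrated outside a single kernel like this.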