sync : llama.cpp #967

Merged
merged 15 commits into master from sync on Sep 24, 2024
Conversation

ggerganov
Owner

No description provided.

agray3 and others added 15 commits September 24, 2024 11:03
* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes ggerganov/llama.cpp#9451

* clear before resize
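
A minimal sketch of the kind of check this fix describes, assuming a hypothetical `saved_graph_props` record; the field names and the helper are illustrative, not ggml-cuda's actual structures:

```cpp
// Hypothetical illustration: before replaying a previously captured CUDA graph,
// compare the parameters that may legitimately change between otherwise identical
// graphs (here, a softmax scale) and force a re-capture when they differ.
struct saved_graph_props {
    int   n_nodes;        // number of nodes recorded at capture time (illustrative)
    float softmax_scale;  // kernel parameter that can change between evaluations
};

static bool cuda_graph_reusable(const saved_graph_props & saved,
                                const saved_graph_props & current) {
    // Any mismatch invalidates the saved graph; the caller should also clear the
    // recorded node/parameter arrays before re-capturing (and before resizing the
    // backing vectors) so stale entries are not reused.
    return saved.n_nodes == current.n_nodes &&
           saved.softmax_scale == current.softmax_scale;
}
```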
* ggml: CUDA unary op EXP

Signed-off-by: Molly Sophia <[email protected]>

* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <[email protected]>

---------

Signed-off-by: Molly Sophia <[email protected]>
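
For the EXP unary op, a minimal sketch of an element-wise CUDA kernel of this kind; the launch configuration and names are illustrative, not ggml's exact code:

```cpp
#include <cuda_runtime.h>

// Element-wise exponential: one thread per element.
static __global__ void exp_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = expf(x[i]);
}

// Illustrative host-side launcher: 256 threads per block, enough blocks to cover k elements.
static void exp_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (k + block_size - 1) / block_size;
    exp_f32<<<num_blocks, block_size, 0, stream>>>(x, dst, k);
}
```

For the rwkv_wkv op, a rough scalar reference of a WKV-style linear-attention recurrence, i.e. the computation the CUDA kernel parallelizes; the buffer names, layout, and per-token decay are assumptions for illustration, not ggml's API:

```cpp
// T tokens, C channels per head; state is C*C and y is T*C, both zero-initialized.
static void wkv_scalar_ref(int T, int C,
                           const float * r, const float * k, const float * v,
                           const float * time_first, const float * time_decay,
                           float * state, float * y) {
    for (int t = 0; t < T; ++t) {
        for (int i = 0; i < C; ++i) {
            for (int j = 0; j < C; ++j) {
                const float kv = k[t*C + i] * v[t*C + j];
                // output mixes the current token's contribution with the running state
                y[t*C + j] += r[t*C + i] * (time_first[i] * kv + state[i*C + j]);
                // decay the state, then accumulate the current k*v outer product
                state[i*C + j] = state[i*C + j] * time_decay[t*C + i] + kv;
            }
        }
    }
}
```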
…e Flash Attention on QY1 (MTT S80) (llama/9526)

* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <[email protected]>

* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <[email protected]>

* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <[email protected]>

* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
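
A hedged sketch of what such a mapping can look like when the shared CUDA code is built for MUSA; the MUBLAS_* names are assumed to mirror the cuBLAS enum, and this is not the exact set of aliases used in ggml:

```cpp
// Illustrative aliasing layer (assumed names): route cuBLAS types and enums to
// their muBLAS counterparts so the common GPU code path compiles unchanged on MUSA.
#if defined(GGML_USE_MUSA)
#define cublasOperation_t mublasOperation_t
#define CUBLAS_OP_N       MUBLAS_OP_N   // assumption: mirrors the cuBLAS name
#define CUBLAS_OP_T       MUBLAS_OP_T   // assumption: mirrors the cuBLAS name
#endif
```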
This reverts commit 50addec9a532a6518146ab837a85504850627316.
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit comments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <[email protected]>
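
The AVX512 kernel vectorizes the q4_0 x q8_0 block dot product over 8x8 tiles of blocks. As a point of reference, here is a scalar version of that per-block dot product; the function name and signature are illustrative, but the block layout matches ggml's q4_0/q8_0 formats (32 4-bit weights packed into 16 bytes with an implicit offset of 8, 32 int8 values, each block carrying its own scale):

```cpp
#include <cstdint>

// Scalar reference of one q4_0 x q8_0 block dot product (illustrative helper,
// not a ggml function). The low nibbles hold weights 0..15, the high nibbles 16..31.
static float dot_q4_0_q8_0_block(const uint8_t qs4[16], float d4,
                                 const int8_t  qs8[32], float d8) {
    int32_t sum = 0;
    for (int i = 0; i < 16; ++i) {
        const int x0 = (qs4[i] & 0x0F) - 8;  // low nibble, offset removed
        const int x1 = (qs4[i] >>   4) - 8;  // high nibble, offset removed
        sum += x0 * qs8[i] + x1 * qs8[i + 16];
    }
    return d4 * d8 * (float) sum;
}
```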
…lama/9598)

Make sure n_barrier and n_barrier_passed do not share a cache line, to avoid cache-line bouncing. This optimization shows performance improvements even for n_threads <= 8.

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write in the common case and just use a thread fence, as originally intended.
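
A minimal sketch of the layout change being described, assuming illustrative struct and field names rather than the exact ggml threadpool definitions:

```cpp
#include <atomic>

// Keep the two barrier counters on separate cache lines so that threads spinning
// on one counter do not keep invalidating the cache line holding the other.
struct barrier_counters {
    alignas(64) std::atomic<int> n_barrier{0};        // 64 = assumed cache-line size
    alignas(64) std::atomic<int> n_barrier_passed{0};
};
```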
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.

We're missing atomic_thread_fence() in MSVC builds when OpenMP is disabled.
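
An illustrative portability shim showing the kind of fallback this implies, assuming a hypothetical `ggml_thread_fence` wrapper; this is not the exact change made in ggml:

```cpp
// MSVC builds without OpenMP still need a full memory fence where the
// threadpool relies on release/acquire ordering.
#if defined(_MSC_VER) && !defined(__clang__)
    #include <windows.h>
    #define ggml_thread_fence() MemoryBarrier()
#else
    #include <atomic>
    #define ggml_thread_fence() std::atomic_thread_fence(std::memory_order_seq_cst)
#endif
```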
ggerganov merged commit 6cb634a into master on Sep 24, 2024. 9 checks passed.
ggerganov deleted the sync branch on September 24, 2024 at 10:04.