sync : llama.cpp #967

Merged
merged 15 commits into master from sync on Sep 24, 2024
Conversation

ggerganov
Owner

No description provided.

agray3 and others added 15 commits September 24, 2024 11:03
* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes ggerganov/llama.cpp#9451

* clear before resize
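
A minimal sketch of the kind of check this fix describes, assuming a hypothetical `saved_graph_props` record; the field names and the helper are illustrative, not ggml-cuda's actual structures:

```cpp
// Hypothetical illustration: before replaying a previously captured CUDA graph,
// compare the parameters that may legitimately change between otherwise identical
// graphs (here, a softmax scale) and force a re-capture when they differ.
struct saved_graph_props {
    int   n_nodes;        // number of nodes recorded at capture time (illustrative)
    float softmax_scale;  // kernel parameter that can change between evaluations
};

static bool cuda_graph_reusable(const saved_graph_props & saved,
                                const saved_graph_props & current) {
    // Any mismatch invalidates the saved graph; the caller should also clear the
    // recorded node/parameter arrays before re-capturing (and before resizing the
    // backing vectors) so stale entries are not reused.
    return saved.n_nodes == current.n_nodes &&
           saved.softmax_scale == current.softmax_scale;
}
```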
* ggml: CUDA unary op EXP

Signed-off-by: Molly Sophia <[email protected]>

* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <[email protected]>

---------

Signed-off-by: Molly Sophia <[email protected]>
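
For the EXP unary op, a minimal sketch of an element-wise CUDA kernel of this kind; the launch configuration and names are illustrative, not ggml's exact code:

```cpp
#include <cuda_runtime.h>

// Element-wise exponential: one thread per element.
static __global__ void exp_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = expf(x[i]);
}

// Illustrative host-side launcher: 256 threads per block, enough blocks to cover k elements.
static void exp_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (k + block_size - 1) / block_size;
    exp_f32<<<num_blocks, block_size, 0, stream>>>(x, dst, k);
}
```

For the rwkv_wkv op, a rough scalar reference of a WKV-style linear-attention recurrence, i.e. the computation the CUDA kernel parallelizes; the buffer names, layout, and per-token decay are assumptions for illustration, not ggml's API:

```cpp
// T tokens, C channels per head; state is C*C and y is T*C, both zero-initialized.
static void wkv_scalar_ref(int T, int C,
                           const float * r, const float * k, const float * v,
                           const float * time_first, const float * time_decay,
                           float * state, float * y) {
    for (int t = 0; t < T; ++t) {
        for (int i = 0; i < C; ++i) {
            for (int j = 0; j < C; ++j) {
                const float kv = k[t*C + i] * v[t*C + j];
                // output mixes the current token's contribution with the running state
                y[t*C + j] += r[t*C + i] * (time_first[i] * kv + state[i*C + j]);
                // decay the state, then accumulate the current k*v outer product
                state[i*C + j] = state[i*C + j] * time_decay[t*C + i] + kv;
            }
        }
    }
}
```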
…e Flash Attention on QY1 (MTT S80) (llama/9526)

* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <[email protected]>

* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <[email protected]>

* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <[email protected]>

* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
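
A hedged sketch of what such a mapping can look like when the shared CUDA code is built for MUSA; the MUBLAS_* names are assumed to mirror the cuBLAS enum, and this is not the exact set of aliases used in ggml:

```cpp
// Illustrative aliasing layer (assumed names): route cuBLAS types and enums to
// their muBLAS counterparts so the common GPU code path compiles unchanged on MUSA.
#if defined(GGML_USE_MUSA)
#define cublasOperation_t mublasOperation_t
#define CUBLAS_OP_N       MUBLAS_OP_N   // assumption: mirrors the cuBLAS name
#define CUBLAS_OP_T       MUBLAS_OP_T   // assumption: mirrors the cuBLAS name
#endif
```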
This reverts commit 50addec9a532a6518146ab837a85504850627316.
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit comments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <[email protected]>
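
The AVX512 kernel vectorizes the q4_0 x q8_0 block dot product over 8x8 tiles of blocks. As a point of reference, here is a scalar version of that per-block dot product; the function name and signature are illustrative, but the block layout matches ggml's q4_0/q8_0 formats (32 4-bit weights packed into 16 bytes with an implicit offset of 8, 32 int8 values, each block carrying its own scale):

```cpp
#include <cstdint>

// Scalar reference of one q4_0 x q8_0 block dot product (illustrative helper,
// not a ggml function). The low nibbles hold weights 0..15, the high nibbles 16..31.
static float dot_q4_0_q8_0_block(const uint8_t qs4[16], float d4,
                                 const int8_t  qs8[32], float d8) {
    int32_t sum = 0;
    for (int i = 0; i < 16; ++i) {
        const int x0 = (qs4[i] & 0x0F) - 8;  // low nibble, offset removed
        const int x1 = (qs4[i] >>   4) - 8;  // high nibble, offset removed
        sum += x0 * qs8[i] + x1 * qs8[i + 16];
    }
    return d4 * d8 * (float) sum;
}
```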
…lama/9598)

Make sure n_barrier and n_barrier_passed do not share a cache line, to avoid cache-line bouncing. This optimization shows performance improvements even for n_threads <= 8.

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write in the common case and just use a thread fence, as originally intended.
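
A minimal sketch of the layout change being described, assuming illustrative struct and field names rather than the exact ggml threadpool definitions:

```cpp
#include <atomic>

// Keep the two barrier counters on separate cache lines so that threads spinning
// on one counter do not keep invalidating the cache line holding the other.
struct barrier_counters {
    alignas(64) std::atomic<int> n_barrier{0};        // 64 = assumed cache-line size
    alignas(64) std::atomic<int> n_barrier_passed{0};
};
```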
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.

We're missing atomic_thread_fence() in MSVC builds when OpenMP is disabled.
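
An illustrative portability shim showing the kind of fallback this implies, assuming a hypothetical `ggml_thread_fence` wrapper; this is not the exact change made in ggml:

```cpp
// MSVC builds without OpenMP still need a full memory fence where the
// threadpool relies on release/acquire ordering.
#if defined(_MSC_VER) && !defined(__clang__)
    #include <windows.h>
    #define ggml_thread_fence() MemoryBarrier()
#else
    #include <atomic>
    #define ggml_thread_fence() std::atomic_thread_fence(std::memory_order_seq_cst)
#endif
```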
ggerganov merged commit 6cb634a into master on Sep 24, 2024. 9 checks passed.
ggerganov deleted the sync branch on September 24, 2024 at 10:04.