sync : llama.cpp #939

ggerganov · 2024-08-27T18:50:10Z

No description provided.

…ronization overhead. (llama/8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.
  - Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
  - ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers seems to be sufficient, judging by the code, which either launches compute kernels or copies tensors.
* Fix small typo

---------

Co-authored-by: 0cc4m <[email protected]>

Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724

To access the above bug, you need to log in using one of the emails in https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5

Signed-off-by: David Korczynski <[email protected]>

* ggml : move rope type enum to ggml.h
  This commit moves the `llama_rope_type` enum from `llama.h` to `ggml.h` and changes its name to `ggml_rope_type`. The motivation for this change is to address the TODO in `llama.h` and use the enum in ggml.
  Note: This commit does not change the `mode` parameter to be of type `enum ggml_rope_type`. The name `mode` and its usage suggest that it might be more generic and possibly used as a bit field for multiple flags. Further investigation/discussion may be needed to determine if `mode` should be restricted to RoPE types.
* squash! ggml : move rope type enum to ggml.h
  This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from ggml.h, and brings back the llama_rope_type enum. I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is safe to remove it yet.
* squash! ggml : move rope type enum to ggml.h
  This commit removes the enum ggml_rope_type from ggml.h and replaces it with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to check if the mode is set to GPT-NeoX. The enum llama_rope_type has also been updated to reflect this change.
* squash! ggml : move rope type enum to ggml.h
  This commit contains a suggestion to enable the GGML_ROPE_TYPE_NEOX macro/define to be passed to the shader compiler.
* squash! ggml : move rope type enum to ggml.h
  This commit fixes the editorconfig-checker warnings.
* squash! ggml : move rope type enum to ggml.h
  Update comment for ggml_rope function.
* Revert "squash! ggml : move rope type enum to ggml.h"
  This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6.
* squash! ggml : move rope type enum to ggml.h
  Add GGML_ROPE_TYPE_NEOX to rope_common.comp.
* remove extra line

---------

Co-authored-by: slaren <[email protected]>
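The commit message above describes a define that is tested against `mode` as a flag. A minimal sketch of that pattern follows. The helper name `rope_is_neox` is hypothetical, and the value `2` is an assumption for illustration; the real definition lives in `ggml.h`.

```cpp
// Assumption: the numeric value is illustrative; consult ggml.h for the real define.
#define GGML_ROPE_TYPE_NEOX 2

// Hypothetical helper: true when the NeoX bit is set in `mode`.
// `mode` stays a plain int, so other bits may carry unrelated flags.
static bool rope_is_neox(int mode) {
    return (mode & GGML_ROPE_TYPE_NEOX) != 0;
}
```

Testing the bit with a mask rather than comparing for equality keeps the check valid if `mode` is later used to carry additional flags, which is the concern raised in the note above.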

* sycl: fix im2col overflow and sync with cuda
* sycl: fix convert overflow
* sycl: fix convert and dequantize
* sycl: fix ib in dmmv
* sycl: refine convert
* sycl: move downsample global_range into common
* test: add im2col and convert test cases
* test: make new cases only in sycl
* test: comment new test_cases for only local testing

---------

Signed-off-by: zhentaoyu <[email protected]>

* llama : advanced batch splits
  This includes equal-sequence-length batch splits which are useful to simplify recurrent model operators.
* llama : always make recurrent state slots contiguous
* ggml : simplify mamba operators
* llama : fix integer signedness mixing
* llama : logits_all has priority over batch->logits
  Otherwise, the server embeddings tests failed. This was likely an existing problem but was only detected here because of an additional assertion.
* llama : apply suggestions
  Co-authored-by: Georgi Gerganov <[email protected]>
* llama : fix t5 segfault
* llama : fix Mamba session save and restore
* llama : minor cosmetic changes
* llama : rename llama_reorder_outputs to llama_output_reorder
  Also move it closer to llama_output_reserve.
* llama : fix pooled embeddings when using batches with equal_seqs
* minor : add struct members for clarity
  ggml-ci
* llama : fix T5 segfault again
* llama : fix Mamba pooled embeddings with multiple sequences
  Until the pooled embeddings are refactored to allow splitting across ubatches for causal embeddings, recurrent models can only process a single sequence per ubatch when calculating pooled embeddings.
* llama : add llama_model_is_recurrent to simplify figuring that out
  This will make it easier to more cleanly support RWKV-v6 and Mamba-2.
* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <[email protected]>

mtavenrath and others added 20 commits August 27, 2024 21:34

cmake : remove unused option GGML_CURL (llama/9011)

84b060b

ggml : dynamic ggml_sched_max_splits based on graph_size (llama/9047)

893beb2

* ggml : Dynamic ggml_sched_max_splits based on graph_size
* Fixed and re-added debug code for causes

rpc : prevent crashes on invalid input (llama/9040)

9550007

Add more checks that prevent the RPC server from crashing when it receives invalid input from a client
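The kind of check this commit describes can be illustrated with a bounds test on a client-supplied address range. This is a sketch under stated assumptions, not the actual RPC server code: the function `range_in_buffer` and its parameters are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from the RPC server: returns true only if
// [data, data + size) lies inside [base, base + buf_size).
// Written with subtractions so that a huge `size` cannot wrap around.
static bool range_in_buffer(uint64_t base, uint64_t buf_size,
                            uint64_t data, uint64_t size) {
    if (data < base) {
        return false;               // starts before the buffer
    }
    const uint64_t offset = data - base;
    if (offset > buf_size) {
        return false;               // starts past the end of the buffer
    }
    return size <= buf_size - offset; // remaining room covers the request
}
```

A server would reject the request (rather than assert or dereference) when such a check fails, so that a malformed message from a client cannot bring the process down.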

rpc : print error message when failed to connect endpoint (llama/9042)

ad56a42

fallback mmvq (llama/9088)

0474811

* fallback mmvq to mul_mat
* mmvq in cuda path
* Update ggml/src/ggml-sycl.cpp

---------

Co-authored-by: Alberto Cabrera Pérez <[email protected]>

Add oneDNN primitive support (llama/9091)

d2ddfd0

* add onednn
* add sycl_f16
* add dnnl stream
* add engine map
* use dnnl for intel only
* use fp16fp16fp16
* update doc

Add a space to supress a cmake warning (llama/9133)

9300171

CPU/CUDA: Gemma 2 FlashAttention support (llama/8542)

e084b3d

* CPU/CUDA: Gemma 2 FlashAttention support
* apply logit_softcap to scale in kernel
* disable logit softcapping tests on Metal
* remove metal check
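For context, Gemma 2 soft-caps attention logits as `cap * tanh(x / cap)`. The sketch below shows that formula and one plausible reading of "apply logit_softcap to scale in kernel": the division by the cap is folded into the attention scale once, leaving only a `tanh` and a multiply per element. The function names are hypothetical, and the folding is an assumption about the commit rather than a copy of the kernel code.

```cpp
#include <cassert>
#include <cmath>

// Soft-caps a logit into the range [-cap, cap]: cap * tanh(x / cap).
static float softcap_logit(float x, float cap) {
    assert(cap > 0.0f);
    return cap * std::tanh(x / cap);
}

// Assumed folding: divide the attention scale by the cap once, so each
// element needs only tanh(qk * folded_scale) followed by a multiply by cap.
static float scaled_softcap(float qk, float scale, float cap) {
    assert(cap > 0.0f);
    const float folded_scale = scale / cap; // computed once, outside the inner loop
    return cap * std::tanh(qk * folded_scale);
}
```

Both functions compute the same value for `x = qk * scale`; the second form just moves one division out of the per-element path.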

metal : gemma2 flash attention support (llama/9159)

aca2c78

ggml : add SSM Metal kernels (llama/8546)

cc1ad6c

* ggml : add ggml_ssm_conv metal impl
* ggml : add ssm_scan metal impl

ggml-ci

metal : separate scale and mask from QKT in FA kernel (llama/9189)

4beb504

* metal : separate scale and mask from QKT in FA kernel
* metal : ne01 check no longer necessary
* metal : keep data in local memory

ggml : do not crash when quantizing q4_x_x with an imatrix (llama/9192)

44bc33d

sync : llama.cpp

a23fc97

sync : vulkan (skip) (llama/0)

234d153

ci : disable mnist test

b849c25

ggml-ci

ggerganov force-pushed the sync branch from 6e26bcb to b849c25 Compare August 27, 2024 18:57

ggerganov merged commit 28b7633 into master Aug 27, 2024
9 checks passed

ggerganov deleted the sync branch August 27, 2024 19:01
