Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync : llama.cpp #939

Merged
merged 20 commits into from
Aug 27, 2024
Merged

sync : llama.cpp #939

merged 20 commits into from
Aug 27, 2024

Conversation

ggerganov
Copy link
Owner

No description provided.

mtavenrath and others added 20 commits August 27, 2024 21:34
…ronization overhead. (llama/8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduce a full pipeline sync which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader read/writes and transfers seems to be sufficient looking at the code which either launches compute kernels or copies tensors.

* Fix small typo

---------

Co-authored-by: 0cc4m <[email protected]>
* ggml : move rope type enum to ggml.h

This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.

The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.

Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and back the llama_rope_type enum.

I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.

* squash! ggml : move rope type enum to ggml.h

This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.

* squash! ggml : move rope type enum to ggml.h

This commit contains a suggestion enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

Update comment for ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6.

* squash! ggml : move rope type enum to ggml.h

Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

---------

Co-authored-by: slaren <[email protected]>
* ggml : Dynamic ggml_sched_max_splits based on graph_size

* Fixed and readded debug code for causes
Add more checks which prevent RPC server from crashing if invalid input
is received from client
* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <[email protected]>

* sycl: fix convert overflow

Signed-off-by: zhentaoyu <[email protected]>

* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <[email protected]>

* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <[email protected]>

* sycl:refine convert

Signed-off-by: zhentaoyu <[email protected]>

* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <[email protected]>

* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <[email protected]>

* test: make new cases only in sycl

Signed-off-by: zhentaoyu <[email protected]>

* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <[email protected]>

---------

Signed-off-by: zhentaoyu <[email protected]>
* fallback mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <[email protected]>

---------

Co-authored-by: Alberto Cabrera Pérez <[email protected]>
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <[email protected]>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc
* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
* ggml : add ggml_ssm_conv metal impl

* ggml : add ssm_scan metal impl

ggml-ci
* metal : separate scale and mask from QKT in FA kernel

* metal : ne01 check no longer necessary

* metal : keep data in local memory
@ggerganov ggerganov merged commit 28b7633 into master Aug 27, 2024
9 checks passed
@ggerganov ggerganov deleted the sync branch August 27, 2024 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.