
sync : llama.cpp #939

Merged · 20 commits · Aug 27, 2024

Commits on Aug 27, 2024

  1. Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (llama/8943)
    
    * Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.
    
    - Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
    - ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers appears to be sufficient, judging from the code paths that either launch compute kernels or copy tensors.
    
    * Fix small typo
    
    ---------
    
    Co-authored-by: 0cc4m <[email protected]>
    2 people authored and ggerganov committed Aug 27, 2024
    Commit: d0f3a0e
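    A minimal C++ sketch of the first point above: per-call temporary std::vector allocations versus reusing one scratch buffer. The function names and sizes are made up for illustration; this is not the Vulkan backend code itself.

    ```cpp
    #include <cstdio>
    #include <vector>

    // Before: a fresh std::vector is allocated on every call, which shows up
    // in a sampling profiler when the function runs once per graph node.
    static int submit_alloc_each_call(int n) {
        std::vector<int> offsets;
        offsets.reserve(n);
        for (int i = 0; i < n; ++i) offsets.push_back(i * 4);
        return (int) offsets.size();
    }

    // After: reuse one buffer across calls; clear() keeps the capacity,
    // so steady-state calls do no heap allocation at all.
    static int submit_reuse_buffer(std::vector<int> & scratch, int n) {
        scratch.clear();
        for (int i = 0; i < n; ++i) scratch.push_back(i * 4);
        return (int) scratch.size();
    }

    int main() {
        std::vector<int> scratch;
        for (int iter = 0; iter < 3; ++iter) {
            submit_alloc_each_call(1024);
            submit_reuse_buffer(scratch, 1024);
        }
        printf("scratch capacity after reuse: %zu\n", scratch.capacity());
        return 0;
    }
    ```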
  2. ggml: fix div-by-zero (llama/9003)

    Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724
    
    To access the above bug you need to log in using one of the
    emails listed in
    https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5
    
    Signed-off-by: David Korczynski <[email protected]>
    DavidKorczynski authored and ggerganov committed Aug 27, 2024
    Commit: 5de81c9
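    The class of bug being fixed is a division by zero reached through fuzzed input. A hypothetical sketch of the usual shape of such a guard (the actual guarded expression in ggml may differ):

    ```cpp
    #include <cstdio>

    // Hypothetical example: normalizing a row by its sum. If the sum can be
    // zero (e.g. an all-zero row from fuzzed input), the division produces
    // inf/nan, so guard the divisor explicitly.
    static void normalize_row(float * x, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) sum += x[i];
        const float scale = (sum != 0.0f) ? 1.0f / sum : 0.0f; // div-by-zero guard
        for (int i = 0; i < n; ++i) x[i] *= scale;
    }

    int main() {
        float zeros[4] = {0, 0, 0, 0};
        normalize_row(zeros, 4); // must not blow up on the all-zero row
        printf("%f %f %f %f\n", zeros[0], zeros[1], zeros[2], zeros[3]);
        return 0;
    }
    ```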
  3. ggml : move rope type enum to ggml.h (llama/8949)

    * ggml : move rope type enum to ggml.h
    
    This commit moves the `llama_rope_type` enum from `llama.h` to
    `ggml.h` and changes its name to `ggml_rope_type`.
    
    The motivation for this change is to address the TODO in `llama.h` and
    use the enum in ggml.
    
    Note: This commit does not change the `mode` parameter to be of type
    `enum ggml_rope_type`. The name `mode` and its usage suggest that it
    might be more generic and possibly used as a bit field for multiple
    flags. Further investigation/discussion may be needed to determine
    if `mode` should be restricted to RoPE types.
    
    * squash! ggml : move rope type enum to ggml.h
    
    This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
    ggml.h, and moves them back to the llama_rope_type enum.
    
    I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
    safe to remove it yet.
    
    * squash! ggml : move rope type enum to ggml.h
    
    This commit removes the enum ggml_rope_type from ggml.h and replaces it
    with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
    check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
    been updated to reflect this change.
    
    * squash! ggml : move rope type enum to ggml.h
    
    This commit applies a suggestion enabling the GGML_ROPE_TYPE_NEOX
    macro/define to be passed to the shader compiler.
    
    * squash! ggml : move rope type enum to ggml.h
    
    This commit fixes the editorconfig-checker warnings.
    
    * squash! ggml : move rope type enum to ggml.h
    
    Update comment for ggml_rope function.
    
    * Revert "squash! ggml : move rope type enum to ggml.h"
    
    This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6.
    
    * squash! ggml : move rope type enum to ggml.h
    
    Add GGML_ROPE_TYPE_NEOX to rope_common.comp.
    
    * remove extra line
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    2 people authored and ggerganov committed Aug 27, 2024
    Commit: 3210933
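    The end state described above is a define rather than an enum, checked as a bit in the `mode` argument. A small illustrative sketch of that usage; the define's value mirrors ggml.h at the time but should be treated as an example rather than the source:

    ```cpp
    #include <cstdio>

    // The define that replaced the enum (value as in ggml.h at the time).
    #define GGML_ROPE_TYPE_NEOX 2

    // `mode` stays a plain int because it may act as a bit field, which is
    // exactly why the commit did not change its type to an enum.
    static void rope_dispatch(int mode) {
        const bool is_neox = (mode & GGML_ROPE_TYPE_NEOX) != 0;
        printf("mode=%d -> %s rope\n", mode, is_neox ? "GPT-NeoX style" : "normal");
    }

    int main() {
        rope_dispatch(0);                   // default rotation
        rope_dispatch(GGML_ROPE_TYPE_NEOX); // NeoX-style rotation
        return 0;
    }
    ```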
  4. Commit 84b060b
  5. ggml : dynamic ggml_sched_max_splits based on graph_size (llama/9047)

    * ggml : Dynamic ggml_sched_max_splits based on graph_size
    
    * Fixed and re-added the debug code for `causes`
    nicoboss authored and ggerganov committed Aug 27, 2024
    Commit: 893beb2
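    A hedged sketch of the idea: size the scheduler's split storage from the measured graph size instead of a fixed compile-time maximum. The factor used below is an assumption for the sketch, not the formula used in ggml-backend.

    ```cpp
    #include <cstdio>
    #include <vector>

    // Illustrative stand-in for a scheduler split descriptor.
    struct sched_split { int i_start; int i_end; };

    // Previously a fixed GGML_SCHED_MAX_SPLITS-style constant capped the number
    // of splits regardless of how large the graph was. Deriving the capacity
    // from graph_size removes that cap; the *2 bound is assumed, not exact.
    static std::vector<sched_split> alloc_splits(int graph_size) {
        const int max_splits = graph_size * 2;
        return std::vector<sched_split>(max_splits);
    }

    int main() {
        const auto splits = alloc_splits(8192);
        printf("reserved %zu split slots\n", splits.size());
        return 0;
    }
    ```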
  6. rpc : prevent crashes on invalid input (llama/9040)

    Add more checks to prevent the RPC server from crashing when invalid
    input is received from a client
    rgerganov authored and ggerganov committed Aug 27, 2024
    Commit: 9550007
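    A generic sketch of the kind of check meant here: validate sizes and ranges received over the wire before trusting them. The message layout is hypothetical, not the actual rpc-server protocol.

    ```cpp
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Hypothetical wire format, for illustration only.
    struct tensor_msg {
        uint64_t offset;
        uint64_t size;
    };

    // Reject messages that are truncated or that describe out-of-range regions,
    // instead of indexing into the buffer and crashing the server.
    static bool deserialize_tensor(const std::vector<uint8_t> & payload,
                                   uint64_t buffer_size, tensor_msg & out) {
        if (payload.size() < sizeof(tensor_msg)) return false;     // truncated input
        memcpy(&out, payload.data(), sizeof(tensor_msg));
        if (out.size > buffer_size) return false;                  // size too large
        if (out.offset > buffer_size - out.size) return false;     // overflow-safe range check
        return true;
    }

    int main() {
        std::vector<uint8_t> bogus(3, 0xff); // too short: must be rejected, not crash
        tensor_msg msg;
        printf("accepted: %d\n", deserialize_tensor(bogus, 1024, msg));
        return 0;
    }
    ```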
  7. Commit ad56a42
  8. Fix SYCL im2col and convert Overflow with Large Dims (llama/9052)

    * sycl: fix im2col overflow and sync with cuda
    
    Signed-off-by: zhentaoyu <[email protected]>
    
    * sycl: fix convert overflow
    
    Signed-off-by: zhentaoyu <[email protected]>
    
    * sycl: fix convert and dequantize
    
    Signed-off-by: zhentaoyu <[email protected]>
    
    * sycl: fix ib in dmmv
    
    Signed-off-by: zhentaoyu <[email protected]>
    
    * sycl: refine convert
    
    Signed-off-by: zhentaoyu <[email protected]>
    
    * sycl: move downsample global_range into common
    
    Signed-off-by: zhentaoyu <[email protected]>
    
    * test: add im2col and convert test cases
    
    Signed-off-by: zhentaoyu <[email protected]>
    
    * test: make new cases only in sycl
    
    Signed-off-by: zhentaoyu <[email protected]>
    
    * test: comment new test_cases for only local testing
    
    Signed-off-by: zhentaoyu <[email protected]>
    
    ---------
    
    Signed-off-by: zhentaoyu <[email protected]>
    zhentaoyu authored and ggerganov committed Aug 27, 2024
    Commit: e60fc00
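    The overflow class addressed by the im2col fix: flattened-index arithmetic done in 32 bits wraps once the dimensions get large enough. A standalone sketch with made-up shapes; the real fix applies the same 64-bit indexing inside the SYCL kernels.

    ```cpp
    #include <cstdint>
    #include <cstdio>

    // With large dims the product of sizes no longer fits in 32 bits, so the
    // flattened index must be computed in int64_t. Shapes here are illustrative.
    int main() {
        const int64_t OW = 4096, OH = 4096, KW = 16, KH = 16, C = 64;

        const int64_t last  = OW*OH - 1;                       // last output position
        const int64_t idx64 = last * KW * KH * C;              // correct 64-bit index
        const int32_t idx32 = (int32_t) idx64;                 // truncated/wrapped value

        printf("32-bit index: %d\n", idx32);
        printf("64-bit index: %lld\n", (long long) idx64);
        return 0;
    }
    ```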
  9. fallback mmvq (llama/9088)

    * fallback mmvq to mul_mat
    
    * mmvq in cuda path
    
    * Update ggml/src/ggml-sycl.cpp
    
    Co-authored-by: Alberto Cabrera Pérez <[email protected]>
    
    ---------
    
    Co-authored-by: Alberto Cabrera Pérez <[email protected]>
    2 people authored and ggerganov committed Aug 27, 2024
    Commit: 0474811
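    A sketch of the dispatch pattern the title describes: prefer the specialized quantized mat-vec (mmvq) path and fall back to the general mul_mat path when it does not apply. The capability check and function names are hypothetical.

    ```cpp
    #include <cstdio>

    // Hypothetical capability check: mmvq covers the quantized mat*vec case.
    static bool mmvq_supported(int ncols_dst, bool quantized) {
        return quantized && ncols_dst == 1;
    }

    static void mul_mat_vec_q(void) { printf("mmvq path\n");    }
    static void mul_mat(void)       { printf("mul_mat path\n"); }

    // Dispatch: take the fast path when possible, otherwise fall back
    // to the general kernel rather than failing.
    static void dispatch(int ncols_dst, bool quantized) {
        if (mmvq_supported(ncols_dst, quantized)) {
            mul_mat_vec_q();
        } else {
            mul_mat();
        }
    }

    int main() {
        dispatch(1, true);   // vector case: specialized kernel
        dispatch(8, true);   // batched case: generic fallback
        return 0;
    }
    ```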
  10. llama : simplify Mamba with advanced batch splits (llama/8526)

    * llama : advanced batch splits
    
    This includes equal-sequence-length batch splits which are useful
    to simplify recurrent model operators.
    
    * llama : always make recurrent state slots contiguous
    
    * ggml : simplify mamba operators
    
    * llama : fix integer signedness mixing
    
    * llama : logits_all has priority over batch->logits
    
    Otherwise, the server embeddings tests failed.
    This was likely an existing problem but was only detected here
    because of an additional assertion.
    
    * llama : apply suggestions
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * llama : fix t5 segfault
    
    * llama : fix Mamba session save and restore
    
    * llama : minor cosmetic changes
    
    * llama : rename llama_reorder_outputs to llama_output_reorder
    
    Also move it closer to llama_output_reserve.
    
    * llama : fix pooled embeddings when using batches with equal_seqs
    
    * minor : add struct members for clarity
    
    ggml-ci
    
    * llama : fix T5 segfault again
    
    * llama : fix Mamba pooled embeddings with multiple sequences
    
    Until the pooled embeddings are refactored to allow splitting
    across ubatches for causal embeddings,
    recurrent models can only process a single sequence per ubatch
    when calculating pooled embeddings.
    
    * llama : add llama_model_is_recurrent to simplify figuring that out
    
    This will make it easier to more cleanly support RWKV-v6 and Mamba-2.
    
    * llama : fix simple splits when the batch contains embeddings
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    compilade and ggerganov committed Aug 27, 2024
    Commit: 0781ca2
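    A heavily simplified sketch of an equal-sequence-length split: each ubatch takes the same number of tokens (here, exactly one) from every sequence it contains, which is what lets recurrent operators assume uniform sequence lengths within a ubatch. This illustrates the idea only and is not the llama.cpp batch-splitting code.

    ```cpp
    #include <cstdio>
    #include <map>
    #include <utility>
    #include <vector>

    // Simplified model of a batch: each token belongs to one sequence.
    struct token_info { int id; int seq_id; };

    // Equal-length split (simplified): every ubatch takes at most one token from
    // each sequence, so all sequences inside a ubatch advance by the same length.
    static std::vector<std::vector<token_info>> split_equal(const std::vector<token_info> & batch) {
        std::map<int, std::vector<token_info>> per_seq;
        for (const auto & t : batch) per_seq[t.seq_id].push_back(t);

        std::vector<std::vector<token_info>> ubatches;
        for (size_t step = 0; ; ++step) {
            std::vector<token_info> ub;
            for (const auto & kv : per_seq) {
                if (step < kv.second.size()) ub.push_back(kv.second[step]);
            }
            if (ub.empty()) break;
            ubatches.push_back(std::move(ub));
        }
        return ubatches;
    }

    int main() {
        // 3 tokens for seq 0, 2 for seq 1, 1 for seq 2
        std::vector<token_info> batch = {{1,0},{2,0},{3,0},{10,1},{11,1},{20,2}};
        const auto ubatches = split_equal(batch);
        for (size_t i = 0; i < ubatches.size(); ++i) {
            printf("ubatch %zu: %zu tokens\n", i, ubatches[i].size());
        }
        return 0;
    }
    ```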
  11. Add oneDNN primitive support (llama/9091)

    * add onednn
    
    * add sycl_f16
    
    * add dnnl stream
    
    * add engine map
    
    * use dnnl for intel only
    
    * use fp16fp16fp16
    
    * update doc
    luoyu-intel authored and ggerganov committed Aug 27, 2024
    Commit: d2ddfd0
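    A hedged sketch of the "engine map" and "dnnl stream" bullets: cache one dnnl::engine per device and create streams against it. Requires oneDNN to build; the map key and engine kind below are assumptions, not the ggml-sycl implementation.

    ```cpp
    #include <cstdio>
    #include <map>
    #include <dnnl.hpp>

    int main() {
        // Assumed layout: one engine cached per device index.
        std::map<int, dnnl::engine> engines;
        const int device = 0;

        auto it = engines.find(device);
        if (it == engines.end()) {
            // CPU engine used here so the sketch runs anywhere; the backend
            // would select a GPU engine for Intel devices.
            it = engines.emplace(device, dnnl::engine(dnnl::engine::kind::cpu, device)).first;
        }

        dnnl::stream strm(it->second); // stream bound to the cached engine
        strm.wait();

        printf("engine kind: %d\n", (int) it->second.get_kind());
        return 0;
    }
    ```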
  12. Commit 9300171
  13. CPU/CUDA: Gemma 2 FlashAttention support (llama/8542)

    * CPU/CUDA: Gemma 2 FlashAttention support
    
    * apply logit_softcap to scale in kernel
    
    * disable logit softcapping tests on Metal
    
    * remove metal check
    JohannesGaessler authored and ggerganov committed Aug 27, 2024
    Commit: e084b3d
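    The "apply logit_softcap to scale in kernel" bullet refers to folding the softcap divisor into the Q*K^T scale, so the kernel only adds a tanh and a multiply per logit. A numeric sketch of that algebra with illustrative values:

    ```cpp
    #include <cmath>
    #include <cstdio>

    // Gemma 2 style attention logit soft-capping: s -> cap * tanh(s / cap).
    int main() {
        const float qk      = 37.5f;   // illustrative raw Q*K^T value
        const float scale   = 0.125f;  // 1/sqrt(head_dim), illustrative
        const float softcap = 50.0f;   // attention softcap value, illustrative

        // Straightforward form: scale first, then softcap the logit.
        const float s_direct = softcap * std::tanh((scale * qk) / softcap);

        // "Applied to scale" form: fold 1/softcap into the scale up front,
        // leaving one tanh and one multiply per logit in the kernel.
        const float scale_folded = scale / softcap;
        const float s_folded = softcap * std::tanh(scale_folded * qk);

        printf("direct: %f  folded: %f\n", s_direct, s_folded);
        return 0;
    }
    ```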
  14. Commit aca2c78
  15. ggml : add SSM Metal kernels (llama/8546)

    * ggml : add ggml_ssm_conv metal impl
    
    * ggml : add ssm_scan metal impl
    
    ggml-ci
    ggerganov committed Aug 27, 2024
    Commit: cc1ad6c
  16. metal : separate scale and mask from QKT in FA kernel (llama/9189)

    * metal : separate scale and mask from QKT in FA kernel
    
    * metal : ne01 check no longer necessary
    
    * metal : keep data in local memory
    ggerganov committed Aug 27, 2024
    Commit: 4beb504
  17. Commit 44bc33d
  18. sync : llama.cpp

    ggerganov committed Aug 27, 2024
    Commit: a23fc97
  19. Commit 234d153
  20. ci : disable mnist test

    ggml-ci
    ggerganov committed Aug 27, 2024
    Commit: b849c25