Releases: Septa2112/llama.cpp
b3804
musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526)
* mtgpu: add mp_21 support
* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas
* mtgpu: enable unified memory
* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <[email protected]>
b3782
server : match OAI structured output response (#9527)
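As a hedged illustration of what "OAI structured output" means here: the request below is in the OpenAI `response_format: json_schema` shape that the server's `/v1/chat/completions` endpoint accepts; the model name and schema are illustrative, not taken from the release.

```python
import json

# Illustrative request body (schema and model name are assumptions, not from
# the release): asks the server to constrain output to a JSON schema, and the
# server now answers in the OpenAI-compatible structured-output shape.
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Name one city as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city",
            "schema": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
        },
    },
}

# Send as the POST body with Content-Type: application/json
body = json.dumps(payload)
```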
b3728
CUDA: fix --split-mode row race condition (#9413)
b3651
llama : support RWKV v6 models (#8980)
* convert_hf_to_gguf: Add support for RWKV v6
* Add RWKV tokenization
* Fix build
* Do not use special tokens when matching in RWKV tokenizer
* Fix model loading
* Add (broken) placeholder graph builder for RWKV
* Add workaround for kv cache
* Add logits conversion to rwkv5
* Add rwkv5 layer norms
* Add time mix KVRG & correct merge mistake
* Add remaining time mix parameters
* Add time mix output loading
* Add placeholder llm_build_time_mix
* Load more tensors for rwkv v6
* Fix rwkv tokenizer
* ggml: Add unary operator Exp
* RWKV v6 graph building
* Add `rescale_every_n_layers` parameter
* Add `wkv.head_size` key for RWKV so it doesn't reuse Mamba ssm parameters
* Fix offloading layers to CUDA
* Fix parallel inferencing for RWKV
* Remove trailing whitespaces
* build_rwkv: Avoid using inplace operations
* convert_hf_to_gguf: rwkv: Avoid using `eval`
* convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually
* Update convert_hf_to_gguf.py
* ggml: Add backward computation for unary op `exp`
* Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV
* build_rwkv6: Simplify graph
* llama: rwkv6: Detect model.type
* llama: rwkv6: Fix tensor loading for 7B/14B models
* llama: rwkv6: Fix group_norm assertion failure with Metal
* llama: rwkv6: Clean up
* llama: rwkv6: Add quantization tensor exclusion
* llama: rwkv6: Use the new advanced batch splits
* llama: rwkv6: Use `ggml_norm` instead of `ggml_group_norm`
* llama: rwkv6: Apply code style and misc changes
* converter: Use class name `Rwkv6Model`
* llama: rwkv6: Make use of key `feed_forward_length`
* llama: rwkv6: Add kv `time_mix_extra_dim` and `time_decay_extra_dim`
* converter: Match `new_name` instead of `name` for float32 explicit tensors
* llama: rwkv6: Keep `time_mix_w1/w2` as F32
* llama: rwkv6: Remove unused nodes
* llama: rwkv6: Apply code format changes
* llama: rwkv6: Add lora for some supported tensors (currently att.key/receptance/value/gate/output, ffn.receptance/key/value, as well as head.weight)
* rwkv : speed-up tokenization using trie
* minor : style + indentation
* llama: rwkv6: Avoid division by zero
* ggml: rwkv_wkv: Avoid copying the state

Signed-off-by: Molly Sophia <[email protected]>
Co-authored-by: Layl Bongers <[email protected]>
Co-authored-by: compilade <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
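The "keep `time_mix_w1/w2` as F32" and "match `new_name` instead of `name`" items above amount to a name-based quantization-exclusion rule in the converter. A minimal sketch of that idea, with the pattern list and function name assumed for illustration (not the actual converter code):

```python
import re

# Hypothetical exclusion patterns, inferred from the changelog entries above:
# tensors whose remapped name matches are kept in F32 instead of quantized.
F32_NAME_PATTERNS = [r"time_mix_w1", r"time_mix_w2"]

def keep_in_f32(new_name: str) -> bool:
    """Return True if this tensor should stay in F32 rather than be quantized.

    Matches against the tensor's remapped new_name, per the changelog.
    """
    return any(re.search(p, new_name) for p in F32_NAME_PATTERNS)

print(keep_in_f32("blk.0.time_mix_w1.weight"))   # True
print(keep_in_f32("blk.0.attn_output.weight"))   # False
```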
b3614
llama : simplify Mamba with advanced batch splits (#8526)
* llama : advanced batch splits. This includes equal-sequence-length batch splits, which are useful to simplify recurrent model operators.
* llama : always make recurrent state slots contiguous
* ggml : simplify mamba operators
* llama : fix integer signedness mixing
* llama : logits_all has priority over batch->logits. Otherwise, the server embeddings tests failed; this was likely an existing problem, but it was only detected here because of an additional assertion.
* llama : apply suggestions
* llama : fix t5 segfault
* llama : fix Mamba session save and restore
* llama : minor cosmetic changes
* llama : rename llama_reorder_outputs to llama_output_reorder, and move it closer to llama_output_reserve
* llama : fix pooled embeddings when using batches with equal_seqs
* minor : add struct members for clarity
* llama : fix T5 segfault again
* llama : fix Mamba pooled embeddings with multiple sequences. Until the pooled embeddings are refactored to allow splitting across ubatches for causal embeddings, recurrent models can only process a single sequence per ubatch when calculating pooled embeddings.
* llama : add llama_model_is_recurrent to simplify figuring that out. This will make it easier to more cleanly support RWKV-v6 and Mamba-2.
* llama : fix simple splits when the batch contains embeddings

Co-authored-by: Georgi Gerganov <[email protected]>
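The "equal-sequence-length" split mentioned above can be pictured as taking the same number of tokens from every live sequence per ubatch, so each ubatch is rectangular across sequences. The sketch below (one token per sequence per ubatch) is a simplified illustration of that idea, not the llama.cpp implementation:

```python
from collections import defaultdict

def equal_seq_ubatches(batch):
    """Split a batch of (seq_id, token) pairs into equal-length ubatches.

    Each ubatch takes one token from every sequence that still has tokens
    left, so within a ubatch all sequences contribute the same length.
    """
    queues = defaultdict(list)
    for seq_id, token in batch:
        queues[seq_id].append(token)
    ubatches = []
    while any(queues.values()):
        ubatches.append({s: q.pop(0) for s, q in queues.items() if q})
    return ubatches

print(equal_seq_ubatches([(0, "a"), (0, "b"), (1, "x")]))
# [{0: 'a', 1: 'x'}, {0: 'b'}]
```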
b3613
server : support reading arguments from environment variables (#9105)
* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var
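As a hedged sketch of what this change enables: environment variables in a `LLAMA_ARG_*` style can stand in for CLI flags when launching the server. The variable names and mapping below are assumptions for illustration; check the server README updated in this change for the actual list.

```python
import os

# Assumed env-var-to-flag mapping (illustrative; not the server's real table).
ENV_TO_FLAG = {
    "LLAMA_ARG_MODEL": "--model",
    "LLAMA_ARG_CTX_SIZE": "--ctx-size",
}

def args_from_env(env=os.environ):
    """Translate recognized environment variables into a CLI argument list."""
    args = []
    for var, flag in ENV_TO_FLAG.items():
        if var in env:
            args += [flag, env[var]]
    return args

print(args_from_env({"LLAMA_ARG_MODEL": "model.gguf"}))
# ['--model', 'model.gguf']
```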
b3609
llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model (#8984)
* The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
* A GGML_OP_ACC shader has been added.
* The encoding performance of the CLIP model improved from 4.2 s on the CPU to 0.9 s on the GPU.
* Fix up coding style
* Add the missing initial parameter to resolve a compilation warning
* Add missing parameters
* Use nb1 and nb2 for dst
* Fix check of ggml_acc call results

Signed-off-by: Changyeon Kim <[email protected]>
Co-authored-by: 0cc4m <[email protected]>
b3604
rpc : print error message when failed to connect endpoint (#9042)
b3602
flake.lock: Update (#9068)
b3595
gguf-py : bump version from 0.9.1 to 0.10.0 (#9051)