
sync : llama.cpp #965

Merged (31 commits) on Sep 20, 2024

Commits on Sep 20, 2024

  1. Commit 6f8c166
  2. add check malloc result on device (llama/9346)

    * add check malloc result on device
    
    * update for review comments, check all malloc_device() result
    
    ---------
    
    Co-authored-by: arthw <[email protected]>
    2 people authored and ggerganov committed Sep 20, 2024
    Commit f3068aa
  3. Commit 58ccd65
  4. Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (llama/9118)
    
    * Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.
    
    * fix compile issues
    
    * Fix issues where the last submit wasn't executed or handled properly.
    
    * remove trailing whitespace
    
    * Repair GGML_VULKAN_CHECK_RESULTS
    
    * Increase submit counter only if actual work has been submitted and increase submit count to 100.
    
    * Fix: some nodes were not checked when GGML_VULKAN_CHECK_RESULTS was enabled.
    mtavenrath authored and ggerganov committed Sep 20, 2024
    Commit cb3abaa
  5. Commit ab7c211
  6. ggml : vector length agnostic SVE support (llama/9290)

    * Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths
    
    * Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths
    
    * Removed WhiteSpaces
    
    * ggml : style changes + fix 512-bit nb loop check
    
    - fix local scope in switch cases
    - consistent predicate names
    - empty lines when necessary
    - opening braces, spaces
    - const-correctness
    - add asserts
    
    * Update ggml/src/ggml-quants.c
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    Vithulep and ggerganov committed Sep 20, 2024
    Commit 82088b4
  7. rpc : fix segfault with nkvo (llama/9389)

    * rpc : fix nkvo
    
    * rpc : buf_size must not be static
    
    ref: #9337
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    2 people authored and ggerganov committed Sep 20, 2024
    Commit 9f9246d
  8. Commit bf778d0
  9. sycl : update support conditions (llama/9394)

    * sycl : update support condition to im2col
    
    Signed-off-by: Alberto Cabrera <[email protected]>
    
    * Added TODO to remind supporting FP32 im2col
    
    ---------
    
    Signed-off-by: Alberto Cabrera <[email protected]>
    Alcpz authored and ggerganov committed Sep 20, 2024
    Commit 98d9ff3
  10. musa: remove Clang builtins mapping (llama/9421)

    Signed-off-by: Xiaodong Ye <[email protected]>
    yeahdongcn authored and ggerganov committed Sep 20, 2024
    Commit 6c6800c
  11. Commit ed50f6e
  12. Commit f24368f
  13. riscv : modify Makefile and add a RISCV_VECT to print log info (llama/9442)
    
    - Added ggml_cpu_has_riscv_v() in GGML to print system info in log
    - Modified Makefile to only use flag when cross compiling for RISC-V
    Tameem-10xE authored and ggerganov committed Sep 20, 2024
    Commit 4fcc15a
  14. cann: Add host buffer type for Ascend NPU (llama/9406)

    * feat: Add host buffer type for Ascend NPU (CANN backend)
    
    * fix some checking errors
    
    * Add a few comments
    Dou-Git authored and ggerganov committed Sep 20, 2024
    Commit bab17a2
  15. cmake : use list(APPEND ...) instead of set() + dedup linker (llama/9463)
    
    * cmake : use list(APPEND ...) instead of set() + dedup linker
    
    ggml-ci
    
    * cmake : try fix sycl
    
    * cmake : try to fix sycl 2
    
    * cmake : fix sycl build (llama/9469)
    
    * try fix sycl build
    
    * use CMAKE_CXX_FLAGS as a string variable
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * one more CMAKE_CXX_FLAGS fix (llama/9471)
    
    ---------
    
    Co-authored-by: Michael Podvitskiy <[email protected]>
    ggerganov and Xarbirus committed Sep 20, 2024
    Commit 9a00cfc
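The difference between the two CMake idioms mentioned in this entry can be shown in a short config fragment. This is a sketch under assumptions (`C_FLAGS` is an invented variable name; the fragment is not the actual ggml CMakeLists):

```cmake
# set() would clobber anything a parent project already put in the list:
#   set(C_FLAGS -O3)
# list(APPEND ...) extends the existing list instead:
list(APPEND C_FLAGS -O3)

# CMAKE_CXX_FLAGS is a single space-separated string, not a list, which is
# why the follow-up fixes treat it as a string variable:
string(APPEND CMAKE_CXX_FLAGS " -march=native")
```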
  16. ggml : ggml_type_name return "NONE" for invalid values (llama/9458)

    When running on Windows, the quantization utility attempts to print the types that are not set, which leads to a crash.
    ykhrustalev authored and ggerganov committed Sep 20, 2024
    Commit 4de945c
  17. Commit 6d239fb
  18. Commit 6f523e2
  19. Commit 6f8cf41
  20. Commit 18ecce4
  21. ggml : IQ4_NL sgemm + Q4_0 AVX optimization (llama/9422)

    * squashed
    
    readd my iq4_nl sgemm PR ggerganov/llama.cpp#8049
    
    have ggml_vec_dot_q4_0 do two blocks per loop for avx
    
    try out f16c ggml_vec_dot_iq4_nl, but it's not really faster; as per ggerganov/llama.cpp#8549 we can calculate several blocks at a time with no issue
    
    * shuffle
    
    * remove f16c iq4_nl as I can't make it faster than before
    netrunnereve authored and ggerganov committed Sep 20, 2024
    Commit 4ae0501
  22. cmake : do not hide GGML options + rename option (llama/9465)

    * cmake : do not hide GGML options
    
    ggml-ci
    
    * build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS
    
    for consistency
    
    ggml-ci
    ggerganov committed Sep 20, 2024
    Commit 0192921
  23. Commit 1df27fc
  24. threadpool : skip polling for unused threads (llama/9461)

    * threadpool: skip polling for unused threads
    
    Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
    This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur).
    
    n_threads_cur is now an atomic_int to explicitly tell the thread sanitizer that it is written
    from one thread and read from other threads (not a race condition).
    
    * threadpool: further simplify and improve ggml_barrier
    
    Avoid using strict memory order while polling, yet make sure that all threads go through a
    full memory barrier (memory fence) on ggml_barrier entrance and exit.
    
    * threads: add simple barrier test
    
    This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.
    
    * threadpool: improve thread sync for new-graphs
    
    Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order
    to keep it efficient, once the new graph is detected we do full fence using read-modify-write
    with strict memory order.
    
    * threadpool: improve abort handling
    
    Do not use threadpool->ec (exit code) to decide whether to exit the compute loop.
    threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it.
    
    Instead introduce atomic threadpool->abort flag used for this. This is consistent with
    how we handle threadpool->stop or pause.
    
    While at it add an explicit atomic_load for n_threads_cur for consistency.
    
    * test-barrier: release threadpool before releasing the context
    
    fixes a use-after-free detected by the gcc thread sanitizer on x86-64;
    for some reason the llvm sanitizer does not detect this issue.
    max-krasnyansky authored and ggerganov committed Sep 20, 2024
    Commit 2a5a49a
  25. ggml : fix n_threads_cur initialization with one thread (llama/9538)

    * ggml : fix n_threads_cur initialization with one thread
    
    * Update ggml/src/ggml.c
    
    ---------
    
    Co-authored-by: Max Krasnyansky <[email protected]>
    2 people authored and ggerganov committed Sep 20, 2024
    Commit 4344c2d
  26. Commit 68ad0d0
  27. ggml : fix trailing whitespace (llama/0)

    ggml-ci
    ggerganov committed Sep 20, 2024
    Commit 23188a3
  28. ggml : fix builds (llama/0)

    ggml-ci
    ggerganov committed Sep 20, 2024
    Commit eea09cf
  29. ggml : refactoring (llama/#0)

    - d6a04f87
    - 23e0d70b
    ggerganov committed Sep 20, 2024
    Commit cd7d18e
  30. sync : llama.cpp

    ggerganov committed Sep 20, 2024
    Commit 242ae95
  31. examples : adapt to ggml.h changes (#0)

    ggml-ci
    ggerganov committed Sep 20, 2024
    Commit a146842