[experimental] backend: add new oneDNN backend #855
base: master
Conversation
Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.
This will be supported through:
* Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to … (see the sketch after this list).
* Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to …
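A minimal sketch (editor's addition, not from the PR) of such graph analysis against the public ggml graph structures: it detects a MUL_MAT node whose result feeds the next ADD node, the pattern a backend could execute as a single fused matmul-with-bias.

```cpp
#include "ggml.h"

// Detect the MUL_MAT -> ADD (bias) pattern at position i in the graph.
// A backend's graph_compute could run the pair as one fused primitive
// and skip the ADD node on the next iteration.
static bool is_mul_mat_add(const struct ggml_cgraph * gf, int i) {
    if (i + 1 >= gf->n_nodes) {
        return false;
    }
    const struct ggml_tensor * mm  = gf->nodes[i];
    const struct ggml_tensor * nxt = gf->nodes[i + 1];
    return mm->op  == GGML_OP_MUL_MAT &&
           nxt->op == GGML_OP_ADD     &&
           (nxt->src[0] == mm || nxt->src[1] == mm);
}
```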
Very good idea @rfsaliev.
Thank you @slaren for your response.
oneDNN supports at least int8 quantization. Unfortunately, oneDNN's quantization methods (per-tensor or per-dimension) differ from GGML's (per-block). Anyway, I will look for opportunities to support quantization.
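To illustrate the mismatch (editor's sketch, kept self-contained rather than including ggml's actual quantization headers): ggml's Q8_0 format stores blocks of 32 int8 values, each block carrying its own scale, while oneDNN attaches one scale per tensor or per dimension, so per-block data would have to be dequantized (or its scales regrouped) before oneDNN could consume it.

```cpp
#include <cstdint>
#include <vector>

// Mirrors the shape of ggml's block_q8_0 (ggml uses an fp16 scale;
// a plain float is used here to keep the sketch self-contained).
constexpr int QK8_0 = 32;
struct block_q8_0 {
    float  d;          // per-block scale
    int8_t qs[QK8_0];  // 32 quantized values
};

// Dequantize per-block data to f32 so a per-tensor-scale library
// such as oneDNN can consume it.
std::vector<float> dequantize_q8_0(const block_q8_0 * blocks, int n_blocks) {
    std::vector<float> out((size_t) n_blocks * QK8_0);
    for (int b = 0; b < n_blocks; ++b) {
        for (int i = 0; i < QK8_0; ++i) {
            out[(size_t) b * QK8_0 + i] = blocks[b].d * blocks[b].qs[i];
        }
    }
    return out;
}
```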
Thanks, it looks like it is possible to do some fusing like MatMul+BiasAdd in …
In the case of oneDNN, the weights buffer layout depends on the type of the operation that uses the weights. Can you please point me to a method I can follow to identify the consuming operation in …
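For reference, a sketch (editor's addition, using the oneDNN 3.x-style API; the dimensions and f32 data types are assumptions) of the usual oneDNN pre-packing idiom: declare the weights with `format_tag::any` so the matmul primitive descriptor picks its preferred layout, then reorder the plain weights into that layout once, up front.

```cpp
#include "dnnl.hpp"

// Pre-pack matmul weights into oneDNN's preferred layout.
dnnl::memory prepack_weights(dnnl::engine & eng, dnnl::stream & strm,
                             dnnl::memory & wei_plain,
                             int64_t M, int64_t K, int64_t N) {
    using namespace dnnl;
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::f32, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // let the library choose the weights layout for this matmul shape
    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md);

    // one-time reorder from the user's plain layout to the packed one
    memory wei_packed(pd.weights_desc(), eng);
    reorder(wei_plain, wei_packed).execute(strm, wei_plain, wei_packed);
    strm.wait();
    return wei_packed;
}
```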
src/CMakeLists.txt (outdated)

```cmake
set(GGML_HEADERS_DNNL ggml-dnnl.h)
set(GGML_SOURCES_DNNL ggml-dnnl.cpp)

set(GGML_EXTRA_INCS ${GGML_EXTRA_INCS} ${CLBLAST_INC} ${OPENCL_INC})
```
CLBLAST vars look out of place here
Thank you - it was copy-pasted by mistake.
I've fixed it along with some other parts of this file.
@slaren, can you please help me understand how a backend should work in …? Here is what I did:
```sh
(tf_env) rfsaliev:~$ git clone https://github.com/ggerganov/ggml.git
(tf_env) rfsaliev:~$ cd ggml
(tf_env) rfsaliev:~/ggml$ git apply < ~/ggml-blas-debug.patch
(tf_env) rfsaliev:~/ggml$ mkdir build && cd build
(tf_env) rfsaliev:~/ggml/build$ ../examples/gpt-2/download-model.sh 117M
(tf_env) rfsaliev:~/ggml/build$ python ../examples/gpt-2/convert-ckpt-to-ggml.py models/gpt-2-117M 0
(tf_env) rfsaliev:~/ggml/build$ cmake .. -DGGML_BLAS=ON && cmake --build . --target gpt-2-sched
(tf_env) rfsaliev:~/ggml/build$ ./bin/gpt-2-sched -m models/gpt-2-117M/ggml-model-f32.bin -p "This is an example of" -n 1 -ngl 32 -s 1
```

And got a number of "OP supported" prints in the output.
BLAS backend debug print patch
BLAS is only used with batches of at least 32 tokens. The "OP supported" prints you are seeing are probably from the reserve run, which is never executed. Try a larger prompt, or always return true from … (see the sketch below).
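A minimal sketch of that debug shortcut (the function name is hypothetical, following the ggml backend naming convention): report every op as supported so the scheduler routes nodes to the backend regardless of batch size.

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Debug only: claim support for everything so graph_compute is called
// even for small batches. A real implementation would inspect op->op
// and the tensor types.
static bool ggml_backend_dnnl_supports_op(ggml_backend_t backend,
                                          const struct ggml_tensor * op) {
    (void) backend;
    (void) op;
    return true;
}
```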
Thank you, …
* Backend logic is based on the BLAS backend
* Implemented support for the MUL_MAT operation
* Implemented MUL_MAT fusing with a subsequent ADD as bias-add
* Implemented weights 'pre-packing' (reordering) for the MUL_MAT operation

Notes:
* This is the second version of the DNNL backend, based on the refactored ggml backend support implemented together with the BLAS backend
* It is recommended to enable GGML_OPENMP when oneDNN is compiled with DNNL_CPU_RUNTIME=OMP (the default)
Hello, I have also added simple MUL_MAT+ADD fusing and weights 'pre-packing' (reordering) features.
Is this PR dead?
RFC please.
This PR is a Proof-of-Concept for integrating the oneDNN (DNNL) library into GGML.
I created this PR rather than an Issue in order to start the discussion about a oneDNN backend from a working demo.
Motivation: oneDNN is optimized for Intel(R) Architecture Processors, Intel Graphics, and Arm* 64-bit Architecture (AArch64)-based processors. The backend will allow GGML to utilize the latest Intel CPU/GPU instruction-set performance features (e.g. AMX) out of the box.
Known issues and TODOs:
Functionality:
Performance:
@ggerganov, @slaren, can you please advise on a proper method to effectively implement operation fusing and weights pre-packing?
Some technical details:
* The backend is implemented in `ggml-dnnl.h`, `ggml-dnnl.cpp`. The backend re-uses the CPU buffer type - a custom buffer type is under development and wrapped by the `USE_DNNL_BACKEND` macro.
* The backend is enabled via the `GGML_DNNL` configuration option.
* The `gpt2-sched` example is modified to convert model weights from `FP16` to `FP32` if the DNNL backend is enabled - the current oneDNN release version does not support MatMul cases with `src_type=dst_type=f32` and `weights_type=fp16` (see the sketch below).
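As an illustration of that conversion (editor's sketch using ggml's public fp16 helper; the wrapper function itself is hypothetical):

```cpp
#include "ggml.h"
#include <vector>

// Convert an FP16 weight buffer to FP32 so oneDNN's MatMul can consume
// it (f32 src/dst combined with f16 weights is rejected by current
// oneDNN releases).
static std::vector<float> weights_to_f32(const ggml_fp16_t * src, int64_t n) {
    std::vector<float> dst((size_t) n);
    ggml_fp16_to_fp32_row(src, dst.data(), n); // public ggml helper
    return dst;
}
```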