Enable Intel®-AMX/oneDNN to accelerate IndexFlatIP search #3266
base: main
Conversation
Hi @guangzegu! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@guangzegu this patch is in an extremely early stage.
Thanks @mdouze. Is Intel Xeon 4th gen available for CI?
@alexanderguzhva Thank you very much for your comments.
@guangzegu Thanks, I'll take a look.
Great! How's it going? Have you run into any issues?
@guangzegu Hi, it is still in my plans, together with zilliztech/knowhere#535. Sorry that it is taking so long, I get constantly distracted :(
We are looking into compiling this in the CI.
@guangzegu Could you please rebase this? We can try a test CI build next and go from there. Thanks!
@ramilbakhshyiev Sure, I will rebase it. Thanks!
@alexanderguzhva No worries, I understand. Thank you for the update and for your efforts!
Thanks @guangzegu! We will be trying this out soon.
Hi @guangzegu and @ramilbakhshyiev I'm trying to build this PR on the github CI :) @guangzegu I'm following the documentation you have provided in
If that is the case, I managed to get everything to build on CI, but we have a C++ unit test failing, complaining about a memory leak (see build log). Is this something you can reproduce locally and expect? The actual test case source code is here.
@mengdilin Thank you for verifying this PR and uncovering potential issues 😄; I'm going to try to reproduce this issue in my environment.
Hi @guangzegu. After combing through the PR, I'm not seeing anything obvious that would cause the memory leak (besides my nit comment), but obviously I will defer to you on the dnnl memory management aspect. I ended up running the failing mem_leak test through valgrind (diffing the test result from the master commit vs your PR), and it looks like your PR did not introduce any new leak (valgrind produced consistent analysis between your PR and the master commit). We will look into the possibility of disabling this test or omitting it from the dnnl build to unblock merging your PR.
@guangzegu After omitting the memory leak test from your PR, it looks like we have encountered precision issues in several unit tests involving inner product computation. Is this expected? A source for one of the failing tests is faiss/tests/test_residual_quantizer.py, line 694 at 34bbe5e.
The test failure stacktrace looks like:
You can reproduce the failure on your PR by cloning PR #3615 and running the following after compiling faiss with DNNL mode on:
@asadoughi pointed out that it looks like this PR is trading off precision for speed, per https://github.com/facebookresearch/faiss/pull/3266/files#diff-9228cbbdef764c34694b0b5d637c05058ccc6c6b3279469a1b3421633e7feb3fR57. If that is the case, can you provide some tests covering the low-precision scenario? We can gate these tests behind an explicit flag.
Let's restructure the AMX integration with faiss so that the bulk of its complexity can live inside a dedicated folder, cppcontrib/amx/, since this feature is off by default and requires users to turn it on explicitly (trading off precision for performance). I made some suggestions on how to accomplish that.
Following up on the previous comment, do you mind adding a few dedicated low-precision tests for this PR?
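For illustration, here is a minimal sketch of what such a low-precision test could look like. This is an assumption-laden sketch, not the PR's actual test: the test name, the ENABLE_DNNL gating macro in test code, and the loose tolerance are all hypothetical; the index API is standard faiss.

```cpp
#include <gtest/gtest.h>

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

#include <faiss/IndexFlat.h>

#ifdef ENABLE_DNNL // hypothetical gate: only meaningful in oneDNN/AMX builds
TEST(TestDNNL, InnerProductLowPrecision) {
    const int d = 64, nb = 1000, k = 1;
    std::mt19937 rng(123);
    std::uniform_real_distribution<float> u(-1.0f, 1.0f);
    std::vector<float> xb(size_t(nb) * d), xq(d);
    for (auto& v : xb) v = u(rng);
    for (auto& v : xq) v = u(rng);

    faiss::IndexFlatIP index(d);
    index.add(nb, xb.data());

    float D = 0;
    faiss::idx_t I = -1;
    index.search(1, xq.data(), k, &D, &I);

    // exact float64 reference for the neighbor that was returned
    double ref = 0;
    for (int j = 0; j < d; j++) ref += double(xq[j]) * xb[size_t(I) * d + j];

    // the bf16-based path drops mantissa bits, so compare with a loose
    // relative tolerance instead of the usual tight one (value is arbitrary)
    EXPECT_NEAR(ref, D, 1e-2 * std::max(1.0, std::abs(ref)));
}
#endif
```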
faiss/utils/distances.cpp (outdated)

    #ifdef ENABLE_DNNL
        // use AMX to accelerate if available
        if (is_amxbf16_supported()) {
            float* res_arr = (float*)malloc(nx * ny * sizeof(float));
Suggested change:

    - float* res_arr = (float*)malloc(nx * ny * sizeof(float));
    + float* res_arr = new float[nx * ny];

nit: delete[] res_arr should be paired with new[]; otherwise it's undefined behavior.
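To spell out the pairing rule this nit refers to, a minimal standalone sketch (variable names are illustrative):

```cpp
#include <cstdlib>

int main() {
    const std::size_t n = 16;

    float* a = (float*)std::malloc(n * sizeof(float)); // C-style allocation...
    std::free(a);                                      // ...pairs with free()

    float* b = new float[n]; // C++ array allocation...
    delete[] b;              // ...pairs with delete[]

    // Mixing families (malloc + delete[], or new[] + free) is undefined
    // behavior, as is plain delete on a pointer obtained from new[].
    return 0;
}
```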
faiss/utils/onednn/onednn_utils.h (outdated)

    @@ -0,0 +1,141 @@
    /**
Let's reorganize the code a little bit and move the bulk of the AMX logic to a centralized location. Can you move this file and the other relevant pieces of dnnl code in distance computation to faiss/cppcontrib/amx: https://github.com/facebookresearch/faiss/tree/4eeaa42b930363b7087f1ad39db8adaa8267d61a/faiss/cppcontrib
    /// Getter of block sizes value for oneDNN/AMX distance computations
    int faiss_get_distance_compute_dnnl_query_bs();
Is it possible for you to move these to c_api/cppcontrib/amx/distances_dnnl_c.h, and if not feasible, gate them behind a compilation flag?
    void faiss_set_distance_compute_dnnl_query_bs(int value) {
        faiss::distance_compute_dnnl_query_bs = value;
    }
Is it possible for you to move these to c_api/cppcontrib/amx/distances_dnnl_c.h, and if not feasible, gate them behind a compilation flag?
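For concreteness, a sketch of what such a gated C API header could look like. The path and the include-guard name are assumptions based on the review comments; the two function declarations are taken from the quoted diff.

```cpp
// Hypothetical c_api/cppcontrib/amx/distances_dnnl_c.h (path assumed)
#ifndef FAISS_DISTANCES_DNNL_C_H
#define FAISS_DISTANCES_DNNL_C_H

#ifdef ENABLE_DNNL // expose these only in oneDNN/AMX builds

#ifdef __cplusplus
extern "C" {
#endif

/// Getter/setter for the oneDNN/AMX query block size
int faiss_get_distance_compute_dnnl_query_bs();
void faiss_set_distance_compute_dnnl_query_bs(int value);

#ifdef __cplusplus
}
#endif

#endif // ENABLE_DNNL
#endif // FAISS_DISTANCES_DNNL_C_H
```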
faiss/utils/distances.cpp (outdated)

    @@ -145,26 +149,60 @@ void exhaustive_inner_product_seq(

        FAISS_ASSERT(use_sel == (sel != nullptr));

    #ifdef ENABLE_DNNL
        // use AMX to accelerate if available
        if (is_amxbf16_supported()) {
Let's move the AMX-specific inner product implementations to cppcontrib/amx/distances_dnnl.cpp. Can you have variants of the 2 functions for DNNL, exhaustive_inner_product_seq_dnnl and exhaustive_inner_product_blas_dnnl, and have them live inside cppcontrib/amx alongside onednn_utils.h? Then you can dispatch to these two functions here, gated on ENABLE_DNNL and is_amxbf16_supported(), as that is the only place calling these two functions:

faiss/faiss/utils/distances.cpp, lines 611 to 614 in 4eeaa42:

        exhaustive_inner_product_seq(x, y, d, nx, ny, res);
    } else {
        exhaustive_inner_product_blas(x, y, d, nx, ny, res);
    }
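A minimal sketch of the dispatch being suggested, using the single exhaustive_inner_product_dnnl entry point from the later rename suggestion. The forward declarations stand in for the real implementations, whose actual signatures differ; everything here is illustrative, not the PR's final code.

```cpp
#include <cstddef>

namespace faiss {

// stand-ins for the real implementations declared elsewhere
bool is_amxbf16_supported();
extern int distance_compute_blas_threshold;
void exhaustive_inner_product_dnnl(const float*, const float*, size_t, size_t, size_t, float*);
void exhaustive_inner_product_seq(const float*, const float*, size_t, size_t, size_t, float*);
void exhaustive_inner_product_blas(const float*, const float*, size_t, size_t, size_t, float*);

// hypothetical dispatch, sketching knn_inner_product in distances.cpp
void knn_inner_product_dispatch(
        const float* x, const float* y,
        size_t d, size_t nx, size_t ny, float* res) {
#ifdef ENABLE_DNNL
    if (is_amxbf16_supported()) {
        // oneDNN/AMX path, implemented in cppcontrib/amx/distances_dnnl.cpp
        exhaustive_inner_product_dnnl(x, y, d, nx, ny, res);
        return;
    }
#endif
    if (nx < (size_t)distance_compute_blas_threshold) {
        exhaustive_inner_product_seq(x, y, d, nx, ny, res);
    } else {
        exhaustive_inner_product_blas(x, y, d, nx, ny, res);
    }
}

} // namespace faiss
```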
faiss/utils/distances.cpp (outdated)

    @@ -650,6 +709,8 @@ int distance_compute_blas_threshold = 20;
    int distance_compute_blas_query_bs = 4096;
    int distance_compute_blas_database_bs = 1024;
    int distance_compute_min_k_reservoir = 100;
    int distance_compute_dnnl_query_bs = 10240;
Can you move these to cppcontrib/amx/distances_dnnl.cpp?
Is it possible to move these two extern variables to cppcontrib/amx/distances_dnnl.h?
faiss/utils/distances.h (outdated)

    @@ -281,6 +281,10 @@ FAISS_API extern int distance_compute_blas_threshold;
    FAISS_API extern int distance_compute_blas_query_bs;
    FAISS_API extern int distance_compute_blas_database_bs;

    // block sizes for oneDNN/AMX distance computations
    FAISS_API extern int distance_compute_dnnl_query_bs;
Can you extern it in a separate header file, cppcontrib/amx/distances_dnnl.h? If not, can you gate it behind the ENABLE_DNNL flag?
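A sketch of what the suggested dedicated header could look like. The second database_bs block size is an assumption, inferred from the "two extern variables" remarks in this review; only distance_compute_dnnl_query_bs appears in the quoted diff.

```cpp
// Hypothetical cppcontrib/amx/distances_dnnl.h
#pragma once

#ifdef ENABLE_DNNL

#include <faiss/impl/platform_macros.h> // defines FAISS_API

namespace faiss {

// block sizes for oneDNN/AMX distance computations
FAISS_API extern int distance_compute_dnnl_query_bs;
FAISS_API extern int distance_compute_dnnl_database_bs; // assumed counterpart

} // namespace faiss

#endif // ENABLE_DNNL
```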
Hi @guangzegu and @xtangxtang, what is the status of this PR? Let me know if you are blocked on anything :)
Can I get an update on this merge?
@guangzegu, @xtangxtang, I've played a bit with AMX code. What are the advantages of using Intel libraries for AMX? I was able to write functional AMX-based code without any Intel libraries. Thanks.
@alexanderguzhva Because we need to split a big matrix into appropriately sized blocks that AMX can process. This work involves optimization methods that improve AMX performance. Also, we may start multiple threads to fully utilize AMX. All of this work is wrapped by the oneDNN library.
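To illustrate the kind of blocking described above, a conceptual sketch. Function and parameter names are illustrative; the scalar inner loop stands in for the oneDNN matmul/inner-product primitive that would compute a whole block at once, with oneDNN handling AMX tile configuration and threading internally.

```cpp
#include <algorithm>
#include <cstddef>

// Split the (nx x d) queries and (ny x d) database into blocks sized
// for AMX, processing one block pair at a time.
void blocked_inner_product(
        const float* x,      // nx * d query vectors
        const float* y,      // ny * d database vectors
        size_t d, size_t nx, size_t ny,
        float* out,          // nx * ny result matrix
        size_t query_bs, size_t database_bs) {
    for (size_t i0 = 0; i0 < nx; i0 += query_bs) {
        const size_t i1 = std::min(nx, i0 + query_bs);
        for (size_t j0 = 0; j0 < ny; j0 += database_bs) {
            const size_t j1 = std::min(ny, j0 + database_bs);
            // stand-in for a oneDNN primitive over x[i0:i1] * y[j0:j1]^T
            for (size_t i = i0; i < i1; i++) {
                for (size_t j = j0; j < j1; j++) {
                    float acc = 0.0f;
                    for (size_t t = 0; t < d; t++) {
                        acc += x[i * d + t] * y[j * d + t];
                    }
                    out[i * ny + j] = acc;
                }
            }
        }
    }
}
```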
I am not authorized to merge the pull request. I was just working with the Intel OPEA team / Linux Foundation / LAION to optimize retrieval times on large datasets and integrate this into the OPEA project. I can run tests on different hardware platforms to help with bug testing, but I cannot dictate the design goals for this repository.
@xtangxtang acked, will take a look next week! Before looking deeper: have you taken a look at the unit test concerns from the past comment? Basically, when compiling with the dnnl optimization, our unit tests are failing due to higher precision requirements (you should be able to reproduce these locally if you run the python tests; let me know if you need help reproducing). Can you please provide some tests covering the low-precision scenario? We can dedicate these tests to cover the DNNL changes.
Yes, we will provide low-precision UTs along with this PR. Thanks for your feedback.
Thanks for working on the change! Some code comments and a rebase request:
- Can you rebase off of the current main branch? Some of the code paths are out of date.
- Do you mind patching the changes in [ignore] Test CI AMX #3900 into your PR after you have added the low-precision tests, so that AMX CI signals show up? Once you have these tests, you will need to provide a flag such that by default we run tests in the high-precision case, but for AMX the flag can be toggled to cover the low-precision case.
faiss/utils/distances.cpp (outdated)

    #ifdef ENABLE_DNNL
    /* Find the nearest neighbors for nx queries in a set of ny vectors using oneDNN/AMX */
    template <class BlockResultHandler, bool use_sel = false>
    void exhaustive_inner_product_seq_dnnl(
Let's go a step further and move exhaustive_inner_product_seq_dnnl and exhaustive_inner_product_blas_dnnl (the latter should be renamed to exhaustive_inner_product_dnnl instead) to cppcontrib/amx/distances_dnnl.h, so that the only DNNL logic remaining in distances.cpp is the dispatching mechanism between blas and dnnl in knn_inner_product_select.
faiss/utils/distances.cpp (outdated)

    #ifdef ENABLE_DNNL
    /** Find the nearest neighbors for nx queries in a set of ny vectors using oneDNN/AMX */
    template <class BlockResultHandler>
    void exhaustive_inner_product_blas_dnnl(
rename this to exhaustive_inner_product_dnnl
faiss/utils/distances.cpp (outdated)

    @@ -650,6 +709,8 @@ int distance_compute_blas_threshold = 20;
    int distance_compute_blas_query_bs = 4096;
    int distance_compute_blas_database_bs = 1024;
    int distance_compute_min_k_reservoir = 100;
    int distance_compute_dnnl_query_bs = 10240;
Is it possible to move these two extern variables to cppcontrib/amx/distances_dnnl.h?
faiss/utils/distances.cpp (outdated)

        FAISS_ASSERT(use_sel == (sel != nullptr));

        float* res_arr = (float*)malloc(nx * ny * sizeof(float));
Nit: use std::unique_ptr instead of a raw malloc here.
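A minimal sketch of the suggested change, with names taken from the quoted diff (the wrapping function is only there to make the snippet self-contained):

```cpp
#include <cstddef>
#include <memory>

void example(std::size_t nx, std::size_t ny) {
    // replaces: float* res_arr = (float*)malloc(nx * ny * sizeof(float));
    auto res_arr = std::make_unique<float[]>(nx * ny);
    // pass res_arr.get() wherever a raw float* is needed; the buffer is
    // released automatically on every exit path, including exceptions.
    res_arr[0] = 0.0f;
}
```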
Thank you for the feedback and the helpful comments 😄! I've rebased the code on the current main branch and addressed the code comments you provided. Could you please review it again?
Thank you for all the hard work.
@guangzegu thanks for the updates! Can you also merge the low-precision unit test changes into this PR, incorporating the CI change in #3900 that enables test coverage for your PR? The reasoning here is that we want test coverage for this new mode to make sure everything works as expected before merging. As for the precision control flag, I recommend the following (cc @asadoughi if you have more thoughts):
Description
Intel® AMX (Intel® Advanced Matrix Extensions) is an AI acceleration engine deeply embedded into every core of 4th/5th Gen Intel® Xeon® Scalable processors; it is a set of programming extensions designed to enhance the performance of matrix operations. Intel oneAPI Deep Neural Network Library (oneDNN) is an open-source performance library designed to accelerate deep learning frameworks on Intel architectures. oneDNN is able to leverage the efficient matrix computation extensions provided by AMX, especially for computation-intensive matrix operations.
IndexFlatIP search performance accelerated by oneDNN/AMX improves by 1.7X to 5X compared to the default inner_product in scenarios with 1 query, dimensions ranging from 64 to 1024, and 1,000,000 vectors.
IndexFlatIP search performance accelerated by oneDNN/AMX improves by up to 4X compared to the BLAS inner_product in scenarios with 1000 queries, dimensions ranging from 64 to 1024, and 1,000,000 vectors.
How to use
When invoking CMake, add an option as follows:

    -DFAISS_ENABLE_DNNL=OFF

This option enables support for oneDNN to accelerate IndexFlatIP search (possible values are ON and OFF). When you want to use Intel®-AMX/oneDNN to accelerate IndexFlatIP search, set FAISS_ENABLE_DNNL to ON and run on a 4th/5th Gen Intel® Xeon® Scalable processor; the exhaustive_inner_product_seq method will then be accelerated. A minimal usage sketch follows.

Co-authored-by: @xtangxtang [email protected]
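For illustration, a minimal end-to-end sketch using the standard faiss index API (dataset sizes and random data are arbitrary). No code changes are needed to use the feature: once the library is built with FAISS_ENABLE_DNNL=ON, IndexFlatIP::search picks up the accelerated path automatically on supported CPUs.

```cpp
#include <faiss/IndexFlat.h>

#include <random>
#include <vector>

int main() {
    const int d = 128; // vector dimension
    const faiss::idx_t nb = 100000, nq = 1, k = 10;

    std::mt19937 rng(42);
    std::uniform_real_distribution<float> u(-1.0f, 1.0f);
    std::vector<float> xb(nb * d), xq(nq * d);
    for (auto& v : xb) v = u(rng);
    for (auto& v : xq) v = u(rng);

    faiss::IndexFlatIP index(d); // exhaustive inner-product index
    index.add(nb, xb.data());

    std::vector<float> D(nq * k);
    std::vector<faiss::idx_t> I(nq * k);
    // with FAISS_ENABLE_DNNL=ON on a 4th/5th Gen Xeon, this search runs
    // through the oneDNN/AMX-accelerated inner product path
    index.search(nq, xq.data(), k, D.data(), I.data());
    return 0;
}
```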