-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Regression in cohere-10m force merge latency after switching to NativeEngines990KnnVectorsWriter #2134
Comments
To reproduce easily cohere 1m dataset was used for benchmarking for the below table
Estimated bottlenecksKNNVectorValues CreationKNNVectorValues are created 3 times currently, we cannot reuse the same object as there is no way we could reset the iterator and putting effort into logic for resetting the iterator might not result in latency improvements
Currently we are creating KNNVectorValues when quantization is not needed. Exp 2 in the above table shows some improvement in force merge time TotalLiveDocs computesThere is a linear time complexity to compute total live docs. TotalLiveDocs value is currently needed to
Flush caseFor flush we can avoid this calculation as there are no deleted docs involved and we can rely on KNNVectorValues or vectors in the field to give us the right result for totalLiveDocs Merge caseMerge involves removing deleted docs, While merging the segments the deleted docs aren’t considered. To do that current code path is using APIs in MergedVectorValues to have an iterator that can iterate while skipping the deleted docs. The APIs here does not give an iterator which considered deleted docs in its size count. As a result even KNNVectorValues cannot return the right result as it relies on the iterator provided by the MergedVectorValues to compute total live docs |
@shatejas one way to avoid the linear complexity for totalLives does when there are no deleted docs is we can write our custom FloatVectorValues merger. We already have something like this in BinaryDocValues: ref: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN80Codec/KNN80DocValuesReader.java#L38-L64 If we do this, we can remove the complexity for total live docs. We can also add some clone interfaces on those merged values which will help us remove complexity from code. |
@shatejas The table doesnt show repro for 2.17 vs 2.16 right? |
@jmazanec15 So first row is 2.17 but on main branch as the code path is the same. To mimic 2.16 code path this change was made while running the bench mark |
@shatejas but isnt 2.17 time same as 2.16 - so can we not repro it with the setup? |
Not exactly same, the number of segments being merged is ~15% higher for 2.16 compared to 2.17 so there is some difference |
What is the bug?
After the switch to
NativeEngines990KnnVectorsWriter
we saw force merge latencies increased approximately by 20% in nightly runsThe increase has been consistent
How can one reproduce the bug?
Running a benchmark for force-merge against 2.17 vs 2.16 with cohere 10m dataset takes 2000seconds (~30 mins) more
What is the expected behavior?
It should take the same time or less
What is your host/environment?
Nightlies dashboard
Do you have any screenshots?
NA
Do you have any additional context?
NA
The text was updated successfully, but these errors were encountered: