[BUG] Regression in cohere-10m force merge latency after switching to NativeEngines990KnnVectorsWriter #2134

shatejas · 2024-09-20T19:16:17Z

What is the bug?

After the switch to NativeEngines990KnnVectorsWriter, force merge latencies increased by approximately 20% in nightly runs.

The increase has been consistent

How can one reproduce the bug?
Running a force-merge benchmark against 2.17 vs. 2.16 with the cohere 10m dataset takes 2000 seconds (~30 mins) longer on 2.17.

What is the expected behavior?
It should take the same time or less

What is your host/environment?
Nightlies dashboard

Do you have any screenshots?
NA

Do you have any additional context?
NA


shatejas · 2024-09-20T19:18:46Z

To make the issue easier to reproduce, the cohere 1m dataset was used for the benchmarks in the table below.

| Code version | Number of index segments | Force merge time (minutes) | Force merge segments |
|---|---|---|---|
| 2.17 code | 75 | 15.68263 | 3 |
| #2133 code | 92 | 12.51252 | 3 |
| 2.16 code (minor code change to mimic it) | 88 | 15.68209 | 3 |

Estimated bottlenecks

KNNVectorValues Creation

KNNVectorValues are currently created 3 times, once for each of the steps listed below. We cannot reuse the same object because there is no way to reset the iterator, and putting effort into iterator-reset logic might not improve latency.

Computing totalLiveDocs
Training for quantization
Building index

Currently we create KNNVectorValues for quantization training even when quantization is not needed. The second row in the table above (#2133 code) shows some improvement in force merge time.
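To illustrate the constraint, here is a minimal sketch, assuming a forward-only iterator that cannot be reset. Each pass asks a `Supplier` for a fresh instance, and the training pass is skipped when quantization is not needed. All names here (`VectorPasses`, `ForwardOnlyVectors`, `run`, `created`) are hypothetical stand-ins, not the plugin's actual classes.

```java
import java.util.List;
import java.util.function.Supplier;

public class VectorPasses {
    /** Hypothetical stand-in for a forward-only vector iterator that cannot be reset. */
    static final class ForwardOnlyVectors {
        private final List<float[]> vectors;
        private int position = -1;

        ForwardOnlyVectors(List<float[]> vectors) {
            this.vectors = vectors;
        }

        /** Advances to the next vector; returns false once exhausted. */
        boolean next() {
            return ++position < vectors.size();
        }
    }

    /** Number of iterator instances created by the most recent call to run(). */
    public static int created = 0;

    /** Runs the passes and returns the vector count seen in the first pass. */
    public static int run(List<float[]> data, boolean quantizationNeeded) {
        created = 0;
        Supplier<ForwardOnlyVectors> supplier = () -> {
            created++;
            return new ForwardOnlyVectors(data);
        };

        // Pass 1: count the vectors (stand-in for computing total live docs).
        int total = consume(supplier.get());

        // Pass 2: only create an iterator for training when quantization is needed.
        if (quantizationNeeded) {
            consume(supplier.get());
        }

        // Pass 3: stand-in for building the index.
        consume(supplier.get());
        return total;
    }

    /** Drains the iterator and returns how many vectors it yielded. */
    private static int consume(ForwardOnlyVectors it) {
        int n = 0;
        while (it.next()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<float[]> data = List.of(new float[] {1f, 2f}, new float[] {3f, 4f});
        run(data, false);
        System.out.println("iterators created without quantization: " + created);
        run(data, true);
        System.out.println("iterators created with quantization: " + created);
    }
}
```

In this sketch the supplier keeps each pass independent, so dropping an unneeded pass is a one-line condition rather than a change to the iterator itself.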

TotalLiveDocs computes

Computing total live docs takes linear time. The totalLiveDocs value is currently needed for:

Mean calculations during quantization training
Memory allocation computations while building graph for HNSW

Flush case

For flush we can avoid this calculation, as there are no deleted docs involved; we can rely on KNNVectorValues or the vectors in the field to give us the right totalLiveDocs value.

Merge case

Merging removes deleted docs: they are skipped while the segments are merged. To do that, the current code path uses the APIs in MergedVectorValues to get an iterator that skips deleted docs. However, these APIs do not give an iterator whose size count accounts for deleted docs. As a result, even KNNVectorValues cannot return the right value, because it relies on the iterator provided by MergedVectorValues to compute total live docs.
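The mismatch described above can be sketched as follows. This is a simplified model, not the Lucene or k-NN API: `SkippingIterator` is a hypothetical iterator that skips deleted docs while its upfront `size()` still includes them, so the only way to get the live count is a full pass.

```java
import java.util.BitSet;

public class SkippingIterator {
    private final int maxDoc;
    private final BitSet liveDocs; // a set bit means the doc is live (not deleted)
    private int doc = -1;

    public SkippingIterator(int maxDoc, BitSet liveDocs) {
        this.maxDoc = maxDoc;
        this.liveDocs = liveDocs;
    }

    /** Size known upfront; in this model it still includes deleted docs. */
    public int size() {
        return maxDoc;
    }

    /** Returns the next live doc id, or -1 once the iterator is exhausted. */
    public int nextDoc() {
        doc = liveDocs.nextSetBit(doc + 1);
        if (doc < 0 || doc >= maxDoc) {
            doc = maxDoc; // stay exhausted on further calls
            return -1;
        }
        return doc;
    }

    /** Linear-time count: the iterator must be fully consumed to learn the live total. */
    public static int countLive(SkippingIterator it) {
        int live = 0;
        while (it.nextDoc() != -1) {
            live++;
        }
        return live;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0);
        live.set(2);
        live.set(4);
        SkippingIterator it = new SkippingIterator(5, live);
        System.out.println("reported size: " + it.size() + ", live docs: " + countLive(it));
    }
}
```

Because counting consumes the iterator, a second instance is needed for the actual work, which ties this cost back to the repeated KNNVectorValues creation.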

navneet1v · 2024-09-20T19:50:30Z

@shatejas one way to avoid the linear complexity of computing totalLiveDocs when there are no deleted docs is to write our own custom FloatVectorValues merger. We already have something like this for BinaryDocValues; ref: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN80Codec/KNN80DocValuesReader.java#L38-L64

If we do this, we can remove the complexity for total live docs. We can also add some clone interfaces on those merged values which will help us remove complexity from code.
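A rough sketch of that idea, under stated assumptions: a custom merger sees each source segment, so it can sum per-segment counts and only walk the docs of segments that actually have deletions. The `Segment` type and `totalLiveDocs` method below are hypothetical and are not taken from the referenced file. The `null` live-docs value follows the Lucene convention that `getLiveDocs()` returns `null` when a segment has no deletions.

```java
import java.util.BitSet;
import java.util.List;

public class MergedLiveDocs {
    /** Hypothetical per-segment view: doc ids that have a vector, plus live-doc bits. */
    public static final class Segment {
        final int[] docIdsWithVectors;
        final BitSet liveDocs; // null means the segment has no deleted docs

        public Segment(int[] docIdsWithVectors, BitSet liveDocs) {
            this.docIdsWithVectors = docIdsWithVectors;
            this.liveDocs = liveDocs;
        }
    }

    /** Sums live vector counts across segments, with a fast path for segments without deletions. */
    public static long totalLiveDocs(List<Segment> segments) {
        long total = 0;
        for (Segment segment : segments) {
            if (segment.liveDocs == null) {
                // Fast path: every vector in this segment is live, no iteration needed.
                total += segment.docIdsWithVectors.length;
            } else {
                // Slow path: check each doc with a vector against the live-doc bits.
                for (int docId : segment.docIdsWithVectors) {
                    if (segment.liveDocs.get(docId)) {
                        total++;
                    }
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0);
        live.set(3);
        List<Segment> segments = List.of(
            new Segment(new int[] {0, 1, 2}, null),    // no deletions
            new Segment(new int[] {0, 1, 2, 3}, live)  // docs 1 and 2 deleted
        );
        System.out.println("total live docs: " + totalLiveDocs(segments));
    }
}
```

With this shape the cost is proportional only to the segments that contain deletions, and the total is known before the merged iterator is consumed.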

jmazanec15 · 2024-09-23T16:06:01Z

@shatejas The table doesn't show a repro for 2.17 vs 2.16, right?

shatejas · 2024-09-23T16:50:45Z

@jmazanec15 The first row is 2.17, but run on the main branch since the code path is the same. To mimic the 2.16 code path, this change was made while running the benchmark.

jmazanec15 · 2024-09-23T16:53:05Z

@shatejas but isn't 2.17 time same as 2.16 - so can we not repro it with the setup?

shatejas · 2024-09-23T18:40:36Z

> @shatejas but isn't 2.17 time same as 2.16 - so can we not repro it with the setup?

Not exactly the same: the number of segments being merged is ~15% higher for 2.16 than for 2.17, so there is some difference.

shatejas added bug Something isn't working untriaged labels Sep 20, 2024

navneet1v assigned shatejas Sep 20, 2024

shatejas mentioned this issue Sep 20, 2024

Makes sure KNNVectorValues aren't recreated unnecessarily when #2133

Merged

5 tasks

shatejas mentioned this issue Sep 23, 2024

Optimizes live docs computes for force merge in NativeEngines990KNNVectorWriter #2135

Closed

5 tasks

jmazanec15 removed the untriaged label Sep 28, 2024

shatejas linked a pull request Oct 4, 2024 that will close this issue

Preloads .vec and .vex files #2186

Open

5 tasks
