
Performance difference between files getting opened with IOContext.RANDOM vs IOContext.READ during merges #13920

Open
navneet1v opened this issue Oct 16, 2024 · 11 comments

Comments

@navneet1v
Contributor

navneet1v commented Oct 16, 2024

Description

For the past month we have been testing the performance difference between files opened with IOContext.RANDOM vs IOContext.READ, especially during merges, with Lucene version 9.11.1 and Opensearch version 2.17. We started this deep-dive when we saw an increase in the time taken by our force merges.

Background

The Opensearch k-NN plugin provides vector search capabilities for Opensearch. The architecture of the k-NN plugin is very similar to how Lucene implements vector search, with a few small but key differences. Before version 2.17 of Opensearch, the plugin used dvd files to store the raw vectors, while the graph files were stored separately. These graphs are not built with Lucene HNSW but with external libraries like Faiss.
In version 2.17 of Opensearch, we started using KNNFlatVectorFormat in place of dvd files to store and read the raw vectors, since reading vectors from .vec files as float[] is more efficient than reading byte[] and then converting to float[]. Ref: opensearch-project/k-NN#1853
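To illustrate the efficiency argument, here is a small standalone sketch (plain java.nio, not the actual Lucene readers, and the little-endian layout is just an assumption for the demo): decoding raw bytes into a float[] in one bulk operation versus reassembling every float by hand.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FloatDecodeDemo {
    // Bulk decode: view the raw bytes as floats and copy them out in one call.
    static float[] bulkDecode(byte[] raw) {
        float[] out = new float[raw.length / Float.BYTES];
        ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(out);
        return out;
    }

    // Per-value decode: reassemble each float from its four little-endian bytes.
    static float[] perValueDecode(byte[] raw) {
        float[] out = new float[raw.length / Float.BYTES];
        for (int i = 0; i < out.length; i++) {
            int bits = (raw[4 * i] & 0xFF)
                    | (raw[4 * i + 1] & 0xFF) << 8
                    | (raw[4 * i + 2] & 0xFF) << 16
                    | (raw[4 * i + 3] & 0xFF) << 24;
            out[i] = Float.intBitsToFloat(bits);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] vector = {1.0f, -2.5f, 3.25f};
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float f : vector) buf.putFloat(f);
        byte[] raw = buf.array();
        // Both decoders recover the same vector.
        System.out.println(java.util.Arrays.equals(bulkDecode(raw), perValueDecode(raw)));
    }
}
```

Both paths produce the same values; the point is simply that a format storing floats natively lets the reader do the bulk path instead of the per-byte path.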

Observations

After making the switch, we observed that our merge time for a 10M 768D dataset increased by 20%. We did extensive deep-dives and experiments on the root cause (ref1, ref2) and compared the code paths for dvd files and the .vec file format, and we were able to see that IOContext.RANDOM on .vec files is causing this regression.

This regression occurs because, during merges, every Lucene99FlatVectorsReader performs operations like checkIntegrity (which computes a checksum over the whole file) and reading all vector values to create the new segment; these are sequential reads rather than random reads.
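For context, an integrity check of this kind streams the entire file once from front to back to verify a CRC32 checksum, which is an inherently sequential access pattern. A simplified standalone sketch of that pattern (plain JDK, not Lucene's actual implementation):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public class SequentialChecksum {
    // Read the file front to back in fixed-size chunks, feeding a CRC32.
    // This mirrors the access pattern of an integrity check: one sequential pass,
    // which a RANDOM madvise (reduced read-ahead) actively pessimizes.
    static long checksum(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("vec", ".bin");
        Files.write(tmp, "hello vectors".getBytes());
        System.out.println(Long.toHexString(checksum(tmp)));
        Files.delete(tmp);
    }
}
```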

I do believe that a RANDOM madvise on the .vec file is beneficial for search and graph merges, since Lucene uses this file to store the raw vectors for HNSW. For reference, PR #13267 added the RANDOM IOContext for .vec files.

Solutions Tried:

We have been trying multiple solutions (all on Lucene version 9.11.1) and have been in touch with @uschindler and @mikemccand over [email protected]:

  1. Switching the IOContext from RANDOM to READ: This actually helped reduce merge time, bringing our original 20% increase in merge time down to 0%. But making this change from RANDOM to READ for .vec added extra latency to Lucene HNSW search: we saw a 2x latency increase on a 10M 768D dataset. Code ref
  2. Creating a new IndexInput in the checkIntegrity function: This solution [ref] creates a new IndexInput in the checkIntegrity function of Lucene99FlatVectorsReader with the SEQUENTIAL madvise. Its biggest con is that we create 2 IndexInputs for the same file in different threads, which is not recommended per this.
  3. We also tried enabling the preload functionality for the .vec file. It really helped reduce merge time, similar to LUCENE-9375: check github actions #1, but we again suffered a 2x latency increase during search.
  4. We borrowed some code from the latest version of Lucene that changes the madvise before doing integrity checks and merging of flat vectors, and applied it to version 9.11.1. ref: shatejas@4de3872#diff-e0a29611df21f6d32a461e2d24db1585cdf3a8590f08d93b097f0dd84684ebc8R316. But with this we expect any search happening during a merge to see high latencies; this increase is a hypothesis, as we have not run benchmarks around it.

What is an ideal solution?

In my mind an ideal solution is one that takes advantage of the different madvise types and changes the madvise on the file based on need: if a merge of flat vectors is happening, use SEQUENTIAL; if the HNSW graph is being built or searched, flip back to RANDOM. I am not sure what the consequences of this would be and would like to know what the community thinks. This is similar to option 4 above.

Also, since Lucene99FlatVectorsFormat now extends KNNVectorsFormat thanks to PR #13469, I believe consumers of this format need the ability to change the madvise, so that Lucene99FlatVectorsFormat can be used independently and not always be tied to HNSW.

FAQ

  1. On what machines were the benchmarks with the 10M 768D dataset performed?
    Since Opensearch is distributed Lucene, the setup was as follows:

    1. 3 Opensearch nodes were used.
    2. All nodes had 32GB of heap.
    3. All nodes had 128GB of RAM and 16 vCPUs.
    4. The 10M dataset was divided into 6 Lucene indices, aka primary shards, so each Lucene index had about 1.6M docs.
    5. Only the vector field was indexed.
  2. Were the benchmarks performed on the Lucene library independently?
    No, we have not performed any benchmarks with the Lucene library independently, but I am working on a reproducible setup. If there is an easy way to set it up and reproduce, please share.

cc: @shatejas, @vamshin, @jmazanec15

@uschindler
Contributor

uschindler commented Oct 17, 2024

Thanks for opening the issue. I already made a similar suggestion in another PR and also on the mailing list.

I'd go the route of temporarily changing the IOContext to SEQUENTIAL. This may of course slow down random reads, but on the other hand, once the whole file is merged away (and was therefore read), it should be in the FS cache anyway. If not, you have too little memory; as @s1monw says: "Add more RUM" :-)

Users of the old segment which was merged away will only use it until the next IndexReader reopen, so by signaling that we read it only once, it's a good idea to get rid of it from the cache soon.

So my proposal is:

  • Add a method to IndexInput to change the IOContext (without cloning it), but document clearly that all clones or slices opened at the same time are also affected.
  • Before merging segments, we should add a hook to the codec so it can call some special method on the incoming CodecReader to "make it ready for merging" and "revert to normal use". This could instruct the codec to apply different madvise advices or restore them. I am not sure what the best API for that is; this was just a quick idea (I haven't looked at the different codec components). In general the hooks should be available for all codec components, not only DocValues and Vectors, because merging of stored fields may also be improved by switching to SEQUENTIAL, due to higher read-ahead and fewer paging requests in the kernel.
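A rough, self-contained sketch of what such a pair of hooks could look like (all names here — ReadAdvice, AdviceAwareInput, the prepare/revert methods — are hypothetical illustrations, not Lucene API):

```java
public class MergeAdviceSketch {
    // Hypothetical advice values, mirroring madvise semantics.
    enum ReadAdvice { NORMAL, RANDOM, SEQUENTIAL }

    // Hypothetical input whose advice can be flipped in place;
    // in the proposal, all clones/slices would share this same backing state.
    static class AdviceAwareInput {
        ReadAdvice advice = ReadAdvice.RANDOM;
        void updateReadAdvice(ReadAdvice a) { advice = a; }
    }

    // The "make it ready for merging" / "revert to normal use" pair.
    static void prepareForMerge(AdviceAwareInput in) {
        in.updateReadAdvice(ReadAdvice.SEQUENTIAL);
    }

    static void revertAfterMerge(AdviceAwareInput in) {
        in.updateReadAdvice(ReadAdvice.RANDOM);
    }

    public static void main(String[] args) {
        AdviceAwareInput vecFile = new AdviceAwareInput();
        prepareForMerge(vecFile);
        System.out.println(vecFile.advice);  // SEQUENTIAL while merging
        revertAfterMerge(vecFile);
        System.out.println(vecFile.advice);  // RANDOM again for search
    }
}
```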

@uschindler
Contributor

@jpountz ping.

@jpountz
Contributor

jpountz commented Oct 17, 2024

That makes sense to me.

I wonder if we actually need to revert the advice back to normal in the end, or if we can optimistically assume that it's unlikely that we'll need to reclaim that RAM for something else before the next refresh picks up this segment.

In terms of hooking into existing APIs, the getMergeInstance() method of our file formats feels like a natural place to change the read advice.

@shatejas

@uschindler On a high level it makes sense to me. I have a couple of questions so that I can understand this better.

Add a method to IndexInput to change the IOContext (without cloning it),

I am interested in understanding when it's appropriate to clone. Based on the javadoc for IndexInput, for multithreaded use IndexInput must be cloned. My understanding is that merges run on a separate thread, and since the Readers are pooled and the same instance is used, I cloned it when I was benchmarking the solution. I would appreciate insights on why the IndexInput shouldn't be cloned in this case.

"revert to normal use"

In the benchmarking code, I did not revert it back, thinking the reader would be closed and a new reader would be opened with the intended IOContext (in this case RANDOM). Would you be able to share insights on reverting it back, considering there will be a new reader?

@uschindler
Contributor

uschindler commented Oct 18, 2024

@uschindler On a high level it makes sense to me. I have a couple of questions so that I can understand this better.

Add a method to IndexInput to change the IOContext (without cloning it),

I am interested in understanding when it's appropriate to clone. Based on the javadoc for IndexInput, for multithreaded use IndexInput must be cloned. My understanding is that merges run on a separate thread, and since the Readers are pooled and the same instance is used, I cloned it when I was benchmarking the solution. I would appreciate insights on why the IndexInput shouldn't be cloned in this case.

Cloning a reader won't clone any input; that is an entirely different thing. Don't worry about cloning or not for implementing this issue. The javadocs and usage pattern (when to clone) are more about stateful use of IndexInputs (they have a read position which can't be updated from multiple threads). If you want pure random access, you can get a view on it as a RandomAccessInput (which is sometimes used in Lucene). In all cases, the cloned inputs use exactly the same MemorySegments behind the scenes (in former days it was ByteBuffers, which were also duplicated for clones).

What I wanted to say is: when you change the read advice, it will affect all clones, too. Therefore it is not necessary to create a clone of the IndexInput. So basically it simplifies things: the CodecReader that is used for merging (and at the same time also for searching) can just be instructed to change the read advice on its backing IndexInput. That's relatively simple to implement and won't affect how merging currently works.
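The "clones share the same backing memory" point can be illustrated with plain java.nio (this is only an analogy for the IndexInput clone semantics described above): a duplicated ByteBuffer has its own independent position but shares the underlying storage.

```java
import java.nio.ByteBuffer;

public class SharedBackingDemo {
    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.allocate(4);
        ByteBuffer clone = original.duplicate(); // independent position, shared content

        original.put(0, (byte) 42);               // a write through the original...
        System.out.println(clone.get(0));         // ...is visible through the clone: 42

        clone.position(2);                        // moving the clone's position...
        System.out.println(original.position());  // ...leaves the original's at 0
    }
}
```

By analogy: per-clone state (the read position) is why multithreaded use requires cloning, while shared backing state is why a read-advice change would affect every clone at once.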

"revert to normal use"

In the benchmarking code, I did not revert it back, thinking the reader would be closed and a new reader would be opened with the intended IOContext (in this case RANDOM). Would you be able to share insights on reverting it back, considering there will be a new reader?

As this is just an advice: when done with merging, simply revert back. Opening new readers is too expensive and mostly not useful. In addition, you can't control whether searchers using the index in parallel will reopen their readers soon; @jpountz already discussed that. This largely depends on how often you reopen IndexReaders. So in general it would be a good idea to revert back to the original state if the IndexReader will still be used for a longer time. This really depends on the usage pattern.

Reverting to normal use is as simple to implement as the initial change. Just change the advice on any of the open IndexInputs, no matter whether it is shared by multiple readers, and revert back at the end. That's totally uncoupled from the internals of IndexReaders.

getMergeInstance() provided by the codec/formats API is the ideal place to change the read advice. The only question is if and how we revert back when done. This could be done by a wrapper (reverting on close) returned instead of the original reader.
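That wrapper idea could look roughly like this self-contained sketch (ReadAdvice and the input/wrapper types are hypothetical stand-ins, not Lucene classes): the merge instance switches the advice on construction and restores the previous advice on close.

```java
public class MergeWrapperSketch {
    enum ReadAdvice { RANDOM, SEQUENTIAL }

    // Stand-in for a file-backed input carrying a mutable read advice.
    static class VectorInput {
        ReadAdvice advice = ReadAdvice.RANDOM;
    }

    // Wrapper returned instead of the original reader for merging:
    // flips the advice up front and restores it when the merge closes it.
    static class MergeInstance implements AutoCloseable {
        private final VectorInput input;
        private final ReadAdvice previous;

        MergeInstance(VectorInput input) {
            this.input = input;
            this.previous = input.advice;
            input.advice = ReadAdvice.SEQUENTIAL;
        }

        @Override
        public void close() {
            input.advice = previous; // revert to normal use
        }
    }

    public static void main(String[] args) {
        VectorInput vec = new VectorInput();
        try (MergeInstance merge = new MergeInstance(vec)) {
            System.out.println(vec.advice); // SEQUENTIAL during the merge
        }
        System.out.println(vec.advice);     // RANDOM restored afterwards
    }
}
```

Capturing the previous advice rather than hard-coding RANDOM keeps the wrapper correct even if the original reader was opened with some other advice.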

Uwe

@shatejas

Understood. So by not cloning, we avoid any ambiguity about whether the original and the other clones are affected. Not cloning makes it clear that any thread using the reader will be affected.

With regards to reverting, it's restoring the previous state. The wrapper approach sounds good to me; I will try it out.

Thanks @uschindler and @jpountz for your inputs!

@navneet1v
Contributor Author

@uschindler and @jpountz, thanks for your inputs and the detailed explanation. @shatejas, I think all the required details are present, so are you going to raise a PR for this?

@uschindler and @jpountz, one more thing I would like to confirm: once the PR is raised, to what version of Lucene can the change be backported? Ideally I would like to get it backported to the 9.12 version, but I would like to know your inputs too.

@navneet1v
Contributor Author

Opening new readers is too expensive and mostly not useful.

@uschindler one question on this: is the reason you say opening new readers is expensive that readers mostly open a new IndexInput on a file? Also, one more question not particularly related to this GH issue: does calling new IndexInput(fileName) multiple times open the underlying file multiple times and map it at different memory addresses (if MMapDirectory is used), or is the file still mapped at the same memory address?

@shatejas

@shatejas I think all the required details are present, so are you going to raise a PR for this?

Yeah, I am working on it. I have the changes and am trying to figure out a good way to benchmark Lucene.

@uschindler
Contributor

uschindler commented Oct 30, 2024

Opening new readers is too expensive and mostly not useful.

@uschindler one question on this: is the reason you say opening new readers is expensive that readers mostly open a new IndexInput on a file? Also, one more question not particularly related to this GH issue: does calling new IndexInput(fileName) multiple times open the underlying file multiple times and map it at different memory addresses (if MMapDirectory is used), or is the file still mapped at the same memory address?

  1. When you open the same file multiple times, it is mmapped several times. Lucene does not do this and should not do this; for this we have clone(). So as said before: just change the madvise on the already open IndexInput.
  2. It is expensive to open new index inputs, for the reasons mentioned before.

Please stick with the approach: plan to implement an API to tell an existing IndexReader to switch to "merge" mode, so the underlying codec can then optionally change the madvise on the already open IndexInputs. When done, switch back.

@navneet1v
Contributor Author

So as said before: just change the madvise on the already open IndexInput.

@uschindler, sorry for the confusion. The plan was never to open multiple files; the question was more of a general question from my side to understand the behavior.

The implementation plan is still the same one you recommended. I hope this clarifies things.
