Elasticsearch: Async handling of indexing/deletion requests #8465
This improves indexing time, especially when the ES connection has a high latency or the ES load is high.
…in case on of not properly closed submittors
702dc20 to 83ea448
This causes issues with some tests. This issue was already present before the changes
ddb7766 to 948575b
767194b to 61fb7ad
Interesting work @tobias-hotz. We are also investigating how to improve indexing performance for GeoNetwork 5. See draft work geonetwork/geonetwork#19 and
Hi @fxprunayre, my first approach was to just return the Future of the index response to the caller, but that was getting pretty messy and it was easy to miss a call site. That's why I chose this approach. This change allows the multithreaded reindexing to work again (it is somewhat broken at the moment, mainly because of concurrency issues with the single document buffer for the bulk requests). This reduces the time spent on reindexing by a lot. So support for multithreaded indexing is something GN5 should also provide out of the box.
…indexing # Conflicts: # services/src/main/java/org/fao/geonet/api/processing/DatabaseProcessUtils.java
Currently, indexing is batched as a global queue of 200 elements (except when forceRefreshReaders is true). When the threshold is reached, the entries are submitted, and the thread submitting the 200th element waits until Elasticsearch returns the result of the request.
With deletion, we currently always send one request per deletion request and always use the deleteByQuery method.
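As a rough illustration of the behaviour described above, here is a minimal sketch of a shared buffer that flushes synchronously once it holds 200 entries. The class name BlockingGlobalBatcher and the bulkSender callback are hypothetical stand-ins, not the actual GeoNetwork code, and the single lock is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for the existing behaviour: one shared buffer, synchronous flush. */
class BlockingGlobalBatcher {
    static final int THRESHOLD = 200; // batch size named in the description

    private final List<String> buffer = new ArrayList<>();
    // Assumption: the callback performs a blocking bulk request against the index.
    private final Consumer<List<String>> bulkSender;

    BlockingGlobalBatcher(Consumer<List<String>> bulkSender) {
        this.bulkSender = bulkSender;
    }

    /** Adds a document; the caller that adds the 200th entry waits for the whole round trip. */
    synchronized void index(String document) {
        buffer.add(document);
        if (buffer.size() >= THRESHOLD) {
            flush();
        }
    }

    /** Sends whatever is buffered and returns only after the sender has returned. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        bulkSender.accept(batch); // blocks the calling thread until the request completes
    }
}
```

In this sketch, whichever thread happens to add the 200th entry absorbs the full latency of the bulk request, which is the cost the rest of this description is concerned with.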
The current design has a number of flaws:
This PR solves all of these problems. The main takeaway is that it significantly improves the performance of deleting and indexing many entries.
This is accomplished by introducing an IIndexSubmitter and IDeletionSubmitter. These new classes handle how new entries are sent to the index. The direct implementations (DirectIndexSubmitter and DirectDeletionSubmitter) are similar to how the old forceIndexChanges parameter worked in that they directly send the data to the index.

With the use of the BatchingIndexSubmitter and BatchingDeletionSubmittor, chunks are sent periodically to Elasticsearch (just as before), but a local queue is used, and we do not wait for Elasticsearch. The index responses are handled asynchronously on a different thread instead. We still guarantee that the indexing will be complete once the whole block is done, as the close method sends the rest of the local queue and waits for all async responses to be complete.

We made some performance measurements on a smaller scale. Here is the average result of a bunch of runs with different CSW harvesters:
As you can see, there are very significant performance gains. These numbers were recorded on a local machine; if you use a remote index on a different machine, the effect may be even higher due to latency/throughput limitations.
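The submitter idea described above can be sketched roughly as follows. This is a simplified illustration and not the code in this PR: the names IndexSubmitterSketch, DirectSubmitterSketch and BatchingSubmitterSketch, the String document type and the bulkSender function are all assumptions made for the example. The sketch also assumes each submitter is owned by a single thread, so the local queue needs no locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical, simplified shape of a submitter; the real interfaces may differ. */
interface IndexSubmitterSketch extends AutoCloseable {
    void submit(String document);

    @Override
    void close(); // narrowed: no checked exception in this sketch
}

/** Sends every document immediately and waits for the response. */
class DirectSubmitterSketch implements IndexSubmitterSketch {
    private final Function<List<String>, CompletableFuture<Void>> bulkSender;

    DirectSubmitterSketch(Function<List<String>, CompletableFuture<Void>> bulkSender) {
        this.bulkSender = bulkSender;
    }

    @Override
    public void submit(String document) {
        bulkSender.apply(List.of(document)).join(); // blocks per document
    }

    @Override
    public void close() {
        // nothing is buffered, so there is nothing to flush
    }
}

/** Buffers locally, sends chunks without waiting, and only waits in close(). */
class BatchingSubmitterSketch implements IndexSubmitterSketch {
    private final int chunkSize;
    private final Function<List<String>, CompletableFuture<Void>> bulkSender;
    private final List<String> localQueue = new ArrayList<>();
    private final List<CompletableFuture<Void>> inFlight = new ArrayList<>();

    BatchingSubmitterSketch(int chunkSize,
                            Function<List<String>, CompletableFuture<Void>> bulkSender) {
        this.chunkSize = chunkSize;
        this.bulkSender = bulkSender;
    }

    @Override
    public void submit(String document) {
        localQueue.add(document);
        if (localQueue.size() >= chunkSize) {
            sendChunk();
        }
    }

    private void sendChunk() {
        if (localQueue.isEmpty()) {
            return;
        }
        List<String> chunk = new ArrayList<>(localQueue);
        localQueue.clear();
        inFlight.add(bulkSender.apply(chunk)); // remember the future, do not wait here
    }

    /** Flushes the remainder and waits until every outstanding response has arrived. */
    @Override
    public void close() {
        sendChunk();
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }
}
```

Used in a try-with-resources block, the batching variant returns from submit without waiting for the index, while leaving the block triggers close, which sends the remaining entries and joins all pending responses. A deletion submitter could plausibly follow the same shape, though that is an assumption of this sketch.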
Checklist
main branch, backports managed with label
README.md files
pom.xml dependency management. Update build documentation with intended library use and library tutorials or documentation
Funded by LGL BW