Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elasticsearch: Async handling of indexing/deletion requests #8465

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

tobias-hotz
Copy link
Contributor

@tobias-hotz tobias-hotz commented Oct 24, 2024

Currently, indexing is batched as a global queue of 200 elements (except when forceRefreshReaders is true). When the threshold is reached, the entries are submitted, and the thread submitting the 200th element waits until the elasticsearch returns the result of the request.
With deletion, we currently always send one request per deletion request and always use the deleteByQuery method.

The current design has a number of flaws:

  • When multiple threads index metadata and both do not use forceRefreshReaders, the queue could grow past 200 (e.g., two threads add at about the same time, and at the time the queue size is checked, the queue already has 201 elements). This causes the queue to grow indefinitely, meaning no further entries are submitted before an explicit call to sendDocumentsToIndex. This can cause Out Of Memory errors (we observed this when using multiple reindexing threads)
  • An indexing thread is spending a significant amount of time waiting for the elasticsearch, while it could already prepare the next batch for the ES to be consumed
  • As we wait after every delete call, deleting many entries takes a lot more time as needed
  • Due to deletion always using deleteByQuery (even when we could delete by uuid), no batching is possible
  • When multiple threads index metadata and both do not use forceRefreshReaders, the queue contains mixed entries of Thread 1 and Thread 2. At most call sites, after all entries have been indexed, sendDocumentsToIndex is called. It is currently possible that Thread 1 is currently sending a batch that contains entries from Thread 2 as well and Thread 2 already finished submitting the rest of the batch, causing Thread 2 to think that all its entries have been sent to Elasticsearch, but they are still being submitted by Thread 1. This is a very rare problem, though

This PR solves all of these problems. The main takeaway is that it significantly improves the performance of deleting and indexing many entries.

This is accomplished by introducing a IIndexSubmitter and IDeletionSubmitter. These new classes handle how new entries are sent to the index. The direct implementations (DirectIndexSubmitter and DirectDeletionSubmitter) are similar to how the old forceIndexChanges parameter worked in that they directly send the data to the index.
With the use of the BatchingIndexSubmitter and BatchingDeletionSubmittor, chunks are sent periodically to the elasticsearch (just as before), but a local queue is used, and we do not wait for the elasticearch. The index responses are handled asynchronously on a different thread instead. We still guarantee that the indexing will be complete once the whole block is done, as the close method sends the rest of the local queue and waits for all async responses to be complete.

We made some performance measurements on a smaller scale. Here is the average result of a bunch of runs with different CSW harvesters:

Operation Average time before change Average time after change Improvement in %
Harvesting ~3600 entries 11 minutes and 17 seconds 06 minutes and 30 seconds ~74%
Harvesting ~1250 entries 06 minutes and 33 seconds 04 minutes and 28 seconds ~47%
Harvesting ~700 entries 03 minutes and 14 seconds 01 minutes and 30 seconds ~131%
Harvesting ~100 entries 00 minutes and 29 seconds 00 minutes and 29 seconds ~35%
Reindexing ~5700 entries 04 minutes and 39 seconds 02 minutes and 40 seconds ~74%
Deleting ~3600 entries 01 minutes and 35 seconds 00 minutes and 20 seconds ~475%
Deleting ~1250 entries 00 minutes and 38 seconds 00 minutes and 09 seconds ~433%
Deleting ~700 entries 00 minutes and 20 seconds 00 minutes and 04 seconds ~500%
Deleting ~100 entries 00 minutes and 05 seconds 00 minutes and 01 seconds ~500%

As you can see, there are very significant performance gains. These numbers were recorded on a local machine, if you use a remote index on a different machine, the effect may be even higher due to latency/throughput limitations.

Checklist

  • I have read the contribution guidelines
  • Pull request provided for main branch, backports managed with label
  • Good housekeeping of code, cleaning up comments, tests, and documentation
  • Clean commit history broken into understandable chucks, avoiding big commits with hundreds of files, cautious of reformatting and whitespace changes
  • Clean commit messages, longer verbose messages are encouraged
  • API Changes are identified in commit messages
  • Testing provided for features or enhancements using automatic tests
  • User documentation provided for new features or enhancements in manual
  • Build documentation provided for development instructions in README.md files
  • Library management using pom.xml dependency management. Update build documentation with intended library use and library tutorials or documentation

Funded by LGL BW

This improves indexing time, especially when the ES connection has a high latency or the ES load is high.
@tobias-hotz tobias-hotz marked this pull request as draft October 24, 2024 13:51
This causes issues with some tests. This issue was already present before the changes
@fxprunayre
Copy link
Member

Interesting work @tobias-hotz. We are also investigating how to improve indexing performances for GeoNetwork 5. See draft work geonetwork/geonetwork#19 and
https://github.com/geonetwork/geonetwork/blob/main/src/modules/indexing/src/main/java/org/geonetwork/indexing/IndexingService.java#L108
Maybe some ideas can be shared, or tracking some GN4 hot spots to avoid to make similar mistakes in GN5 can be nice.

@tobias-hotz
Copy link
Contributor Author

Hi @fxprunayre
thanks for taking a look.
The idea for this is to improve performance of indexing while breaking none of the existing behaviour. Some code paths assume that an element is available in the index right after the call to index, so this has to be taken into account.
Also, the index preparation is still done in the main thread, as it is not unlikely that some code relies on this, and given how big the GN4 Codebase is, I'd rather go with this.

My first approach was to just return the Future of the index response to the caller, but that was getting pretty messy and it was easy to miss a call site. That's why I chose this approach.
For GN5, it could also be benificial to move the index preparation stuff to another thread.

This change allows allows the multithreaded reindexing to work again (which is somewhat broken at the moment, mainly because of concurrency issues with the single document buffer for the bulk requests). This reduces the time spend on reindexing by a lot. So support for multithreaded indexing is something GN5 should also provide out of the box.

@CLAassistant
Copy link

CLAassistant commented Dec 8, 2024

CLA assistant check
All committers have signed the CLA.

@tobias-hotz tobias-hotz changed the title Elasticsearch Indexing: Handle index responses asynchronously Elasticsearch: Async handling of indexing/deletion requests Dec 10, 2024
@tobias-hotz tobias-hotz marked this pull request as ready for review December 10, 2024 14:36
…indexing

# Conflicts:
#	services/src/main/java/org/fao/geonet/api/processing/DatabaseProcessUtils.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants