Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More optimized data structures in indexes & more granular storage parts #760

Open
novoj opened this issue Dec 6, 2024 · 0 comments
Open
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@novoj
Copy link
Collaborator

novoj commented Dec 6, 2024

Currently the database cannot handle high cardinality entities in some of the indexes. The problem is that write operations become slower and slower as the cardinality increases. The database has been optimised for reading, so everything has been taken into account. We often use simple arrays in our indexes, which need to be reallocated on each change - and this becomes a bottleneck when the array size goes to hundreds of thousands of elements or more.

A very simple spike test confirmed this obvious fact. Inserting 1 million elements into a simple array, when each iteration requires allocating array.length + 1 of the new array size, gets 50% slower every 100k elements - so inserting 1m elements takes almost 3 minutes on my machine, while inserting the same amount of elements into CompositeIntArray takes a few hundred milliseconds.

It's time to switch to B+ trees in our indexes in places that use simple arrays (like InvertedIndex) to speed up writes. Other changes may be found necessary along the way, but this issue is the spark that should ignite this movement.

Another change that should take place is the revision of the granularity of the index storage parts. Currently, when a single indexed attribute changes in a transaction, the entire attribute index must be replaced in storage at the end of the transaction, which quickly contaminates the storage layer and triggers a vacuuming process that rewrites the contents of the entire file. If only part of the index needed to be stored, this would greatly reduce the load on the system and prolong the time available in time machine.

@novoj novoj added the enhancement New feature or request label Dec 6, 2024
@novoj novoj added this to the Beta milestone Dec 6, 2024
@novoj novoj self-assigned this Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant