perf: teach SegmentReader to lazily open/read its various SegmentComponents #20

Merged
1 commit merged into main on Jan 17, 2025

Conversation

eeeebbbbrrrr

This overhauls SegmentReader to put its various components behind OnceLocks so that they are opened and read on first use, as opposed to when a SegmentReader is constructed -- which happens once for every segment when an Index is opened.

This breaks one of Tantivy's expectations: that an existing SegmentReader can still read from physical files that were deleted by a merge. That no longer holds, because the segment's physical files aren't opened until needed. As such, I've #[ignore]'d six tests that expose this problem.

From our (pg_search's) side of things, we don't really have physical files and don't need to rely on the filesystem/kernel to allow reading unlinked files that are still open.

Overall, this cuts down a significant number of disk reads during pg_search's query planning. With my test data it goes from 808 individual reads totalling 999,799 bytes to 18 reads totalling 814,514 bytes.

This reduces the time it takes to plan a simple query from about 1.4ms to 0.436ms -- roughly a 3.2x improvement.
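For illustration, here's a minimal sketch of the lazy-open pattern described above, using `std::sync::OnceLock`. The `FileSlice`/`open_component` names below are hypothetical stand-ins, not Tantivy's actual `SegmentReader` API:

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-in for a mapped segment file; Tantivy's real
// component types are more involved.
struct FileSlice(Vec<u8>);

fn open_component(name: &str) -> std::io::Result<FileSlice> {
    // The real reader would hit the directory/filesystem here; this
    // fake keeps the sketch self-contained.
    Ok(FileSlice(name.as_bytes().to_vec()))
}

struct LazySegmentReader {
    // The component is opened at most once, on first access, instead
    // of eagerly when the reader is constructed.
    postings: OnceLock<Arc<FileSlice>>,
}

impl LazySegmentReader {
    fn new() -> Self {
        Self { postings: OnceLock::new() }
    }

    fn postings(&self) -> Arc<FileSlice> {
        // get_or_init runs the closure only on the first call; later
        // calls return the cached Arc without touching disk again.
        Arc::clone(self.postings.get_or_init(|| {
            Arc::new(open_component("postings").expect("open postings component"))
        }))
    }
}

fn main() {
    let reader = LazySegmentReader::new();
    let first = reader.postings(); // opens the (fake) file on first use
    let again = reader.postings(); // served from the OnceLock, no I/O
    assert!(Arc::ptr_eq(&first, &again));
}
```

One consequence of deferring the open, matching the ignored tests noted above: if the underlying file is deleted before the first access, the lazy open fails where an eager open at construction time would have succeeded.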

eeeebbbbrrrr merged commit 7f86b47 into main Jan 17, 2025
5 checks passed
eeeebbbbrrrr deleted the eebbrr/segment-reader-lazy-reading branch January 17, 2025 20:29
eeeebbbbrrrr added a commit to paradedb/paradedb that referenced this pull request Jan 17, 2025
…and update tantivy rev

We defer loading a `LinkedBytesList`'s `BlockList` until needed. This eliminates a significant amount of overhead during query planning.

Additionally, we update our tantivy dependency rev to pick up the recent
lazy loading changes from paradedb/tantivy#20
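As a hedged illustration of that deferral, the same `OnceLock` pattern applies. The shapes below are hypothetical; the real `LinkedBytesList`/`BlockList` live in paradedb and differ in detail:

```rust
use std::sync::OnceLock;

// Hypothetical shapes for pg_search's LinkedBytesList/BlockList.
struct BlockList(Vec<u64>); // e.g. decoded block addresses

struct LinkedBytesList {
    header_bytes: Vec<u8>,
    // Decoded lazily: planning paths that never walk the blocks
    // never pay for decoding the block list.
    blocks: OnceLock<BlockList>,
}

impl LinkedBytesList {
    fn blocks(&self) -> &BlockList {
        self.blocks.get_or_init(|| {
            // Decode little-endian u64 block addresses on first use.
            let addrs = self
                .header_bytes
                .chunks_exact(8)
                .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
                .collect();
            BlockList(addrs)
        })
    }
}

fn main() {
    let mut header = Vec::new();
    header.extend_from_slice(&16u64.to_le_bytes());
    header.extend_from_slice(&32u64.to_le_bytes());
    let list = LinkedBytesList { header_bytes: header, blocks: OnceLock::new() };
    assert_eq!(list.blocks().0, vec![16, 32]); // decoded once, on demand
}
```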
eeeebbbbrrrr added a commit to paradedb/paradedb that referenced this pull request Jan 17, 2025
…and update tantivy rev (#2116)

philippemnoel pushed a commit that referenced this pull request Jan 24, 2025
perf: teach SegmentReader to lazily open/read its various SegmentComponents (#20)
