perf: teach SegmentReader to lazily open/read its various SegmentComponents #20

Merged
1 commit merged into main on Jan 17, 2025

Conversation

eeeebbbbrrrr

This overhauls SegmentReader to put its various components behind OnceLocks so that they are opened and read on first use, as opposed to when a SegmentReader is constructed -- which happens once for every segment when an Index is opened.

This breaks one of Tantivy's expectations: that an existing SegmentReader can still read from physical files that were deleted by a merge. That no longer holds, because the segment's physical files aren't opened until needed. As such, I've #[ignore]'d six tests that expose this problem.

From our (pg_search's) side of things, we don't really have physical files and don't need to rely on the filesystem/kernel to allow reading unlinked files that are still open.

Overall, this cuts down a significant number of disk reads during pg_search's query planning. With my test data it goes from 808 individual reads totalling 999,799 bytes to 18 reads totalling 814,514 bytes.

This reduces the time it takes to plan a simple query from about 1.4ms to 0.436ms -- roughly a 3.2x improvement.
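For illustration, here's a minimal sketch of the lazy-open pattern described above, using `std::sync::OnceLock`. The `FileSlice`/`open_component` names below are hypothetical stand-ins, not Tantivy's actual `SegmentReader` API:

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-in for a mapped segment file; Tantivy's real
// component types are more involved.
struct FileSlice(Vec<u8>);

fn open_component(name: &str) -> std::io::Result<FileSlice> {
    // The real reader would hit the directory/filesystem here; this
    // fake keeps the sketch self-contained.
    Ok(FileSlice(name.as_bytes().to_vec()))
}

struct LazySegmentReader {
    // The component is opened at most once, on first access, instead
    // of eagerly when the reader is constructed.
    postings: OnceLock<Arc<FileSlice>>,
}

impl LazySegmentReader {
    fn new() -> Self {
        Self { postings: OnceLock::new() }
    }

    fn postings(&self) -> Arc<FileSlice> {
        // get_or_init runs the closure only on the first call; later
        // calls return the cached Arc without touching disk again.
        Arc::clone(self.postings.get_or_init(|| {
            Arc::new(open_component("postings").expect("open postings component"))
        }))
    }
}

fn main() {
    let reader = LazySegmentReader::new();
    let first = reader.postings(); // opens the (fake) file on first use
    let again = reader.postings(); // served from the OnceLock, no I/O
    assert!(Arc::ptr_eq(&first, &again));
}
```

One consequence of deferring the open, matching the ignored tests noted above: if the underlying file is deleted before the first access, the lazy open fails where an eager open at construction time would have succeeded.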

eeeebbbbrrrr merged commit 7f86b47 into main Jan 17, 2025
5 checks passed
eeeebbbbrrrr deleted the eebbrr/segment-reader-lazy-reading branch January 17, 2025 20:29
eeeebbbbrrrr added a commit to paradedb/paradedb that referenced this pull request Jan 17, 2025
…and update tantivy rev

We defer loading a `LinkedBytesList`'s `BlockList` until needed. This eliminates a significant amount of overhead during query planning.

Additionally, we update our tantivy dependency rev to pick up the recent
lazy loading changes from paradedb/tantivy#20
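As a hedged illustration of that deferral, the same `OnceLock` pattern applies. The shapes below are hypothetical; the real `LinkedBytesList`/`BlockList` live in paradedb and differ in detail:

```rust
use std::sync::OnceLock;

// Hypothetical shapes for pg_search's LinkedBytesList/BlockList.
struct BlockList(Vec<u64>); // e.g. decoded block addresses

struct LinkedBytesList {
    header_bytes: Vec<u8>,
    // Decoded lazily: planning paths that never walk the blocks
    // never pay for decoding the block list.
    blocks: OnceLock<BlockList>,
}

impl LinkedBytesList {
    fn blocks(&self) -> &BlockList {
        self.blocks.get_or_init(|| {
            // Decode little-endian u64 block addresses on first use.
            let addrs = self
                .header_bytes
                .chunks_exact(8)
                .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
                .collect();
            BlockList(addrs)
        })
    }
}

fn main() {
    let mut header = Vec::new();
    header.extend_from_slice(&16u64.to_le_bytes());
    header.extend_from_slice(&32u64.to_le_bytes());
    let list = LinkedBytesList { header_bytes: header, blocks: OnceLock::new() };
    assert_eq!(list.blocks().0, vec![16, 32]); // decoded once, on demand
}
```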
eeeebbbbrrrr added a commit to paradedb/paradedb that referenced this pull request Jan 17, 2025
…and update tantivy rev (#2116)

philippemnoel pushed a commit that referenced this pull request Jan 24, 2025
perf: teach SegmentReader to lazily open/read its various SegmentComponents (#20)
