[ENH] Update compactor to flush total records upon compaction #3483
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
b0e05a7 to 6b32b1b
@@ -133,12 +133,23 @@ impl ArrowUnorderedBlockfileWriter {
    Box::new(ArrowBlockfileError::MigrationError(e)) as Box<dyn ChromaError>
})?;

let total_keys = self
We already loop over the blocks above, do we need a second iteration?
Is the suggestion to calculate the total before the loop and then iteratively add to the total for each delta?
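One way to read that suggestion, as a minimal sketch rather than the actual writer code: seed the total from the key counts already known for each block, then adjust it inside the existing per-delta loop. The `Delta` struct, the `existing_counts` map, and the function name below are hypothetical stand-ins, and the sketch assumes each block id appears at most once among the deltas.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a block delta; only the id and key count matter here.
struct Delta {
    block_id: u32,
    key_count: u64,
}

/// Computes the post-commit key total in the same pass that walks the deltas.
/// Assumption: every block id appears at most once in `deltas`.
fn total_keys_single_pass(existing_counts: &HashMap<u32, u64>, deltas: &[Delta]) -> u64 {
    // Seed with the counts recorded before this commit.
    let mut total: u64 = existing_counts.values().sum();
    for delta in deltas {
        // The per-delta commit work from the existing loop would run here.
        // Replace the block's previous contribution (0 for a new block) with its new count.
        let previous = existing_counts.get(&delta.block_id).copied().unwrap_or(0);
        total = total + delta.key_count - previous;
    }
    total
}
```

This avoids a second iteration over the blocks, at the cost of needing the prior per-block counts available before the loop starts.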
@@ -733,6 +744,7 @@ mod tests {
writer.set(prefix_2, key2, value2).await.unwrap();

let flusher = writer.commit::<&str, Vec<u32>>().await.unwrap();
assert_eq!(Some(2_u64), flusher.total_keys());
We need more testing around this.
Agreed. I wanted to make sure I wasn't missing the mark directionally before going deeper. Will write more tests next.
LogPosition                int64
CurrentCollectionVersion   int32
FlushSegmentCompactions    []*FlushSegmentCompaction
TotalRecordsPostCompaction uint64
Will you have to backfill, or is the frontend expected to handle null?
I don't think we need to backfill: the field defaults to 0 and will become correct over time as collections are updated. The potential overage is bounded by compaction frequency. Happy to discuss here.
Only one main comment: update the record count together with the log offset and collection version in the same query instead of doing two queries to the DB.
b794d14 to 25c1fb0
Description of changes
Updates the compactor to calculate the total records per collection and flush the count to the SysDB upon every compaction.
Summarize the changes made by this PR.
- The `FlushCollectionCompaction` struct includes `TotalRecordsPostCompaction`.
- `SysDB` populates the `total_records_post_compaction` column when receiving a flush.
- `ArrowBlockfileFlusher` contains a new attribute, `total_keys`.
- `ArrowUnorderedBlockfileWriter` sums the total keys using the `SparseIndexWriter` and returns an `ArrowBlockfileFlusher` with the summed count.
- `RegisterInput` contains a new attribute, `total_records_post_compaction`.
- When `CompactOrchestrator` handles a `CommitSegmentWriterOutput` and receives a `ChromaSegmentFlusher::RecordSegment`, it reads `total_keys()` and stores the value as an attribute on itself.
- `ChromaSegmentFlusher::RecordSegment` exposes `total_keys()` through its `ArrowBlockfileFlusher`.
- `CompactOrchestrator` sends its `num_records_last_compaction` value to a `RegisterInput` to be flushed to the `SysDB`.
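The flow described in the list above can be sketched roughly as follows. The type and field names come from the description, but every struct body, method signature, and the `handle_commit_output` and `register_input` helpers are simplified assumptions for illustration, not the real Chroma definitions.

```rust
// Simplified stand-ins: names mirror the PR description, but the fields and
// method signatures here are assumptions, not the actual Chroma code.

struct ArrowBlockfileFlusher {
    total_keys: Option<u64>,
}

impl ArrowBlockfileFlusher {
    fn total_keys(&self) -> Option<u64> {
        self.total_keys
    }
}

enum ChromaSegmentFlusher {
    RecordSegment(ArrowBlockfileFlusher),
    Other, // placeholder for all non-record segment flushers
}

#[derive(Debug, PartialEq)]
struct RegisterInput {
    total_records_post_compaction: u64,
}

#[derive(Default)]
struct CompactOrchestrator {
    num_records_last_compaction: u64,
}

impl CompactOrchestrator {
    /// Hypothetical handler: only a record segment flusher carries the count.
    fn handle_commit_output(&mut self, flusher: &ChromaSegmentFlusher) {
        if let ChromaSegmentFlusher::RecordSegment(inner) = flusher {
            if let Some(count) = inner.total_keys() {
                self.num_records_last_compaction = count;
            }
        }
    }

    /// Hypothetical helper: copies the stored count into the register input.
    fn register_input(&self) -> RegisterInput {
        RegisterInput {
            total_records_post_compaction: self.num_records_last_compaction,
        }
    }
}
```

In this sketch the count stays at its default of 0 until a record segment flusher reports a value, which matches the "defaults to 0" behavior discussed in the review thread.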
Test plan
How are these changes tested?
- `pytest` for python, `yarn test` for js, `cargo test` for rust
- `SysDB`
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?