Enable file system buffer reuse for compaction prefetches #13187
Conversation
Just to clarify:
Why do we need the overlap buffer logic for reuse file system buffer + prefetch + non-async IO? Is it because there can actually be overlapping data, or is it just because the …
// readahead_size_.
uint64_t trimmed_readahead_size = 0;
if (n < readahead_size_) {
  trimmed_readahead_size = readahead_size_ - n;
Where in the original code does it show that compaction prefetches only readahead_size_ - n and not readahead_size_?
I remember it was very confusing to me what std::max(n, readahead_size_) in Prefetch() was intended for - why not n + readahead_size_?
Where in the original code does it show that compaction prefetches only readahead_size_ - n and not readahead_size_?

https://github.com/facebook/rocksdb/blob/main/file/file_prefetch_buffer.cc#L823

if (for_compaction) {
  s = Prefetch(opts, reader, offset, std::max(n, readahead_size_));

I think you have it correct. The std::max(n, readahead_size_) that you mentioned is the reason that the prefetch is only readahead_size_ - n. If n < readahead_size_, we read the original n bytes plus readahead_size_ - n of prefetched data.
@anand1976 I am in favor of unifying the logic so that we treat compaction readaheads the same as non-compaction readaheads (e.g. fetch the requested amount + readahead_size_ in either case). Do you have any objections / what are your thoughts? I think this would simplify our logic, and I don't see any downsides.
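For concreteness, here is a small sketch of the two sizing policies being compared. The byte counts are made up for illustration; only the std::max(n, readahead_size_) expression reflects the current compaction path quoted above.

#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
  // Illustrative values only: a 4 KiB request with an 8 KiB readahead setting.
  const uint64_t n = 4096;               // bytes the caller asked for
  const uint64_t readahead_size = 8192;  // stands in for readahead_size_

  // Current compaction path: Prefetch(..., std::max(n, readahead_size_)),
  // i.e. the total read is 8192 bytes, so only readahead_size - n = 4096
  // bytes beyond the requested data are actually prefetched.
  const uint64_t current_total = std::max(n, readahead_size);

  // Proposed unified behavior: the request size plus readahead_size,
  // matching how non-compaction readahead is sized.
  const uint64_t unified_total = n + readahead_size;

  std::cout << "current total read: " << current_total    // 8192
            << ", unified total read: " << unified_total  // 12288
            << std::endl;
  return 0;
}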
readahead_params.max_readahead_size = 8192;
readahead_params.num_buffers = 1;

FilePrefetchBuffer fpb(readahead_params, true, false, fs(), nullptr,
nit: add a comment with the parameter name right after each hard-coded value; same for TryReadFromCache()
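For illustration, a minimal sketch of the inline parameter-name comment style being requested. Configure() and its parameter names below are placeholders, not the actual FilePrefetchBuffer or TryReadFromCache signatures.

#include <cstdint>

// Placeholder function standing in for a call site with hard-coded arguments.
void Configure(uint64_t max_readahead_size, bool enable, bool track_min_offset) {
  (void)max_readahead_size;
  (void)enable;
  (void)track_min_offset;
}

int main() {
  // Each literal argument carries the name of the parameter it binds to, so
  // readers do not have to look up the signature to interpret true / false.
  Configure(/*max_readahead_size=*/8192,
            /*enable=*/true,
            /*track_min_offset=*/false);
  return 0;
}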
@@ -819,8 +819,20 @@ bool FilePrefetchBuffer::TryReadFromCacheUntracked(
  assert(reader != nullptr);
  assert(max_readahead_size_ >= readahead_size_);

  // We disallow async IO for compaction reads since their
  // latencies are not user-visible
nit: ... are more tolerable? We have stats for users to know when compaction reads are slow
I can update the wording. The idea was that a compaction read is a background operation, in contrast to a latency-sensitive, user-initiated scan.
ASSERT_EQ(stats->getAndResetTickerCount(PREFETCH_HITS), 1);
ASSERT_EQ(stats->getAndResetTickerCount(PREFETCH_BYTES_USEFUL),
          4096);  // 12288-16384
if (!for_compaction) {
why !for_compaction?
I can add a comment in the code for this.
We do not update the prefetch stats for compaction reads https://github.com/facebook/rocksdb/blob/main/file/file_prefetch_buffer.cc#L856-L857, so I need this check to keep the tests passing for all the different test parameter variations
} else if (!for_compaction) {
UpdateStats(/*found_in_buffer=*/true, n);
}
I don't know exactly why we don't do this. @anand1976 may know. I would guess that we don't want to "contaminate" the prefetch stats for non-compaction reads (PREFETCH_HITS and PREFETCH_BYTES_USEFUL), which we are presumably more interested in. I think we could have defined a separate set of prefetch stats specifically for compaction reads, but maybe we just did not care because compaction reads are all background ops anyway.
@hx235 With async IO, there are 2 buffers, so the overlap buffer is for when the requested data spans the 2 buffers (i.e. buffer 1 has the first half of the data and buffer 2 has the second half).

Without async IO, and with the file system buffer optimization, we use the overlap buffer when the main (and only) buffer has a "partial hit." Say the buffer contains offsets 100-200 and we request 150-300. Without the file system buffer optimization, we would normally "refit tail": move bytes 150-200 from the end of the buffer to the start of the buffer, and then request bytes 200-300 (plus any additional prefetching). With the file system buffer optimization, we don't do that because we are pointing to a buffer allocated by the file system. So instead we copy bytes 150-200 to the overlap buffer, request bytes 200-300 (plus any additional prefetching), and then copy the newly read bytes 200-300 into the overlap buffer as well. Ultimately, the overlap buffer contains exactly what the user asked for (bytes 150-300) and no more. We also avoid refetching data we already have (i.e. making a request for all of bytes 150-300).
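To make the non-async, file-system-buffer case concrete, here is a small standalone simulation of the copy sequence described above, using the same illustrative offsets. This is not the actual FilePrefetchBuffer code.

#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

int main() {
  // Simulated file contents so the final assertion can verify correctness.
  std::vector<char> file(400);
  for (size_t i = 0; i < file.size(); i++) file[i] = static_cast<char>(i);

  // The main buffer (owned by the file system, so we must not shuffle it in
  // place) currently holds offsets [100, 200); the caller asks for [150, 300).
  const uint64_t buf_offset = 100, buf_len = 100;
  const uint64_t req_offset = 150, req_len = 150;

  std::vector<char> main_buf(file.begin() + buf_offset,
                             file.begin() + buf_offset + buf_len);
  std::vector<char> overlap_buf(req_len);

  // Step 1: copy the partial hit, bytes [150, 200), out of the main buffer.
  const uint64_t hit_len = buf_offset + buf_len - req_offset;  // 50 bytes
  std::memcpy(overlap_buf.data(), main_buf.data() + (req_offset - buf_offset),
              hit_len);

  // Step 2: "read" the missing tail [200, 300) (plus any extra readahead,
  // not shown) and append it to the overlap buffer.
  std::memcpy(overlap_buf.data() + hit_len, file.data() + buf_offset + buf_len,
              req_len - hit_len);

  // The overlap buffer now holds exactly the requested bytes [150, 300),
  // and bytes [150, 200) were never fetched a second time.
  assert(std::memcmp(overlap_buf.data(), file.data() + req_offset, req_len) ==
         0);
  return 0;
}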
Summary
In #13177, I discussed an unsigned integer overflow issue that affects compaction reads inside FilePrefetchBuffer when we attempt to enable the file system buffer reuse optimization. In that PR, I disabled the optimization whenever for_compaction was true to eliminate the source of the bug.

This PR safely re-enables the optimization when for_compaction is true. We need to properly set the overlap buffer through PrefetchInternal rather than simply calling Prefetch. Prefetch assumes num_buffers_ is 1 (i.e. async IO is disabled), so historically it did not have any overlap buffer logic. What ends up happening (with the old bug) is that, when we try to reuse the file-system-provided buffer, the Prefetch method reads the remaining missing data. However, since we do not perform the RefitTail step when use_fs_buffer is true, we would normally rely on copying the partially relevant data into an overlap buffer. That overlap buffer logic was missing, so the final main buffer ends up storing data starting at an offset greater than the requested offset, and we effectively end up "throwing away" part of the requested data.
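A minimal sketch of that failure mode, with illustrative offsets only (it mirrors the overlap-buffer example earlier in the thread, but shows what the buggy path left behind rather than the fixed behavior).

#include <cstdint>
#include <iostream>

int main() {
  // Caller requests [150, 300); the reused buffer previously held [100, 200).
  const uint64_t req_offset = 150, req_end = 300;
  const uint64_t old_buf_end = 200;

  // Old buggy path: Prefetch() fetched only the missing tail and, without
  // RefitTail or an overlap-buffer copy, the main buffer simply became the
  // newly fetched range [200, 300).
  const uint64_t new_buf_offset = old_buf_end;

  // Bytes [150, 200) were requested but are no longer available anywhere:
  // they were effectively "thrown away".
  std::cout << "requested offset " << req_offset << ", buffer now starts at "
            << new_buf_offset << ", dropped bytes: "
            << (new_buf_offset - req_offset) << " of "
            << (req_end - req_offset) << std::endl;
  return 0;
}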
Test Plan

I removed the temporary test case from #13200 and incorporated the same test cases into my updated parameterized test case, which tests the valid combinations of use_async_prefetch and for_compaction.

I went further and added a randomized test case that simply tries to hit assertion failures and catch any missing areas in the logic.

I also added a test case for compaction reads without the file system buffer reuse optimization. I think it may be valuable to make a future PR that unifies a lot of these prefetch tests and parameterizes as much of them as possible. That way we can avoid writing duplicate tests and just loop over the different parameters for async IO, direct IO, file system buffer reuse, and for_compaction.
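For what it's worth, here is a rough sketch of the kind of parameterized test unification suggested above, using GoogleTest's Combine. The fixture, parameter accessors, and test body are hypothetical and not taken from the actual prefetch test code.

#include <tuple>

#include <gtest/gtest.h>

// Hypothetical fixture: each test runs once per combination of
// (use_async_prefetch, for_compaction, use_fs_buffer, use_direct_io).
class PrefetchCombinationTest
    : public ::testing::TestWithParam<std::tuple<bool, bool, bool, bool>> {
 protected:
  bool UseAsyncPrefetch() const { return std::get<0>(GetParam()); }
  bool ForCompaction() const { return std::get<1>(GetParam()); }
  bool UseFsBuffer() const { return std::get<2>(GetParam()); }
  bool UseDirectIO() const { return std::get<3>(GetParam()); }
};

TEST_P(PrefetchCombinationTest, Placeholder) {
  // Async IO is disallowed for compaction reads, so skip that combination.
  if (UseAsyncPrefetch() && ForCompaction()) {
    GTEST_SKIP() << "async IO + for_compaction is not a supported combination";
  }
  // A real test body would construct a FilePrefetchBuffer with these flags
  // and exercise Prefetch() / TryReadFromCache() accordingly.
  SUCCEED();
}

INSTANTIATE_TEST_CASE_P(
    PrefetchCombinations, PrefetchCombinationTest,
    ::testing::Combine(::testing::Bool(), ::testing::Bool(), ::testing::Bool(),
                       ::testing::Bool()));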