Detect WAL hole #13226
db/log_reader.h
Outdated
// TODO(hx235): to revise `stop_replay_for_corruption_` in `LogReader` since
// we have `predecessor_wal_info_` to verify against the `PredecessorWALInfo`
// recorded in current WAL. If there is no WAL hole, we can revise
// `stop_replay_for_corruption_` to be false.
bool stop_replay_for_corruption_;
This means we can do MaybeReviseStopReplayForCorruption()
rocksdb/db/db_impl/db_impl_open.cc
Lines 1518 to 1521 in d957e1a
// In point-in-time recovery mode, if sequence id of log files are
// consecutive, we continue recovery despite corruption. This could
// happen when we open and write to a corrupted DB, where sequence id
// will start from the last sequence id we recovered.
So we can fix the issue mentioned in #12918 and re-enable recycle_log_file_num (WAL recycling)
db/log_format.h
Outdated
@@ -17,7 +17,7 @@
namespace ROCKSDB_NAMESPACE {
namespace log {

enum RecordType {
enum RecordType : int {
// Zero is reserved for preallocated files
Temporary note to reviewer: Please ignore the changes to db/log_format.h. They belong to #13225, which this PR has to be based on.
fd4c1a8 to 22d1c96
Looks good overall! A couple of issues and questions
db/db_wal_test.cc
Outdated
ASSERT_OK(Put("key_ignore1", "wal_to_recycle"));
ASSERT_OK(Put("key_ignore2", "wal_to_recycle"));
FlushOptions fo;
fo.wait = true;
`wait = true` is the default. Should be able to just `ASSERT_OK(Flush())`
Fixed
: log_number_(0),
size_bytes_(0),
last_seqno_recorded_(0),
initialized_(false) {}
I have a slight preference for initializing the fields to obviously fake extreme values (UINT64_MAX) to indicate "not populated with real info," instead of an `initialized_` field, because if those get used by accident, it's more obvious what the bug is. Just a personal preference.
Or if you want to keep the bool, it's probably worth asserting in some places that it's initialized.
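For illustration, the two alternatives in this comment (sentinel values versus asserting on use) could be combined roughly as follows. This is only a sketch: `PredecessorInfoSketch`, `kUnset`, and `IsPopulated()` are hypothetical names invented here, not the class or accessors in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the PR's predecessor-WAL record (not RocksDB code).
class PredecessorInfoSketch {
 public:
  // Obviously fake value meaning "not populated with real info".
  static constexpr uint64_t kUnset = std::numeric_limits<uint64_t>::max();

  PredecessorInfoSketch() = default;
  PredecessorInfoSketch(uint64_t log_number, uint64_t size_bytes,
                        uint64_t last_seqno_recorded)
      : log_number_(log_number),
        size_bytes_(size_bytes),
        last_seqno_recorded_(last_seqno_recorded) {}

  bool IsPopulated() const { return log_number_ != kUnset; }

  // Accessors assert before use, so an accidental read of an unpopulated
  // object fails loudly in debug builds.
  uint64_t GetLogNumber() const {
    assert(IsPopulated());
    return log_number_;
  }
  uint64_t GetSizeBytes() const {
    assert(IsPopulated());
    return size_bytes_;
  }
  uint64_t GetLastSeqnoRecorded() const {
    assert(IsPopulated());
    return last_seqno_recorded_;
  }

 private:
  uint64_t log_number_ = kUnset;
  uint64_t size_bytes_ = kUnset;
  uint64_t last_seqno_recorded_ = kUnset;
};
```

With sentinels, a value that leaks into a log message or comparison shows up as 18446744073709551615 rather than a plausible-looking 0, which is the debugging benefit the comment describes.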
Fixed by adding more assertions
db/log_format.h
Outdated
@@ -41,10 +41,14 @@ enum RecordType : uint8_t {
// User-defined timestamp sizes
kUserDefinedTimestampSizeType = 10,
kRecyclableUserDefinedTimestampSizeType = 11,

// For WAL verification
kPredecessorWALInfoType = 129,
Let's try to set (and note in a comment) a precedent that for all the values >= 10, the 1 bit just indicates/toggles whether it's recyclable. For that we should use either 128, 129 or 130, 131 for these two enums, not 129, 130.
I think we could use that for some code simplification in the future.
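As a rough sketch of the simplification this convention might allow: if every type with value >= 10 uses its lowest bit as the "recyclable" flag, the check becomes a bit test instead of a per-type switch. The enum below is an illustrative subset using the 130/131 pair from the suggestion; `RecordTypeSketch`, `kRecyclePredecessorWALInfoType`, `IsRecyclableSketch`, and `ToRecyclableSketch` are names assumed here, not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative subset only (not the real RocksDB enum).
enum RecordTypeSketch : uint8_t {
  kUserDefinedTimestampSizeType = 10,
  kRecyclableUserDefinedTimestampSizeType = 11,
  kPredecessorWALInfoType = 130,
  kRecyclePredecessorWALInfoType = 131,  // hypothetical name
};

// For values >= 10, the lowest bit alone says whether the type is recyclable.
// Values below 10 are outside the proposed convention and report false here.
constexpr bool IsRecyclableSketch(uint8_t type) {
  return type >= 10 && (type & 1) != 0;
}

// Maps a non-recyclable type to its recyclable twin by setting the low bit.
inline uint8_t ToRecyclableSketch(uint8_t type) {
  assert(type >= 10);  // the convention is only proposed for values >= 10
  return static_cast<uint8_t>(type | 1);
}

static_assert(!IsRecyclableSketch(kUserDefinedTimestampSizeType), "even");
static_assert(IsRecyclableSketch(kRecyclableUserDefinedTimestampSizeType),
              "odd");
```

The originally proposed pair 129, 130 would break this: 129 is odd, so the non-recyclable type would be misread as recyclable.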
Fixed
db/log_reader.cc
Outdated
@@ -329,6 +356,54 @@ bool Reader::ReadRecord(Slice* record, std::string* scratch,
return false;
}

void Reader::MaybeVerifyPredecessorWALInfo(
WALRecoveryMode wal_recovery_mode, Slice fragment,
const PredecessorWALInfo& expected_predecessor_wal_info) {
"expected" is confusing to me, because both what is observed in DB recovery and what is saved in the special WAL entry are "expected" to match the other. Similarly, `predecessor_wal_info_` is ambiguous with no comment.
How about `observed_predecessor_wal_info_` and `recorded_predecessor_wal_info`?
Fixed
if (expected_predecessor_log_number >= min_wal_number_to_keep_) {
std::string reason = "Missing WAL of log number " +
std::to_string(expected_predecessor_log_number);
ReportCorruption(fragment.size(), reason.c_str());
Are all of these corruption cases covered by the unit tests?
Let me cover all of them, though the size difference is hard to cover without triggering other corruption. I will see what I can do about it.
Fixed
include/rocksdb/options.h
Outdated
@@ -632,6 +632,25 @@ struct DBOptions {
// Default: false
bool track_and_verify_wals_in_manifest = false;

// EXPERIMENTAL
//
// If true, various information about predecessor WAL will be recorded in the
Nit: better to avoid passive voice ("will be recorded in the current WAL" -> "each new WAL will record")
Fixed
include/rocksdb/options.h
Outdated
// stricter requirement on WAL than the DB went through `RepairDB()` can
// normally meet
// 2. There exists no WAL hole where new WAL data presents while some old WAL
// data is missing
Suggest a sentence about the DB manifest indicating which WALs are obsolete and can be missing with no data loss. To answer the obvious question of how it is known OK for a predecessor WAL to be missing.
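To make the rule concrete, here is a simplified sketch of the check discussed in this thread: a missing predecessor is only a hole if the manifest-derived minimum WAL number says that WAL must still exist. Everything here (`WalInfoSketch`, `CheckPredecessorSketch`, the exact mismatch messages) is assumed for illustration and is not the PR's `MaybeVerifyPredecessorWALInfo()` code; only the "Missing WAL of log number" message mirrors the snippet under review.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical plain struct standing in for the recorded/observed WAL info.
struct WalInfoSketch {
  uint64_t log_number;
  uint64_t size_bytes;
  uint64_t last_seqno;
};

// `recorded` is what the current WAL says about its predecessor.
// `observed` is what recovery actually saw (nullptr if no earlier WAL was
// read). Returns an empty string when consistent, else a corruption reason.
inline std::string CheckPredecessorSketch(const WalInfoSketch& recorded,
                                          const WalInfoSketch* observed,
                                          uint64_t min_wal_number_to_keep) {
  if (observed == nullptr || observed->log_number != recorded.log_number) {
    // The predecessor was not seen. That is acceptable only if it is
    // obsolete, i.e. older than the oldest WAL the DB still needs.
    if (recorded.log_number >= min_wal_number_to_keep) {
      return "Missing WAL of log number " +
             std::to_string(recorded.log_number);
    }
    return "";
  }
  if (observed->last_seqno != recorded.last_seqno) {
    return "Mismatched last sequence number of WAL " +
           std::to_string(recorded.log_number);
  }
  if (observed->size_bytes != recorded.size_bytes) {
    return "Mismatched size of WAL " + std::to_string(recorded.log_number);
  }
  return "";
}

// Small usage example exercising the obsolete and non-obsolete cases.
inline void ExampleUsageSketch() {
  const WalInfoSketch recorded{7, 4096, 120};
  // WAL 7 is still required (7 >= 5) but was never observed: a hole.
  assert(CheckPredecessorSketch(recorded, nullptr, 5) ==
         "Missing WAL of log number 7");
  // WAL 7 is obsolete (7 < 8), so its absence loses no data.
  assert(CheckPredecessorSketch(recorded, nullptr, 8).empty());
}
```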
Fixed
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@hx235 has updated the pull request. You must reimport the pull request before landing.
Context/Summary:
This PR provides a new option `track_and_verify_wals` to detect and handle a WAL hole, where new WAL data is present while some old WAL data is missing, as well as a DB opened with no WAL. It's for #12488. It's intended to be a future replacement for `track_and_verify_wals_in_manifest` because of its simplicity, better handling of WAL holes in `WALRecoveryMode::kPointInTimeRecovery`, and potential to cover more scenarios for `WALRecoveryMode::kTolerateCorruptedTailRecords/kAbsoluteConsistency` (in future PRs).
The verification is done in `LogReader::MaybeVerifyPredecessorWALInfo()` and tracking is done in `log::Writer::MaybeAddPredecessorWALInfo()`. This PR also groups common utilities in `log::Writer` into the functions `MaybeHandleSeenFileWriterError()` and `MaybeSwitchToNewBlock()` to avoid adding redundant code.
Test:
`log::Writer`. Below benchmark shows no regression.