Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect WAL hole #13226

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Detect WAL hole #13226

wants to merge 1 commit into from

Conversation

hx235
Copy link
Contributor

@hx235 hx235 commented Dec 19, 2024

Context/Summary:

This PR provides a new Options track_and_verify_wals to detect and handle WAL hole where new WAL data presents while some old WAL data is missing as well as db opened with no WAL. It's for #12488.

It's intended to be a future replacement to track_and_verify_wals_in_manifest for its simplicity, better handling of WAL hole in WALRecoveryMode::kPointInTimeRecovery and potentials to cover more scenarios for WALRecoveryMode::kTolerateCorruptedTailRecords/kAbsoluteConsistency(in future PRs).

The verification is done in LogReader::MaybeVerifyPredecessorWALInfo() and tracking is done in log::Writer::MaybeAddPredecessorWALInfo(). This PR also groups common utilities in log::Writer into functions MaybeHandleSeenFileWriterError(), MaybeSwitchToNewBlock() to avoid adding redundant code

Test:

  • New UT
  • Integrate into existing UT
  • Stress test for 0.5 hour and will keep monitoring after landing
  • db bench
    • The only potential performance implication it has is to the write path since now we keep track of the last seqno recorded in the WAL in log::Writer. Below benchmark show no regression.
./db_bench --benchmarks=fillrandom[-X3] --num=2500000 --db=/dev/shm/db_bench_new --disable_auto_compactions=1 --threads=1 --enable_pipelined_write=0 --disable_wal=0 --track_and_verify_wals=1

Pre
fillrandom [AVG    3 runs] : 310517 (± 5641) ops/sec;   34.4 (± 0.6) MB/sec
fillrandom [MEDIAN 3 runs] : 308848 ops/sec;   34.2 MB/sec

Post
fillrandom [AVG    3 runs] : 311469 (± 4096) ops/sec;   34.5 (± 0.5) MB/sec
fillrandom [MEDIAN 3 runs] : 311961 ops/sec;   34.5 MB/sec

db/log_reader.h Outdated
Comment on lines 158 to 164
// TODO(hx235): to revise `stop_replay_for_corruption_` in `LogReader` since
// we have `predecessor_wal_info_` to verify against the `PredecessorWALInfo`
// recorded in current WAL. If there is no WAL hole, we can revise
// `stop_replay_for_corruption_` to be false.
bool stop_replay_for_corruption_;
Copy link
Contributor Author

@hx235 hx235 Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we can do MaybeReviseStopReplayForCorruption()

// In point-in-time recovery mode, if sequence id of log files are
// consecutive, we continue recovery despite corruption. This could
// happen when we open and write to a corrupted DB, where sequence id
// will start from the last sequence id we recovered.
in a better way that is compatible with disable WAL, file ingestion when WAL is recycled (in next PR).

So we can fix the issue mentioned #12918 and re-enable ecycle_log_file_num (wal recycling)

db/log_format.h Outdated
@@ -17,7 +17,7 @@
namespace ROCKSDB_NAMESPACE {
namespace log {

enum RecordType {
enum RecordType : int {
// Zero is reserved for preallocated files
Copy link
Contributor Author

@hx235 hx235 Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary note to reviewer: Please ignore changes to db/log_format.h. It belongs to #13225 this PR has to base onto.

@hx235 hx235 marked this pull request as draft December 19, 2024 19:07
@hx235 hx235 force-pushed the wal_hole branch 12 times, most recently from fd4c1a8 to 22d1c96 Compare December 20, 2024 10:19
@hx235 hx235 marked this pull request as ready for review December 20, 2024 19:50
@hx235 hx235 requested a review from pdillinger December 20, 2024 19:50
Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall! A couple of issues and questions

ASSERT_OK(Put("key_ignore1", "wal_to_recycle"));
ASSERT_OK(Put("key_ignore2", "wal_to_recycle"));
FlushOptions fo;
fo.wait = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait = true is the default. Should be able to just ASSERT_OK(Flush())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

: log_number_(0),
size_bytes_(0),
last_seqno_recorded_(0),
initialized_(false) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a slight preference for initializing the fields to obviously fake extreme values (UINT64_MAX) to indicate "not populated with real info," instead of an initialized_ field, because if those get used by accident, it's more obvious what the bug is. Just a personal preference.

Or if you want to keep the bool, it's probably worth asserting in some places that it's initialized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by adding more assertion

db/log_format.h Outdated
@@ -41,10 +41,14 @@ enum RecordType : uint8_t {
// User-defined timestamp sizes
kUserDefinedTimestampSizeType = 10,
kRecyclableUserDefinedTimestampSizeType = 11,

// For WAL verification
kPredecessorWALInfoType = 129,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to set (and note in a comment) a precedent that for all the values >= 10, the 1 bit just indicates/toggles whether it's recyclable. For that we should use either 128, 129 or 130, 131 for these two enums, not 129, 130.

I think we could use that for some code simplification in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

db/log_reader.cc Outdated
@@ -329,6 +356,54 @@ bool Reader::ReadRecord(Slice* record, std::string* scratch,
return false;
}

void Reader::MaybeVerifyPredecessorWALInfo(
WALRecoveryMode wal_recovery_mode, Slice fragment,
const PredecessorWALInfo& expected_predecessor_wal_info) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"expected" is confusing to me, because both what is observed in DB recovery and what is saved in the special WAL entry are "expected" to match the other. Similarly, predecessor_wal_info_ is ambiguous with no comment.

How about observed_predecessor_wal_info_ and recorded_predecessor_wal_info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if (expected_predecessor_log_number >= min_wal_number_to_keep_) {
std::string reason = "Missing WAL of log number " +
std::to_string(expected_predecessor_log_number);
ReportCorruption(fragment.size(), reason.c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all of these corruption cases covered by the unit tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me cover all of them though the size difference is hard to cover without triggering other corruption. I will see what I can do with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -632,6 +632,25 @@ struct DBOptions {
// Default: false
bool track_and_verify_wals_in_manifest = false;

// EXPERIMENTAL
//
// If true, various information about predecessor WAL will be recorded in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: better to avoid passive voice ("will be recorded in the current WAL" -> "each new WAL will record")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// stricter requirement on WAL than the DB went through `RepariDB()` can
// normally meet
// 2. There exists no WAL hole where new WAL data presents while some old WAL
// data is missing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest a sentence about the DB manifest indicating which WALs are obsolete and can be missing with no data loss. To answer the obvious question of how it is known OK for a predecessor WAL to be missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants