Detect WAL hole #13226

hx235 · 2024-12-19T05:13:32Z

Context/Summary:

This PR adds a new option, track_and_verify_wals, to detect and handle a WAL hole (newer WAL data is present while some older WAL data is missing) as well as a DB opened with no WAL. It addresses #12488.

It is intended as a future replacement for track_and_verify_wals_in_manifest because it is simpler, handles WAL holes better in WALRecoveryMode::kPointInTimeRecovery, and has the potential to cover more scenarios for WALRecoveryMode::kTolerateCorruptedTailRecords and kAbsoluteConsistency (in future PRs).

Verification is done in log::Reader::MaybeVerifyPredecessorWALInfo() and tracking is done in log::Writer::MaybeAddPredecessorWALInfo(). This PR also factors common utilities in log::Writer into the functions MaybeHandleSeenFileWriterError() and MaybeSwitchToNewBlock() to avoid adding redundant code.
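To illustrate the kind of check described above, here is a minimal, self-contained sketch. It is not the code in this PR: `PredecessorInfoSketch` and `CheckPredecessorSketch` are hypothetical names, and the decision order is an assumption modeled on the fields and the "Missing WAL of log number" message quoted later in this review.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for what a WAL records about its predecessor.
struct PredecessorInfoSketch {
  uint64_t log_number = 0;
  uint64_t size_bytes = 0;
  uint64_t last_seqno_recorded = 0;
};

// Returns a corruption reason if a WAL hole is detected, std::nullopt otherwise.
//   recorded: predecessor info stored in the WAL currently being read.
//   observed: what recovery saw for the previously replayed WAL, or
//             std::nullopt if no earlier WAL was replayed.
//   min_wal_number_to_keep: WALs below this number are assumed obsolete,
//             so they may be missing without data loss.
std::optional<std::string> CheckPredecessorSketch(
    const PredecessorInfoSketch& recorded,
    const std::optional<PredecessorInfoSketch>& observed,
    uint64_t min_wal_number_to_keep) {
  const bool predecessor_seen =
      observed.has_value() && observed->log_number == recorded.log_number;
  if (!predecessor_seen) {
    // The predecessor was not replayed. That is only a hole if it still
    // holds data the DB needs.
    if (recorded.log_number >= min_wal_number_to_keep) {
      return "Missing WAL of log number " +
             std::to_string(recorded.log_number);
    }
    return std::nullopt;
  }
  // The predecessor was replayed; its contents must match what was recorded.
  if (observed->last_seqno_recorded != recorded.last_seqno_recorded) {
    return std::string("Mismatched last sequence number in predecessor WAL");
  }
  if (observed->size_bytes != recorded.size_bytes) {
    return std::string("Mismatched size of predecessor WAL");
  }
  return std::nullopt;
}
```

In this sketch a missing predecessor is reported only when its log number is at or above `min_wal_number_to_keep`; an obsolete predecessor may be absent without being treated as a hole.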

Test:

New UT
Integrate into existing UT
Stress test for 0.5 hour and will keep monitoring after landing
db bench
- The only potential performance impact is on the write path, since log::Writer now keeps track of the last seqno recorded in the WAL. The benchmark below shows no regression.

./db_bench --benchmarks=fillrandom[-X3] --num=2500000 --db=/dev/shm/db_bench_new --disable_auto_compactions=1 --threads=1 --enable_pipelined_write=0 --disable_wal=0 --track_and_verify_wals=1

Pre
fillrandom [AVG    3 runs] : 310517 (± 5641) ops/sec;   34.4 (± 0.6) MB/sec
fillrandom [MEDIAN 3 runs] : 308848 ops/sec;   34.2 MB/sec

Post
fillrandom [AVG    3 runs] : 311469 (± 4096) ops/sec;   34.5 (± 0.5) MB/sec
fillrandom [MEDIAN 3 runs] : 311961 ops/sec;   34.5 MB/sec

hx235 · 2024-12-19T05:59:16Z

db/log_reader.h

+  // TODO(hx235): to revise `stop_replay_for_corruption_` in `LogReader` since
+  // we have `predecessor_wal_info_` to verify against the `PredecessorWALInfo`
+  // recorded in current WAL. If there is no WAL hole, we can revise
+  // `stop_replay_for_corruption_` to be false.
+  bool stop_replay_for_corruption_;


This means we can do MaybeReviseStopReplayForCorruption()

rocksdb/db/db_impl/db_impl_open.cc

Lines 1518 to 1521 in d957e1a

// In point-in-time recovery mode, if sequence id of log files are
// consecutive, we continue recovery despite corruption. This could
// happen when we open and write to a corrupted DB, where sequence id
// will start from the last sequence id we recovered.

in a better way that is compatible with disabled WAL and with file ingestion when WAL is recycled (in next PR).

So we can fix the issue mentioned in #12918 and re-enable recycle_log_file_num (WAL recycling).

hx235 · 2024-12-19T06:19:44Z

db/log_format.h

@@ -17,7 +17,7 @@
 namespace ROCKSDB_NAMESPACE {
 namespace log {

-enum RecordType {
+enum RecordType : int {
  // Zero is reserved for preallocated files


Temporary note to reviewer: Please ignore the changes to db/log_format.h. They belong to #13225, which this PR is based on.

pdillinger

Looks good overall! A couple of issues and questions

pdillinger · 2024-12-20T22:14:39Z

db/db_wal_test.cc

+  ASSERT_OK(Put("key_ignore1", "wal_to_recycle"));
+  ASSERT_OK(Put("key_ignore2", "wal_to_recycle"));
+  FlushOptions fo;
+  fo.wait = true;


wait = true is the default. Should be able to just ASSERT_OK(Flush())

pdillinger · 2024-12-20T22:29:57Z

db/dbformat.h

+      : log_number_(0),
+        size_bytes_(0),
+        last_seqno_recorded_(0),
+        initialized_(false) {}


I have a slight preference for initializing the fields to obviously fake extreme values (UINT64_MAX) to indicate "not populated with real info," instead of an initialized_ field, because if those get used by accident, it's more obvious what the bug is. Just a personal preference.

Or if you want to keep the bool, it's probably worth asserting in some places that it's initialized.

Fixed by adding more assertions.
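The two alternatives discussed in this thread can be sketched side by side. This is a hedged illustration only: `WalInfoSentinel` and `WalInfoFlagged` are hypothetical types, not the class in db/dbformat.h, and the accessor names are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Alternative A: obviously fake extreme values mean "not populated".
struct WalInfoSentinel {
  static constexpr uint64_t kUnset = std::numeric_limits<uint64_t>::max();
  uint64_t log_number = kUnset;
  uint64_t size_bytes = kUnset;
  uint64_t last_seqno_recorded = kUnset;

  bool IsPopulated() const { return log_number != kUnset; }
};

// Alternative B: keep an explicit flag and assert on it in every accessor,
// so accidental use of unpopulated info fails fast in debug builds.
class WalInfoFlagged {
 public:
  WalInfoFlagged() = default;
  WalInfoFlagged(uint64_t log_number, uint64_t size_bytes,
                 uint64_t last_seqno_recorded)
      : log_number_(log_number),
        size_bytes_(size_bytes),
        last_seqno_recorded_(last_seqno_recorded),
        initialized_(true) {}

  bool IsInitialized() const { return initialized_; }

  uint64_t GetLogNumber() const {
    assert(initialized_);
    return log_number_;
  }
  uint64_t GetSizeBytes() const {
    assert(initialized_);
    return size_bytes_;
  }
  uint64_t GetLastSeqnoRecorded() const {
    assert(initialized_);
    return last_seqno_recorded_;
  }

 private:
  uint64_t log_number_ = 0;
  uint64_t size_bytes_ = 0;
  uint64_t last_seqno_recorded_ = 0;
  bool initialized_ = false;
};
```

With alternative A, a stray read of an unpopulated field yields a value that stands out in logs; with alternative B, the same mistake trips an assertion in debug builds but returns a plausible-looking 0 when assertions are compiled out.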

pdillinger · 2024-12-20T23:21:36Z

db/log_format.h

@@ -41,10 +41,14 @@ enum RecordType : uint8_t {
  // User-defined timestamp sizes
  kUserDefinedTimestampSizeType = 10,
  kRecyclableUserDefinedTimestampSizeType = 11,
+
+  // For WAL verification
+  kPredecessorWALInfoType = 129,


Let's try to set (and note in a comment) a precedent that for all the values >= 10, the 1 bit just indicates/toggles whether it's recyclable. For that we should use either 128, 129 or 130, 131 for these two enums, not 129, 130.

I think we could use that for some code simplification in the future.
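A small sketch of the suggested numbering convention, assuming the 130/131 pair is chosen. The enum below is a hypothetical stand-in for `RecordType`; the name `kRecyclePredecessorWALInfoType` and the helper functions are assumptions for illustration.

```cpp
#include <cstdint>

// Hypothetical subset of record types with values >= 10, where the lowest
// bit toggles whether the type is the recyclable variant.
enum SketchRecordType : uint8_t {
  kUserDefinedTimestampSizeType = 10,
  kRecyclableUserDefinedTimestampSizeType = 11,
  kPredecessorWALInfoType = 130,
  kRecyclePredecessorWALInfoType = 131,  // assumed name for the recyclable pair
};

// Only meaningful for values >= 10; smaller values do not follow the rule.
constexpr bool IsRecyclableSketch(uint8_t type) {
  return type >= 10 && (type & 1) != 0;
}

// Maps a type (>= 10) to its recyclable counterpart by setting the low bit.
constexpr uint8_t ToRecyclableSketch(uint8_t type) {
  return static_cast<uint8_t>(type | 1);
}

static_assert(!IsRecyclableSketch(kPredecessorWALInfoType),
              "even value is the non-recyclable variant");
static_assert(ToRecyclableSketch(kPredecessorWALInfoType) ==
                  kRecyclePredecessorWALInfoType,
              "setting the low bit yields the recyclable variant");
```

With the 129, 130 pair from the diff, the non-recyclable value 129 is odd, so a low-bit test like the one above would misclassify it; that is why an even/odd pair such as 128, 129 or 130, 131 is needed.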

pdillinger · 2024-12-20T23:35:17Z

db/log_reader.cc

@@ -329,6 +356,54 @@ bool Reader::ReadRecord(Slice* record, std::string* scratch,
  return false;
 }

+void Reader::MaybeVerifyPredecessorWALInfo(
+    WALRecoveryMode wal_recovery_mode, Slice fragment,
+    const PredecessorWALInfo& expected_predecessor_wal_info) {


"expected" is confusing to me, because both what is observed in DB recovery and what is saved in the special WAL entry are "expected" to match the other. Similarly, predecessor_wal_info_ is ambiguous with no comment.

How about observed_predecessor_wal_info_ and recorded_predecessor_wal_info?

pdillinger · 2024-12-20T23:37:49Z

db/log_reader.cc

+    if (expected_predecessor_log_number >= min_wal_number_to_keep_) {
+      std::string reason = "Missing WAL of log number " +
+                           std::to_string(expected_predecessor_log_number);
+      ReportCorruption(fragment.size(), reason.c_str());


Are all of these corruption cases covered by the unit tests?

Let me cover all of them, though the size difference is hard to cover without triggering other corruption. I will see what I can do about it.

pdillinger · 2024-12-20T23:44:20Z

include/rocksdb/options.h

@@ -632,6 +632,25 @@ struct DBOptions {
  // Default: false
  bool track_and_verify_wals_in_manifest = false;

+  // EXPERIMENTAL
+  //
+  // If true, various information about predecessor WAL will be recorded in the


Nit: better to avoid passive voice ("will be recorded in the current WAL" -> "each new WAL will record")

pdillinger · 2024-12-20T23:48:30Z

include/rocksdb/options.h

+  // stricter requirement on WAL than the DB went through `RepariDB()` can
+  // normally meet
+  // 2. There exists no WAL hole where new WAL data presents while some old WAL
+  // data is missing


Suggest a sentence about the DB manifest indicating which WALs are obsolete and can be missing with no data loss. To answer the obvious question of how it is known OK for a predecessor WAL to be missing.

facebook-github-bot · 2024-12-21T00:33:44Z

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-12-26T08:13:20Z

@hx235 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot added the CLA Signed label Dec 19, 2024

hx235 force-pushed the wal_hole branch from 3c9bedd to 13ceba5 Compare December 19, 2024 05:15

hx235 force-pushed the wal_hole branch from 13ceba5 to 417673c Compare December 19, 2024 06:15

hx235 marked this pull request as draft December 19, 2024 19:07

hx235 force-pushed the wal_hole branch 12 times, most recently from fd4c1a8 to 22d1c96 Compare December 20, 2024 10:19

hx235 marked this pull request as ready for review December 20, 2024 19:50

hx235 requested a review from pdillinger December 20, 2024 19:50

pdillinger requested changes Dec 20, 2024

Detect

940feb6

hx235 force-pushed the wal_hole branch from 22d1c96 to 940feb6 Compare December 26, 2024 08:13
