Return error instead of panicking if rewriting fails #343
Signed-off-by: v01dstar <[email protected]>
Most, if not all, panics in this codebase are necessary. An fsync error on certain filesystems could cause all buffered writes to be lost, and we (currently) have no way to check whether those buffered writes belong to the current write request or to older ones (#131).
I thought the is_no_space_err has already fixed this issue? @LykxSassinator
Nope, I think I missed some work. See raft-engine/tests/failpoints/test_engine.rs, line 1169 in 385182b.
IMO, maybe when error ==
The issue seems to indicate that
FYI, the panic may also happen in the
/cc @v01dstar
The case we are concerned with is this: after the first fsync fails and clears the buffer, the second fsync returns success, producing the false impression that what wasn't flushed by the first fsync was persisted by the second one. As long as you don't bubble up an fsync error, things should be fine. The case @LykxSassinator mentioned is pwrite failing first, before fsync. I think it probably makes more sense to push the panic down, close to the fsync call.
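A minimal sketch of the idea of pushing the panic down to the sync call while still returning write errors. The `Handle` trait, `Writer` struct, and `FlakyHandle` test double are hypothetical stand-ins, not raft-engine's actual types.

```rust
use std::io;

/// Hypothetical stand-in for a log file handle; not raft-engine's real trait.
trait Handle {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize>;
    fn sync(&mut self) -> io::Result<()>;
}

struct Writer<H: Handle> {
    handle: H,
}

impl<H: Handle> Writer<H> {
    /// A failed write is reported to the caller, who may retry or give up.
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.handle.write(buf)
    }

    /// A failed sync panics right here, so no caller can retry it and
    /// mistake a later successful sync for proof that earlier data is durable.
    fn sync(&mut self) {
        if let Err(e) = self.handle.sync() {
            panic!("sync failed, buffered writes may be lost: {e}");
        }
    }
}

/// Test double (an assumption for illustration) whose operations fail on demand.
struct FlakyHandle {
    fail_write: bool,
    fail_sync: bool,
}

impl Handle for FlakyHandle {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.fail_write {
            Err(io::Error::new(io::ErrorKind::Other, "injected write failure"))
        } else {
            Ok(buf.len())
        }
    }

    fn sync(&mut self) -> io::Result<()> {
        if self.fail_sync {
            Err(io::Error::new(io::ErrorKind::Other, "injected sync failure"))
        } else {
            Ok(())
        }
    }
}
```

With this shape, `sync` has no `Result` for callers to propagate, so a sync failure cannot be bubbled up and retried by accident.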
Are the above two functions called in any non-test code path? If not, I don't think we need to propagate the error up.
This is already covered, isn't it?
Co-authored-by: lucasliang <[email protected]> Signed-off-by: v01dstar <[email protected]>
Force-pushed from cbb8a6b to 6fcb077.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
    @@           Coverage Diff           @@
    ##           master     #343   +/-   ##
    =======================================
      Coverage   98.21%   98.21%
    =======================================
      Files          33       33
      Lines       12446    12457   +11
    =======================================
    + Hits        12224    12235   +11
      Misses        222      222
We need to at least try to do this. AFAIK there's no guarantee that pwrite will fail before fsync when the disk is full, especially when the raft-engine disk is shared with other parties.
IMO, specify whether the returned error from

    // src/env/log_fd/unix.rs
    #[inline]
    fn sync(&self) -> IoResult<()> {
        fail_point!("log_fd::sync::err", |_| {
            Err(from_nix_error(nix::Error::EINVAL, "fp"))
        });
        #[cfg(target_os = "linux")]
        {
            nix::unistd::fdatasync(self.0).map_err(|e| match e {
                Errno::ENOSPC => from_nix_error(e, "nospace"),
                _ => from_nix_error(e, "fdatasync"),
            })
        }
        #[cfg(not(target_os = "linux"))]
        {
            nix::unistd::fsync(self.0).map_err(|e| match e {
                Errno::ENOSPC => from_nix_error(e, "nospace"),
                _ => from_nix_error(e, "fsync"),
            })
        }
    }

/cc @v01dstar
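For illustration only, a sketch of how a caller could tell a no-space error apart from other I/O errors using the standard library alone. The snippet above tags errors with a string via `from_nix_error`; this sketch instead checks the raw OS error code. The helper's signature is hypothetical (only the name `is_no_space_err` appears in this thread), and the value 28 for `ENOSPC` is an assumption that holds on Linux and macOS.

```rust
use std::io;

/// ENOSPC as defined on Linux and macOS; an assumption made for this sketch.
const ENOSPC: i32 = 28;

/// Hypothetical helper: true if the OS reported "no space left on device".
/// Errors that carry no OS error code are never classified as no-space.
fn is_no_space_err(e: &io::Error) -> bool {
    e.raw_os_error() == Some(ENOSPC)
}
```

Whether a no-space error from a sync call is safe to return instead of panicking is exactly what this thread debates; the helper only classifies the error and does not settle that question.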
Why?
Returning the
Please refer to this line. Panics are here to avoid silent data loss.
Signed-off-by: Yang Zhang <[email protected]>
PTAL. I moved the panic inside
Rest LGTM
src/file_pipe_log/pipe.rs
Outdated
    @@ -272,7 +270,7 @@ impl<F: FileSystem> SinglePipe<F> {
             };
             // File header must be persisted. This way we can recover gracefully if power
             // loss before a new entry is written.
    -        new_file.writer.sync()?;
    +        new_file.writer.sync();
             self.sync_dir(path_id)?;
This error needs to be handled carefully now (e.g. remove the newly created file and make sure the old writer is okay to write again). Better to just unwrap it as well.
build_file_writer above is the same.
Made sync_dir panic if it fails.
But build_file_writer should be fine, right? It is the type of panic this PR is trying to avoid (this can be confirmed by test_no_space_write_error). If it fails, the new file won't be used for writing and will be recycled the next time rotate_impl is called. So it already meets your expectation?
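The recovery argument above can be sketched with a toy model: if building the new file fails, the active file stays unchanged, and the leftover file is reused on the next rotation. `Pipe`, `build_file`, and `rotate` here are hypothetical simplifications, not the real `SinglePipe`, `build_file_writer`, or `rotate_impl`.

```rust
use std::io;

/// Hypothetical, heavily simplified model of a log pipe; not raft-engine's real types.
struct Pipe {
    /// Sequence number of the file currently accepting writes.
    active_seq: u64,
    /// Sequence numbers of files that exist on the simulated disk.
    files: Vec<u64>,
    /// When true, building the next file fails, as if the disk were full.
    disk_full: bool,
}

impl Pipe {
    fn new() -> Pipe {
        Pipe { active_seq: 1, files: vec![1], disk_full: false }
    }

    /// Creates the file for `seq`, reusing a leftover from an earlier failed attempt.
    fn build_file(&mut self, seq: u64) -> io::Result<()> {
        if !self.files.contains(&seq) {
            self.files.push(seq);
        }
        if self.disk_full {
            return Err(io::Error::new(io::ErrorKind::Other, "no space left"));
        }
        Ok(())
    }

    /// Switches to a new file only after it was built successfully.
    fn rotate(&mut self) -> io::Result<()> {
        let new_seq = self.active_seq + 1;
        self.build_file(new_seq)?; // on error, `active_seq` is left untouched
        self.active_seq = new_seq;
        Ok(())
    }
}
```

The key design point in this sketch is ordering: state that callers depend on (`active_seq`) is only updated after every fallible step has succeeded, so a returned error leaves the old writer usable.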
Probably. I suggest adding a few restarts in test_file_rotate_error.
Added a few more verifications in the test_file_rotate_error test. Does that address your concern? PTAL
LGTM
src/file_pipe_log/pipe.rs
Outdated
    -        self.sync_dir(path_id)?;
    +        // Panic if sync calls fail, keep consistent with the behavior of
    +        // `LogFileWriter::sync()`.
    +        self.sync_dir(path_id).unwrap();
panic inside sync_dir as well.
src/file_pipe_log/pipe.rs
Outdated
    @@ -248,7 +248,7 @@ impl<F: FileSystem> SinglePipe<F> {
             let new_seq = writable_file.seq + 1;
             debug_assert!(new_seq > DEFAULT_FIRST_FILE_SEQ);

    -        writable_file.writer.close()?;
    +        writable_file.writer.close().unwrap();
No need to unwrap now.
    @@ -67,7 +67,7 @@ impl<F: FileSystem> LogFileWriter<F> {
         }
Add a comment to this struct stating it should be fail-safe, i.e. user can still use the writer without breaking data consistency if any operation has failed.
src/filter.rs
Outdated
    @@ -333,7 +333,7 @@ impl RhaiFilterMachine {
                 )?;
                 log_batch.drain();
             }
    -        writer.close()?;
    +        writer.close().unwrap();
ditto
src/purge.rs
Outdated
    @@ -273,7 +273,7 @@ where
         // Rewrites the entire rewrite queue into new log files.
         fn rewrite_rewrite_queue(&self) -> Result<Vec<u64>> {
             let _t = StopWatch::new(&*ENGINE_REWRITE_REWRITE_DURATION_HISTOGRAM);
    -        self.pipe_log.rotate(LogQueue::Rewrite)?;
    +        self.pipe_log.rotate(LogQueue::Rewrite).unwrap();
why unwrap this?
    @@ -165,20 +165,24 @@ fn test_file_rotate_error() {
         {
Make two versions of this test:

    fn test_file_rotate_error(restart: bool)
    // case 1
    if restart {
        let engine = Engine::open_with_file_system(cfg.clone(), fs.clone()).unwrap();
    }
    // case 2
    // ...
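One way to read this suggestion is a shared test body that takes a `restart` flag and is invoked once per variant. The sketch below uses toy types: `Disk` and `Engine` are hypothetical stand-ins in which `Disk` survives a restart and `Engine` does not. It does not use raft-engine's API.

```rust
use std::sync::{Arc, Mutex};

/// Toy stand-in for persistent storage (hypothetical); shared across engine instances.
type Disk = Arc<Mutex<Vec<u64>>>;

/// Toy stand-in for the engine (hypothetical); holds no state of its own.
struct Engine {
    disk: Disk,
}

impl Engine {
    fn open(disk: Disk) -> Engine {
        Engine { disk }
    }

    fn write(&self, value: u64) {
        self.disk.lock().unwrap().push(value);
    }

    fn read_all(&self) -> Vec<u64> {
        self.disk.lock().unwrap().clone()
    }
}

/// Shared test body; `restart` decides whether the engine is reopened between steps.
fn run_case(restart: bool) -> Vec<u64> {
    let disk: Disk = Arc::new(Mutex::new(Vec::new()));
    let engine = Engine::open(disk.clone());
    engine.write(1);
    // Optionally simulate a restart: drop the old instance and reopen on the same disk.
    let engine = if restart {
        drop(engine);
        Engine::open(disk.clone())
    } else {
        engine
    };
    engine.write(2);
    engine.read_all()
}
```

Both variants should observe the same data, which is the property a restart check is meant to verify. In a real test file each variant would typically be its own `#[test]` function calling the shared body.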
Rest LG
tests/failpoints/test_io_error.rs
Outdated
    -    let engine = Engine::open_with_file_system(cfg.clone(), fs.clone()).unwrap();
    -    engine
    +    let mut engine = Some(Engine::open_with_file_system(cfg.clone(), fs.clone()).unwrap());
    +    let mut engine_ref = engine.as_ref().unwrap();
No need, you can re-assign a variable after it's moved, e.g. drop(engine); engine = Engine::new();
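A small self-contained illustration of that point: a `mut` binding that has been moved out of can simply be assigned again, with no `Option` wrapper. The `Engine` type and its `generation` field are hypothetical.

```rust
/// Hypothetical stand-in for the engine; `generation` only identifies the instance.
struct Engine {
    generation: u32,
}

impl Engine {
    fn new(generation: u32) -> Engine {
        Engine { generation }
    }
}

fn reopen_demo() -> u32 {
    let mut engine = Engine::new(1);
    // `drop` takes ownership, so `engine` is moved-from after this call.
    drop(engine);
    // Assigning to a moved-from `mut` binding is allowed and makes it usable again.
    engine = Engine::new(2);
    engine.generation
}
```

Reading `engine` between the `drop` and the re-assignment would be a compile error, so the borrow checker still guards against using the closed instance.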
/test
/cc @Connor1996 Can you help merge this PR? Thanks.
Ref #tikv/15755
Return the error instead of panicking if the error won't cause inconsistency while rotating log files.
Ref #131: this PR modifies decision 3 made in that issue. After this PR:
- create no longer panics.
- truncate will be retryable while appending but non-retryable while closing.

Also updated Cargo.toml to remove the TODO.