Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pvf thiserror #2958

Merged
merged 8 commits into from
Jan 19, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 21 additions & 51 deletions polkadot/node/core/pvf/common/src/error.rs
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,6 @@

use crate::prepare::{PrepareSuccess, PrepareWorkerSuccess};
use parity_scale_codec::{Decode, Encode};
use std::fmt;

/// Result of PVF preparation from a worker, with checksum of the compiled PVF and stats of the
/// preparation if successful.
Expand All @@ -32,35 +31,43 @@ pub type PrecheckResult = Result<(), PrepareError>;

/// An error that occurred during the prepare part of the PVF pipeline.
// Codec indexes are intended to stabilize pre-encoded payloads (see `OOM_PAYLOAD`)
#[derive(Debug, Clone, Encode, Decode)]
#[derive(thiserror::Error, Debug, Clone, Encode, Decode)]
pub enum PrepareError {
/// During the prevalidation stage of preparation an issue was found with the PVF.
#[codec(index = 0)]
#[error("prevalidation: {0}")]
Prevalidation(String),
/// Compilation failed for the given PVF.
#[codec(index = 1)]
#[error("preparation: {0}")]
Preparation(String),
/// Instantiation of the WASM module instance failed.
#[codec(index = 2)]
#[error("runtime construction: {0}")]
RuntimeConstruction(String),
/// An unexpected error has occurred in the preparation job.
#[codec(index = 3)]
#[error("panic: {0}")]
JobError(String),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[error("panic: {0}")]
JobError(String),
#[error("prepare job error: {0}")]
JobError(String),

(That's my bad, I forgot to update this line after renaming Panic -> JobError.

Copy link
Contributor Author

@maksimryndin maksimryndin Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure I will update:) I've noticed it but from the book description it was quite logical

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency we could also make a prepare: prefix for other messages in PrepareError if it is allowed

Copy link
Contributor Author

@maksimryndin maksimryndin Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in whether we can now better log some of these errors that didn't implement the Display trait before. We should be able to log them with {} instead of {:?}. I'll look through the codebase later for such instances.

It is a good point! Thank you! I will check these instances within the PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrcnski I've checked the usage of {:?} within pvf/ and changed several ones to {}. And also changed PrepareError messages a little bit for consistency.

/// Failed to prepare the PVF due to the time limit.
#[codec(index = 4)]
#[error("prepare: timeout")]
TimedOut,
/// An IO error occurred. This state is reported by either the validation host or by the
/// worker.
#[codec(index = 5)]
#[error("prepare: io error while receiving response: {0}")]
IoErr(String),
/// The temporary file for the artifact could not be created at the given cache path. This
/// state is reported by the validation host (not by the worker).
#[codec(index = 6)]
#[error("prepare: error creating tmp file: {0}")]
CreateTmpFile(String),
/// The response from the worker is received, but the file cannot be renamed (moved) to the
/// final destination location. This state is reported by the validation host (not by the
/// worker).
#[codec(index = 7)]
#[error("prepare: error renaming tmp file ({src:?} -> {dest:?}): {err}")]
RenameTmpFile {
err: String,
// Unfortunately `PathBuf` doesn't implement `Encode`/`Decode`, so we do a fallible
Expand All @@ -70,17 +77,21 @@ pub enum PrepareError {
},
/// Memory limit reached
#[codec(index = 8)]
#[error("prepare: out of memory")]
OutOfMemory,
/// The response from the worker is received, but the worker cache could not be cleared. The
/// worker has to be killed to avoid jobs having access to data from other jobs. This state is
/// reported by the validation host (not by the worker).
#[codec(index = 9)]
#[error("prepare: error clearing worker cache: {0}")]
ClearWorkerDir(String),
/// The preparation job process died, due to OOM, a seccomp violation, or some other factor.
JobDied { err: String, job_pid: i32 },
#[codec(index = 10)]
#[error("prepare: prepare job with pid {job_pid} died: {err}")]
JobDied { err: String, job_pid: i32 },
/// Some error occurred when interfacing with the kernel.
#[codec(index = 11)]
#[error("prepare: error interfacing with the kernel: {0}")]
Kernel(String),
}

Expand Down Expand Up @@ -109,74 +120,33 @@ impl PrepareError {
}
}

impl fmt::Display for PrepareError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
use PrepareError::*;
match self {
Prevalidation(err) => write!(f, "prevalidation: {}", err),
Preparation(err) => write!(f, "preparation: {}", err),
RuntimeConstruction(err) => write!(f, "runtime construction: {}", err),
JobError(err) => write!(f, "panic: {}", err),
TimedOut => write!(f, "prepare: timeout"),
IoErr(err) => write!(f, "prepare: io error while receiving response: {}", err),
JobDied { err, job_pid } =>
write!(f, "prepare: prepare job with pid {job_pid} died: {err}"),
CreateTmpFile(err) => write!(f, "prepare: error creating tmp file: {}", err),
RenameTmpFile { err, src, dest } =>
write!(f, "prepare: error renaming tmp file ({:?} -> {:?}): {}", src, dest, err),
OutOfMemory => write!(f, "prepare: out of memory"),
ClearWorkerDir(err) => write!(f, "prepare: error clearing worker cache: {}", err),
Kernel(err) => write!(f, "prepare: error interfacing with the kernel: {}", err),
}
}
}

/// Some internal error occurred.
///
/// Should only ever be used for validation errors independent of the candidate and PVF, or for
/// errors we ruled out during pre-checking (so preparation errors are fine).
#[derive(Debug, Clone, Encode, Decode)]
#[derive(thiserror::Error, Debug, Clone, Encode, Decode)]
pub enum InternalValidationError {
/// Some communication error occurred with the host.
#[error("validation: some communication error occurred with the host: {0}")]
HostCommunication(String),
/// Host could not create a hard link to the artifact path.
#[error("validation: host could not create a hard link to the artifact path: {0}")]
CouldNotCreateLink(String),
/// Could not find or open compiled artifact file.
#[error("validation: could not find or open compiled artifact file: {0}")]
CouldNotOpenFile(String),
/// Host could not clear the worker cache after a job.
#[error("validation: host could not clear the worker cache ({path:?}) after a job: {err}")]
CouldNotClearWorkerDir {
err: String,
// Unfortunately `PathBuf` doesn't implement `Encode`/`Decode`, so we do a fallible
// conversion to `Option<String>`.
path: Option<String>,
},
/// Some error occurred when interfacing with the kernel.
#[error("validation: error interfacing with the kernel: {0}")]
Kernel(String),

/// Some non-deterministic preparation error occurred.
#[error("validation: prepare: {0}")]
NonDeterministicPrepareError(PrepareError),
}

impl fmt::Display for InternalValidationError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
use InternalValidationError::*;
match self {
HostCommunication(err) =>
write!(f, "validation: some communication error occurred with the host: {}", err),
CouldNotCreateLink(err) => write!(
f,
"validation: host could not create a hard link to the artifact path: {}",
err
),
CouldNotOpenFile(err) =>
write!(f, "validation: could not find or open compiled artifact file: {}", err),
CouldNotClearWorkerDir { err, path } => write!(
f,
"validation: host could not clear the worker cache ({:?}) after a job: {}",
path, err
),
Kernel(err) => write!(f, "validation: error interfacing with the kernel: {}", err),
NonDeterministicPrepareError(err) => write!(f, "validation: prepare: {}", err),
}
}
}
27 changes: 15 additions & 12 deletions polkadot/node/core/pvf/src/error.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
use polkadot_node_core_pvf_common::error::{InternalValidationError, PrepareError};

/// A error raised during validation of the candidate.
#[derive(Debug, Clone)]
#[derive(thiserror::Error, Debug, Clone)]
pub enum ValidationError {
/// Deterministic preparation issue. In practice, most of the problems should be caught by
/// prechecking, so this may be a sign of internal conditions.
Expand All @@ -27,35 +27,42 @@ pub enum ValidationError {
/// pre-checking enabled only valid runtimes should ever get enacted, so we can be
/// reasonably sure that this is some local problem on the current node. However, as this
/// particular error *seems* to indicate a deterministic error, we raise a warning.
#[error(transparent)]
Preparation(PrepareError),
/// The error was raised because the candidate is invalid. Should vote against.
Invalid(InvalidCandidate),
#[error(transparent)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have specific error messages here instead of the fall-through transparent?

Copy link
Contributor Author

@maksimryndin maksimryndin Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me explain the motivation:)

  1. InternalValidationError already has validation: prefix in its messages
  2. ValidationError has 1-1 match between enum variants and wrapped sub errors
  3. As a novice to the codebase, I am not sure how much freedom I have over the error messages

So I decided to stick with the safest solution in my opinion. But of course, if you think we should expand messages on wrapper, I will do :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points! If you want though, you can change this:

  • For preparation, add a message like preparation error for validation: {0}
  • For internal error, maybe change the message to internal validation error: {0}

As a novice to the codebase, I am not sure how much freedom I have over the error messages

Appreciate the humility and respect for the codebase. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not sure if the change is necessary TBH so I'll leave it to your judgment. I think it's fine to err on the side of being a bit too verbose: error messages are rare and the more info they contain, the better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree! I've decided to use candidate validation: prefix for ValidationError. Thus it looks consistent

Invalid(#[from] InvalidCandidate),
/// Possibly transient issue that may resolve after retries. Should vote against when retries
/// fail.
PossiblyInvalid(PossiblyInvalidError),
#[error(transparent)]
PossiblyInvalid(#[from] PossiblyInvalidError),
/// Preparation or execution issue caused by an internal condition. Should not vote against.
Internal(InternalValidationError),
#[error(transparent)]
Internal(#[from] InternalValidationError),
}

/// A description of an error raised during executing a PVF and can be attributed to the combination
/// of the candidate [`polkadot_parachain_primitives::primitives::ValidationParams`] and the PVF.
#[derive(Debug, Clone)]
#[derive(thiserror::Error, Debug, Clone)]
pub enum InvalidCandidate {
/// The candidate is reported to be invalid by the execution worker. The string contains the
/// error message.
#[error("invalid: worker reported: {0}")]
WorkerReportedInvalid(String),
/// PVF execution (compilation is not included) took more time than was allotted.
#[error("invalid: hard timeout")]
HardTimeout,
}

/// Possibly transient issue that may resolve after retries.
#[derive(Debug, Clone)]
#[derive(thiserror::Error, Debug, Clone)]
pub enum PossiblyInvalidError {
/// The worker process (not the job) has died during validation of a candidate.
///
/// It's unlikely that this is caused by malicious code since workers spawn separate job
/// processes, and those job processes are sandboxed. But, it is possible. We retry in this
/// case, and if the error persists, we assume it's caused by the candidate and vote against.
#[error("possibly invalid: ambiguous worker death")]
AmbiguousWorkerDeath,
/// The job process (not the worker) has died for one of the following reasons:
///
Expand All @@ -69,22 +76,18 @@ pub enum PossiblyInvalidError {
/// (c) Some other reason, perhaps transient or perhaps caused by malicious code.
///
/// We cannot treat this as an internal error because malicious code may have caused this.
#[error("possibly invalid: ambiguous job death: {0}")]
AmbiguousJobDeath(String),
/// An unexpected error occurred in the job process and we can't be sure whether the candidate
/// is really invalid or some internal glitch occurred. Whenever we are unsure, we can never
/// treat an error as internal as we would abstain from voting. This is bad because if the
/// issue was due to the candidate, then all validators would abstain, stalling finality on the
/// chain. So we will first retry the candidate, and if the issue persists we are forced to
/// vote invalid.
#[error("possibly invalid: job error: {0}")]
JobError(String),
}

impl From<InternalValidationError> for ValidationError {
fn from(error: InternalValidationError) -> Self {
Self::Internal(error)
}
}

impl From<PrepareError> for ValidationError {
fn from(error: PrepareError) -> Self {
// Here we need to classify the errors into two errors: deterministic and non-deterministic.
Expand Down
3 changes: 2 additions & 1 deletion polkadot/node/core/pvf/src/host.rs
Original file line number Diff line number Diff line change
Expand Up @@ -486,7 +486,8 @@ async fn handle_precheck_pvf(
///
/// If the prepare job failed previously, we may retry it under certain conditions.
///
/// When preparing for execution, we use a more lenient timeout ([`LENIENT_PREPARATION_TIMEOUT`])
/// When preparing for execution, we use a more lenient timeout
/// ([`DEFAULT_LENIENT_PREPARATION_TIMEOUT`](polkadot_primitives::executor_params::DEFAULT_LENIENT_PREPARATION_TIMEOUT))
/// than when prechecking.
async fn handle_execute_pvf(
artifacts: &mut Artifacts,
Expand Down
4 changes: 2 additions & 2 deletions polkadot/node/core/pvf/src/prepare/queue.rs
Original file line number Diff line number Diff line change
Expand Up @@ -45,8 +45,8 @@ pub struct FromQueue {
/// Identifier of an artifact.
pub(crate) artifact_id: ArtifactId,
/// Outcome of the PVF processing. [`Ok`] indicates that compiled artifact
/// is successfully stored on disk. Otherwise, an [error](crate::error::PrepareError)
/// is supplied.
/// is successfully stored on disk. Otherwise, an
/// [error](polkadot_node_core_pvf_common::error::PrepareError) is supplied.
pub(crate) result: PrepareResult,
}

Expand Down
Loading