Operator Error Handling
Many types of errors can occur in a big data system such as Drill: missing files, network problems, out-of-memory, bad user-provided code, math errors, and on and on. Drill is a high-volume server so a failure within one query must not impact the server as a whole: a failed query must shut down gracefully, releasing all its resources, even if rather severe errors occurred.
Drill handles a variety of error conditions:
- Resource errors: out-of-memory, I/O exceptions and so on.
- Schema change exceptions: when code is not designed to handle schema changes.
- Invariant violations: when one part of Drill detects that some other part of Drill has malfunctioned (that is, a bug in the code).
- Cascade errors: errors that result when handling a primary error. For example, if a file write runs out of disk space, a close of that file will also fail (since data cannot be flushed).
While handling of these errors follows some general patterns, much variation exists.
Let's start the error discussion by exploring the types of error handling observed in Drill operator code.
- Checked exceptions: any Java exception that derives from `Exception` (but not from `RuntimeException`) and therefore requires a `throws` declaration on the method. Since the protocol `next()` method does not declare a `throws` clause, checked exceptions cannot propagate outside of each `next()` method. Checked exceptions are turned into one of the conditions discussed below.
- Unchecked exceptions: any Java exception that derives from `RuntimeException` (or from `Error`) and so does not require a `throws` declaration. Such exceptions can be thrown from `next()`.
- `UserException`s: Drill's primary unchecked exception, thrown in response to serious (but expected) error conditions such as out-of-memory, disk I/O errors and so on. Here "user exception" seems to mean "exception to be reported to the user" instead of "exception caused by the user." Some checked exceptions are converted to `UserException` and thrown up the call stack.
- Killing the fragment from within: some operators respond to error conditions by killing the fragment as explained below. It is not clear when a fragment should use this technique vs. the `UserException` approach.
- Killing the fragment from without: the Drillbit can kill the fragment thread in response to a variety of conditions. Here it is necessary to terminate a running thread as discussed below.
- Logging the error and continuing. Often occurs when the error is recoverable.
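The first two items can be made concrete with a minimal sketch. It assumes a simplified operator protocol whose `next()` declares no `throws` clause, so a checked `IOException` has to be caught and translated inside `next()`. All names here (`NextProtocolSketch`, `Operator`, `BatchSource`, `ScanOperator`, `Outcome`) are hypothetical stand-ins rather than Drill's actual classes, and the standard `UncheckedIOException` stands in for whatever unchecked exception an operator would really throw.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, simplified stand-ins; these are not Drill's actual classes.
public class NextProtocolSketch {

  enum Outcome { OK, NONE }

  /** Simplified operator protocol: next() declares no throws clause. */
  interface Operator {
    Outcome next();
  }

  /** Stand-in for a reader whose read can fail with a checked exception. */
  interface BatchSource {
    boolean readBatch() throws IOException;
  }

  static class ScanOperator implements Operator {
    private final BatchSource source;

    ScanOperator(BatchSource source) {
      this.source = source;
    }

    @Override
    public Outcome next() {
      try {
        return source.readBatch() ? Outcome.OK : Outcome.NONE;
      } catch (IOException e) {
        // The checked exception cannot escape next(), so it is translated
        // into an unchecked exception that keeps the original as its cause.
        throw new UncheckedIOException("Read failed in scan", e);
      }
    }
  }

  public static void main(String[] args) {
    Operator good = new ScanOperator(() -> true);
    System.out.println(good.next()); // prints "OK"

    Operator bad = new ScanOperator(() -> {
      throw new IOException("disk unplugged");
    });
    try {
      bad.next();
    } catch (UncheckedIOException e) {
      System.out.println("caught: " + e.getCause().getMessage());
    }
  }
}
```

The caller of `next()` needs no `throws` clause of its own, yet the original `IOException` remains available through `getCause()` for logging or reporting.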
Because `next()` declares no `throws` clause, the operator must handle all (checked) errors within the bounds of the `next()` method. This seems like a nice, clean protocol, but it raises the question: how should the operator handle such errors if they can't be thrown upward? As indicated above, there are three choices.
First, the operator often uses the internal kill protocol as explained below. Second, the operator may translate the exception into an unchecked exception such as `UserException` or `RuntimeException`. Third, the operator can elect to ignore the exception. From the external sort:
} catch (UnsupportedOperationException e) {
  throw new RuntimeException(e);
}
...
} catch (Throwable e) {
  throw UserException. ... .build(logger);
...
} catch (UnsupportedOperationException e) {
  estimatedRecordSize += 50;
}
...
} catch (IOException e) {
  // since this is meant to be used in a batches's spilling, we don't propagate the exception
  logger.warn("Unable to mark spill directory " + currSpillPath + " for deleting on exit", e);
}
Clearly, the ignore strategy should be used sparingly. One place where the ignore strategy must be used is when errors occur while releasing resources:
try {
  AutoCloseables.close(e, newGroup);
} catch (Throwable t) { /* close() may hit the same IO issue; just ignore */ }
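One way to keep a cascade error from masking the primary error, without discarding it entirely, is to attach the secondary failure to the primary one as a suppressed exception. The sketch below uses only standard Java (`Throwable.addSuppressed`). The class `CascadeCloseSketch` and the helper `closeQuietly` are hypothetical names for illustration, and nothing here is meant to describe how Drill's `AutoCloseables` is implemented.

```java
import java.io.IOException;

// Hypothetical illustration of cascade-error handling during cleanup.
public class CascadeCloseSketch {

  /**
   * Closes the resource. If close() itself fails, the failure is recorded
   * on the primary error as a suppressed exception instead of being thrown.
   */
  static void closeQuietly(Throwable primary, AutoCloseable resource) {
    try {
      resource.close();
    } catch (Exception secondary) {
      primary.addSuppressed(secondary);
    }
  }

  public static void main(String[] args) {
    // Primary error: the write ran out of disk space.
    IOException primary = new IOException("No space left on device");

    // Cascade error: close() fails because the data cannot be flushed.
    AutoCloseable spillFile = () -> {
      throw new IOException("Flush failed on close");
    };

    closeQuietly(primary, spillFile);

    System.out.println(primary.getMessage());                    // primary error is intact
    System.out.println(primary.getSuppressed().length);          // prints 1
    System.out.println(primary.getSuppressed()[0].getMessage()); // the cascade error
  }
}
```

The primary exception can then be thrown or reported as usual, and the cascade error still shows up in its stack trace under "Suppressed:".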
The operator code appears to be inconsistent in how it handles unchecked exceptions. In theory, the primary protocol would be to throw the exception up the stack, allowing the fragment executor to terminate the query and clean up resources. In practice, however, the code uses a variety of exceptions and translates between them:
- `UserException`: Appears to be the primary means of communicating failures to users: it has quite a bit of functionality to log errors, format user messages and so on.
- `RuntimeException`: Java's root unchecked exception, which seems to be thrown as a last resort when "something is wrong."
- `OutOfMemoryException`: Thrown when Drill exhausts its direct memory. (This exception is separate from Java's own `OutOfMemoryError`.)