
Operator Error Handling

Paul Rogers edited this page Dec 1, 2016 · 3 revisions

Many types of errors can occur in a big data system such as Drill: missing files, network problems, out-of-memory conditions, bad user-provided code, math errors, and so on. Drill is a high-volume server, so a failure within one query must not affect the server as a whole: a failed query must shut down gracefully and release all its resources, even after a severe error.

Types of Errors

Drill handles a variety of error conditions:

  • Resource errors: out-of-memory, I/O exceptions and so on.
  • Schema change exceptions: when code is not designed to handle schema changes.
  • Invariant violations: when one part of Drill detects that some other part of Drill has malfunctioned (that is, a bug in the code).
  • Cascade errors: errors that result from handling a primary error. For example, if a file write runs out of disk space, closing that file will also fail (since buffered data cannot be flushed).

While handling of these errors follows some general patterns, much variation exists.

Forms of Error Handling

Let's start the error discussion by exploring the types of error handling observed in Drill operator code.

  • Checked exceptions: any Java exception that derives from Exception (but not from RuntimeException) and so must be declared in a method's throws clause. Since the next() method of the operator protocol declares no throws clause, checked exceptions cannot propagate out of next(). Instead, they are turned into one of the conditions discussed below.
  • Unchecked exceptions: any Java exception that derives from RuntimeException (or from Error) and so does not require a throws declaration. Such exceptions can be thrown from next().
  • UserExceptions: Drill's primary unchecked exception, thrown in response to serious (but expected) error conditions such as out-of-memory, disk I/O errors and so on. Here "user exception" seems to mean "exception to be reported to the user" rather than "exception caused by the user." Some checked exceptions are converted to UserException and thrown up the call stack.
  • Killing the Fragment from within: some operators respond to error conditions by killing the fragment as explained below. It is not clear when a fragment should use this technique vs. the UserException approach.
  • Killing the Fragment from without: The Drill bit can kill the fragment thread in response to a variety of conditions. Here it is necessary to terminate a running thread as discussed below.
  • Logging the error and continuing: typically used when the error is recoverable.
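
The first two forms interact as follows: a checked exception raised inside next() has to be caught there and, if it is to travel up the stack, rethrown as an unchecked one. The sketch below illustrates only that Java mechanism. The Operator interface, QueryFailedException and FailingReader are hypothetical stand-ins invented for this example, not Drill's classes.

```java
import java.io.IOException;

// Hypothetical stand-ins for illustration only; these are not Drill's classes.
class CheckedToUncheckedSketch {

  /** Simplified operator protocol: next() declares no throws clause. */
  interface Operator {
    boolean next();
  }

  /** Hypothetical unchecked wrapper playing the role of a "report to the user" exception. */
  static class QueryFailedException extends RuntimeException {
    QueryFailedException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** An operator whose underlying read always fails with a checked exception. */
  static class FailingReader implements Operator {
    private void readBatch() throws IOException {
      throw new IOException("simulated disk error");
    }

    @Override
    public boolean next() {
      try {
        readBatch();
        return true;
      } catch (IOException e) {
        // The checked exception cannot escape next(), so wrap it and keep the cause.
        throw new QueryFailedException("read failed", e);
      }
    }
  }

  /** Runs one next() call and reports what the caller observed. */
  static String runOnce(Operator op) {
    try {
      return "ok: " + op.next();
    } catch (QueryFailedException e) {
      return e.getMessage() + " <- " + e.getCause().getMessage();
    }
  }

  public static void main(String[] args) {
    // prints "read failed <- simulated disk error"
    System.out.println(runOnce(new FailingReader()));
  }
}
```

Keeping the original exception as the cause preserves the stack trace of the underlying failure for whoever eventually logs or reports the error.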

Checked Error Handling

Because next() declares no throws clause, the operator must handle all (checked) errors within the bounds of the next() method. This seems like a nice, clean protocol, but it raises the question: how should the operator handle such errors if they can't be thrown upward? As indicated above, there are three choices.

First, the operator often uses the internal kill protocol as explained below. Second, the operator may translate the exception into an unchecked exception such as UserException or RuntimeException. Third, the operator can elect to ignore the exception. From the external sort:

    } catch (UnsupportedOperationException e) {
      throw new RuntimeException(e);
    }
...
    } catch (Throwable e) {
      throw UserException. ... .build(logger);
...
          } catch (UnsupportedOperationException e) {
            estimatedRecordSize += 50;
          }
...
    } catch (IOException e) {
        // since this is meant to be used in a batches's spilling, we don't propagate the exception
        logger.warn("Unable to mark spill directory " + currSpillPath + " for deleting on exit", e);
    }

Clearly, the ignore strategy should be used sparingly. One place where it is unavoidable is when a further error occurs while releasing resources after a primary error:

      try {
        AutoCloseables.close(e, newGroup);
      } catch (Throwable t) { /* close() may hit the same IO issue; just ignore */ }
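
As a variation on the excerpt above, the sketch below keeps the primary error as the one that is reported, but attaches the failure from close() to it using Java's Throwable.addSuppressed() instead of discarding it. This is an illustration of the cascade-error idea, not what the external sort does. FullDiskFile and writeAndClose() are hypothetical names made up for the example.

```java
import java.io.IOException;

// Hypothetical example; not taken from Drill.
class CascadeCloseSketch {

  /** Hypothetical resource whose close() fails for the same reason the write failed. */
  static class FullDiskFile implements AutoCloseable {
    void write() throws IOException {
      throw new IOException("disk full on write");
    }

    @Override
    public void close() throws IOException {
      throw new IOException("disk full on flush");
    }
  }

  /** Returns the first error seen, with any later close() failure attached as suppressed. */
  static Throwable writeAndClose(FullDiskFile file) {
    Throwable primary = null;
    try {
      file.write();
    } catch (IOException e) {
      primary = e;
    }
    try {
      file.close();
    } catch (Throwable cascade) {
      if (primary == null) {
        primary = cascade;              // close() itself was the first failure
      } else {
        primary.addSuppressed(cascade); // cascade error: recorded, but it does not mask the primary
      }
    }
    return primary;
  }

  public static void main(String[] args) {
    Throwable error = writeAndClose(new FullDiskFile());
    System.out.println(error.getMessage());                    // disk full on write
    System.out.println(error.getSuppressed()[0].getMessage()); // disk full on flush
  }
}
```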

Unchecked Error Handling

The operator code appears to be inconsistent in how it handles unchecked exceptions. In theory, the primary protocol would be to throw the exception up the stack, allowing the fragment executor to terminate the query and clean up resources. In practice, the code uses a variety of exceptions and translates between them:

  • UserException: Appears to be the primary means to communicate failures to users: it has quite a bit of functionality to log errors, format user messages and so on.
  • RuntimeException: Java's root unchecked exception, which seems to be thrown as a last resort when "something is wrong."
  • OutOfMemoryException: Thrown when Drill exhausts its direct memory. (This exception is separate from Java's own OutOfMemoryError.)
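
The "throw it up the stack" protocol described above can be sketched as a driver loop that catches whatever unchecked exception next() throws and always closes the operator afterwards. This is a minimal sketch under stated assumptions: the Operator interface, FailingOperator and runFragment() are hypothetical and far simpler than Drill's actual fragment executor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and structure are assumptions, not Drill's implementation.
class FragmentCleanupSketch {

  /** Simplified operator: next() may throw unchecked exceptions, close() releases resources. */
  interface Operator extends AutoCloseable {
    boolean next();

    @Override
    void close();
  }

  /** Produces two batches, then fails on the third call to next(). */
  static class FailingOperator implements Operator {
    private final List<String> log;
    private int calls = 0;

    FailingOperator(List<String> log) {
      this.log = log;
    }

    @Override
    public boolean next() {
      calls++;
      if (calls == 3) {
        throw new IllegalStateException("simulated failure in next()");
      }
      log.add("batch " + calls);
      return true;
    }

    @Override
    public void close() {
      log.add("closed");
    }
  }

  /** Drives the operator until it finishes or fails; resources are released either way. */
  static List<String> runFragment(Operator root, List<String> log) {
    try {
      while (root.next()) {
        // a real executor would forward each batch downstream here
      }
      log.add("completed");
    } catch (RuntimeException e) {
      log.add("failed: " + e.getMessage());
    } finally {
      root.close();
    }
    return log;
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    // prints [batch 1, batch 2, failed: simulated failure in next(), closed]
    System.out.println(runFragment(new FailingOperator(log), log));
  }
}
```

Putting the close() call in a finally block is what lets the failure of one query be contained: the operator's resources are released whether next() returned normally or threw.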

Clean-up
