Skip to content
Paul Rogers edited this page Jan 23, 2017 · 15 revisions

Memory management in Drill is a complex topic. Here we describe just enough to explain a serious design flaw that impairs our ability to implement effective resource management for memory. We come at the topic from three directions:

  • Explanation of Drill's memory allocator
  • Explanation of Drill batch sizes
  • Explanation of how the above two are in direct conflict.

Memory Manager

Drill is a columnar system, with data for each column stored in a value vector. Each vector is backed by a direct memory allocation. While Java provides robust management of heap memory, each application must invent its own manager for direct memory. Drill's manager is based on that provided by Netty, since Drill uses Netty to send (and thus release) and receive (and thus allocate) value vectors.

Memory management is a very complex topic; this discussion does not attempt to repeat that story. Rather, we focus on a number of critical aspects.

Java provides the Unsafe class that, via devious means, and application can use to allocate blocks of memory directly from the OS. Java provides no management at all: the application is responsible for freeing the memory when no longer needed (unlike heap memory, which is automatically garbage collected.)

This writer does not know how Java handles memory management below the Unsafe level. Is memory obtained from a call to the C malloc routine? Does Java provide its own memory manager above malloc? Presumably, the Unsafe memory manager is inefficient because Netty layers its own memory manager on top of Unsafe.

The Netty manager is implemented in the io.netty.buffer.PooledByteBufAllocator class. This class, in Drill, provides a per-thread, lock-free memory allocator. Basically, each minor fragment has its own pooled allocator instance. The "pooled" in the name is the key: the allocator maintains a free list of direct memory "chunks". In Drill, each chunk is 16 MB in size.

Allocation proceeds as follows:

  • Round the allocation up to the next power of 2. (A 5K request, say, is rounded up to 8K.)
  • Scan the pool's free list for an available chunk.
  • If a free chunk is available, and has sufficient capacity, carve off the desired allocation.
  • If the request is 16MB in size, the request uses the entire chunk. If smaller, the request slices up a chunk to provide the allocation.
  • If no free chunks exist or the request is larger than 16MB, request the memory directly from Unsafe.

Freeing proceeds as follows:

  • If the request is 16MB or larger, release memory back to the pool in 16 MB chunks.
  • If the request is smaller than 16 MB, find the existing chunk in the memory pool and mark the used portion as now free.

Notice the asymmetry: any size request can be made. But, releasing memory carves it up into 16 MB chunks. Large requests come from Unsafe, but releases of those blocks go into the pooled allocator as 16 MB chunks.

Record Batch Management

Drill works with records that typically contain multiple fields. Upon reading, Drill "rotates" the data from row to columnar format. Thus, a record with, say, 10 fields becomes a group of 10 value vectors. Vectors only make sense when used to hold a group of records. In Drill, that group (or, more specifically, the vectors that make up the group) are called a "record batch." (Not really, a "record batch" is the implementation of an operator that works on the record batch, but we shall ignore that confusion here and simply use the term to refer to the "bundle" of vectors.)

The above description immediately raises the question: how many records should appear in each batch? Should the number be fixed, or should it depend on the size of each record (or column)? The answer in Drill is, "it depends."

It seems Drill was originally designed so that all record batches hold 64K records. This number is an obvious choice: it is the maximum number addressable by a two-byte selection vector (a so-called "SV2.") This choice works well for records that consist of small numbers of numeric columns. For example, a record of 10 floats, needs only:

10 columns * 8 bytes per double * 64K rows = 5 MB

This is quite a small allocation for a big data system.

Later, it was realized (it seems) that now all rows are (or can be restricted to) such a favorable layout. Consider analytics over documents. Each row may be 10s of K in size. Document databases, which favor denormalized data, are often of this size. Now the math becomes:

1 column * 50K characters * 64K rows = 3.3 GB

It was realized that this size is just a bit too big for comfort. But, the 64K row goal is still desirable. So, various operators tried for compromises: choose 64K rows, or as much data as will fit into some target record batch size. This gave rise to two additional issues.

First, Drill provides no metadata information in order to predict (or track) column widths. Thus, Drill is blind to the size of record batches during execution. Apparently some operators found work-arounds (need to research.) For example, the "managed" external sort measures the size of the entire record batch, divides this by the record count and obtains a crude estimate of row width. (But, not information about individual columns.)

Second, the target batch size is a per-operator decision. The Flatten operator chose 512 MB. The text scanner appears to have chosen 128 MB. Other operators don't enforce size. The result is that there is no standard, no guideline for how large record batches can be (or even if size is a concern.)

Vector Serialization

Some operators serialize value vectors. Indeed, serialization is one of the key benefits of the vector format.

When creating a vector, Drill may (need to verify) rely on Netty's concept of a "composite" ByteBuf: a logical buffer that is backed by a collection of separate physical buffers. Or, Drill may allocate a contiguous buffer.

When serializing a vector, Drill (actually Netty) writes the bytes out as one continuous stream of bytes. Drill prepends type and length information.

When deserializing, Drill:

  • Obtains type information
  • Obtains the data length
  • Allocates a (single contiguous) buffer to hold the data
  • Reads data into the newly created buffer
  • Wraps the data in a value vector depending on type type information

Notice that deserialization is always contiguous, this will be important in the next section.

Memory Fragmentation of Death

We can now combine the above two descriptions to explain the fatal flaw in Drill's memory management design.

Consider a summary of the facts presented above:

  • Allocations of any size are permitted: those above 16 MB come directly from Unsafe.
  • Release of memory always goes into the pooled allocator with large allocations sliced into 16 MB chunks.
  • Drill targets 64K record batches, and has no effective knowledge of column width.
  • Deserialization reads an entire value vector into one contiguous buffer.

Now, consider what happens when input records have large columns, say 1K in size.

Clone this wiki locally