Paul Rogers edited this page Jan 23, 2017 · 15 revisions

Memory management in Drill is a complex topic. Here we describe just enough to explain a serious design flaw that impairs our ability to implement effective resource management for memory. We come at the topic from three directions:

  • Explanation of Drill's memory allocator
  • Explanation of Drill batch sizes
  • Explanation of how the above two are in direct conflict.

Memory Manager

Drill is a columnar system, with data for each column stored in a value vector. Each vector is backed by a direct memory allocation. While Java provides robust management of heap memory, each application must invent its own manager for direct memory. Drill's manager is based on that provided by Netty, since Drill uses Netty to send (and thus release) and receive (and thus allocate) value vectors.

Memory management is a very complex topic; this discussion does not attempt to repeat that story. Rather, we focus on a number of critical aspects.

Java provides the Unsafe class that, via devious means, an application can use to allocate blocks of memory directly from the OS. Java provides no management at all: the application is responsible for freeing the memory when no longer needed (unlike heap memory, which is automatically garbage collected.)

This writer does not know how Java handles memory management below the Unsafe level. Is memory obtained from a call to the C malloc routine? Does Java provide its own memory manager above malloc? Presumably the Unsafe-level allocation path is inefficient, since Netty found it worthwhile to layer its own memory manager on top of Unsafe.

The Netty manager is implemented in the io.netty.buffer.PooledByteBufAllocator class. This class, in Drill, provides a per-thread, lock-free memory allocator. Basically, each minor fragment has its own pooled allocator instance. The "pooled" in the name is the key: the allocator maintains a free list of direct memory "chunks". In Drill, each chunk is 16 MB in size.

Allocation proceeds as follows:

  • Round the allocation up to the next power of 2. (A 5K request, say, is rounded up to 8K.)
  • Scan the pool's free list for an available chunk.
  • If a free chunk is available, and has sufficient capacity, carve off the desired allocation.
  • If the request is 16MB in size, the request uses the entire chunk. If smaller, the request slices up a chunk to provide the allocation.
  • If no free chunks exist or the request is larger than 16MB, request the memory directly from Unsafe.
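The size classing in the first and last steps can be sketched as follows. These are hypothetical helper names for illustration; Netty's actual PooledByteBufAllocator is far more elaborate:

```java
// Toy sketch of the pooled allocator's size classing. Not Drill or
// Netty code; the helper names are invented for this illustration.
public class AllocMath {
    static final long CHUNK_SIZE = 16 * 1024 * 1024; // 16 MB chunks

    // Round a request up to the next power of two (a 5K request becomes 8K).
    static long roundUpPow2(long size) {
        long p = 1;
        while (p < size) p <<= 1;
        return p;
    }

    // Requests larger than one chunk bypass the pool and go to Unsafe.
    static boolean bypassesPool(long size) {
        return roundUpPow2(size) > CHUNK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(roundUpPow2(5 * 1024));            // prints 8192
        System.out.println(bypassesPool(32 * 1024 * 1024));   // prints true
    }
}
```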

Freeing proceeds as follows:

  • If the request is 16MB or larger, release memory back to the pool in 16 MB chunks. (Need to verify this...)
  • If the request is smaller than 16 MB, find the existing chunk in the memory pool and mark the used portion as now free.

Notice the asymmetry: an allocation request can be of any size, but released memory is carved into 16 MB chunks. Large requests come directly from Unsafe, yet when those blocks are released they go into the pooled allocator as 16 MB chunks.

Record Batch Management

Drill works with records that typically contain multiple fields. Upon reading, Drill "rotates" the data from row to columnar format. Thus, a record with, say, 10 fields becomes a group of 10 value vectors. Vectors only make sense when used to hold a group of records. In Drill, that group (or, more specifically, the set of vectors that holds the group) is called a "record batch." (Strictly speaking, a "record batch" is the implementation of an operator that works on the record batch, but we shall ignore that confusion here and simply use the term to refer to the "bundle" of vectors.)

The above description immediately raises the question: how many records should appear in each batch? Should the number be fixed, or should it depend on the size of each record (or column)? The answer in Drill is, "it depends."

It seems Drill was originally designed so that all record batches hold 64K records. This number is an obvious choice: it is the maximum number addressable by a two-byte selection vector (a so-called "SV2.") This choice works well for records that consist of a small number of numeric columns. For example, a record of 10 doubles needs only:

10 columns * 8 bytes per double * 64K rows = 5 MB

This is quite a small allocation for a big data system.

Later, it was realized (it seems) that not all rows have (or can be restricted to) such a favorable layout. Consider analytics over documents. Each row may be tens of kilobytes in size. Rows from document databases, which favor denormalized data, are often of this size. Now the math becomes:

1 column * 50K characters * 64K rows = 3.3 GB

This size is just a bit too big for comfort. But the 64K-row goal is still desirable. So various operators tried for compromises: choose 64K rows, or as much data as will fit into some target record batch size. This gave rise to two additional issues.
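The two batch-size calculations above can be checked mechanically. This is a toy calculation, not Drill code:

```java
// Back-of-the-envelope batch sizing, matching the examples in the text.
public class BatchSizeMath {
    // Bytes needed for a batch: columns * bytes per value * row count.
    static long batchBytes(long columns, long bytesPerValue, long rows) {
        return columns * bytesPerValue * rows;
    }

    public static void main(String[] args) {
        long rows = 64 * 1024; // the 64K rows addressable by an SV2

        // 10 double columns: 10 * 8 * 64K = 5 MB
        System.out.println(batchBytes(10, 8, rows) / (1024 * 1024)); // prints 5

        // 1 column of 50K characters: 1 * 50K * 64K = ~3.3 GB
        System.out.println(batchBytes(1, 50 * 1024, rows)); // prints 3355443200
    }
}
```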

First, Drill provides no metadata with which to predict (or track) column widths. Thus, Drill is blind to the size of record batches during execution. Apparently some operators found work-arounds (need to research.) For example, the "managed" external sort measures the size of the entire record batch and divides it by the record count to obtain a crude estimate of row width. (But no information about individual columns.)
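In code, that crude estimate amounts to the following sketch; the real sort's memory accounting is more involved:

```java
// Sketch of a batch-level row-width estimate, as described above:
// total batch bytes divided by record count. It yields an average
// row width but says nothing about individual columns.
public class RowWidthEstimate {
    static int estimateRowWidth(long batchBytes, int recordCount) {
        return (int) ((batchBytes + recordCount - 1) / recordCount); // round up
    }

    public static void main(String[] args) {
        // A 128 MB batch holding 16K records averages 8 KB per row.
        System.out.println(estimateRowWidth(128L * 1024 * 1024, 16 * 1024)); // prints 8192
    }
}
```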

Second, the target batch size is a per-operator decision. The Flatten operator chose 512 MB. The text scanner appears to have chosen 128 MB. Other operators don't enforce size. The result is that there is no standard, no guideline for how large record batches can be (or even if size is a concern.)

Vector Serialization

Some operators serialize value vectors. Indeed, serialization is one of the key benefits of the vector format.

When creating a vector, Drill may (need to verify) rely on Netty's concept of a "composite" ByteBuf: a logical buffer that is backed by a collection of separate physical buffers. Or, Drill may allocate a contiguous buffer.

When serializing a vector, Drill (actually Netty) writes the bytes out as one continuous stream of bytes. Drill prepends type and length information.

When deserializing, Drill:

  • Obtains type information
  • Obtains the data length
  • Allocates a (single contiguous) buffer to hold the data
  • Reads data into the newly created buffer
  • Wraps the data in a value vector based on the type information

Notice that deserialization is always contiguous; this will be important in the next section.
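A minimal sketch of those steps follows. The class and field names are hypothetical; Drill's actual deserialization goes through its own vector and buffer classes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class VectorDeserSketch {
    // Hypothetical stand-in for a deserialized vector.
    static class RawVector {
        final int typeCode;
        final byte[] data; // always one contiguous buffer
        RawVector(int typeCode, byte[] data) { this.typeCode = typeCode; this.data = data; }
    }

    // Mirrors the steps above: read type, read length, allocate a single
    // contiguous buffer, fill it, and wrap it according to the type.
    static RawVector read(DataInputStream in) throws IOException {
        int typeCode = in.readInt();   // obtain type information
        int length = in.readInt();     // obtain the data length
        byte[] buf = new byte[length]; // one contiguous allocation
        in.readFully(buf);             // read data into the new buffer
        return new RawVector(typeCode, buf);
    }

    // Matching writer: prepend type and length, then the raw bytes.
    static void write(DataOutputStream out, RawVector v) throws IOException {
        out.writeInt(v.typeCode);
        out.writeInt(v.data.length);
        out.write(v.data);
    }

    // Round-trip helper for demonstration.
    static RawVector roundTrip(RawVector v) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            write(new DataOutputStream(bos), v);
            return read(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```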

Memory Fragmentation of Death

We can now combine the above two descriptions to explain the fatal flaw in Drill's memory management design.

Consider a summary of the facts presented above:

  • Allocations of any size are permitted: those above 16 MB come directly from Unsafe.
  • Release of memory always goes into the pooled allocator with large allocations sliced into 16 MB chunks.
  • Drill targets 64K record batches, and has no effective knowledge of column width.
  • Deserialization reads an entire value vector into one contiguous buffer.

Now, consider what happens when input records have large columns, say 8K in size. The scanner attempts to create input batches of size 128 MB. Since our columns are 8K in size, this means 16K rows. (Or, the target record count is 16K and the batch size works out to 128 MB. Need to research to determine which is the case.)

The 128 MB is a single column. This batch is fed into the external sort. If the file is too large for memory (say an 18 GB file sorted in 3 GB of memory), the sort will spill. For efficiency, let's say we want large runs, of maybe 32 MB in size. During the run we continually receive batches, spill them, and free the underlying vectors. All these vectors go into Netty's pooled memory as 16 MB free chunks.

When reading each batch back from the spill file, we invoke the Drill deserializer. As explained above, Drill creates a single vector for each spilled vector. In this case, each vector is 32 MB in size.

Each such read requires a fresh allocation. Since the request is above 16 MB, it goes directly to Unsafe.

Now, here is the issue. We set the external sort's memory at 3GB, which is most of the memory given to the Drillbit. The fragment's own pooled allocator already owns the 3GB. When we request Unsafe to give us more (for the 32 MB allocation), there is no more to give.

The result is that the allocation fails with an out-of-memory (OOM) error. And it does so despite the fact that a vast pool of memory exists in the pooled allocator -- but divided up into 16 MB chunks, unusable to satisfy a 32 MB allocation.
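The failure mode can be reproduced in a toy model of the allocator. This is illustrative only; the real accounting in Netty and Drill's allocator is more subtle:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the OOM scenario above: a fixed budget shared between
// direct (Unsafe-style) allocations and a 16 MB chunked free list.
public class FragmentationDemo {
    static final long CHUNK = 16L * 1024 * 1024;

    static class Pool {
        final long budget;
        long directInUse = 0;
        final List<Long> freeChunks = new ArrayList<>();
        Pool(long budget) { this.budget = budget; }

        // Freed vectors land on the free list as 16 MB chunks.
        // (Simplified: assumes the bytes were direct allocations.)
        void release(long bytes) {
            directInUse -= bytes;
            for (long b = 0; b < bytes; b += CHUNK) freeChunks.add(CHUNK);
        }

        boolean allocate(long bytes) {
            if (bytes <= CHUNK && !freeChunks.isEmpty()) {
                freeChunks.remove(freeChunks.size() - 1); // serve from the pool
                return true;
            }
            // Larger requests need unclaimed budget; the free list's
            // chunks cannot be coalesced to serve them.
            long pooled = (long) freeChunks.size() * CHUNK;
            if (budget - directInUse - pooled >= bytes) {
                directInUse += bytes;
                return true;
            }
            return false; // OOM despite a large pooled free list
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool(3L * 1024 * 1024 * 1024); // 3 GB sort budget
        long batch = 128L * 1024 * 1024;
        for (int i = 0; i < 24; i++) pool.allocate(batch); // fill memory
        pool.release(24 * batch); // spill: all 3 GB returns as 16 MB chunks
        System.out.println(pool.allocate(32L * 1024 * 1024)); // prints false
    }
}
```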

Solutions

The above is, as was said, a fundamental flaw in Drill's memory management algorithm. It simply does not work to allow Drill to allocate any size of buffer, but only keep a free list of 16 MB chunks. Something has to change.

Better Free List Management?

One could dispense with the free list. However, direct memory has no garbage collector. If it did, then the GC might be able to shift data around in memory to coalesce free blocks into one large contiguous block. Since no GC is available, Drill (or Netty) must manage memory. Since a global malloc is expensive (due to locking), a local allocator is required.

However, the local allocator does not request all its memory at once; it does so in 16 MB chunks from the global allocator. As a result, the free chunks in the pool cannot be coalesced to form a large chunk when needed. Instead, a new, large chunk must be requested from the underlying system allocator.

Better Drill "Page Buffer" Management?

On the other hand, we could restrict Drill to allocate only blocks of 16 MB. Relational databases have long done this: each defines some fixed page size. All data fits into page buffers (perhaps spanning pages when needed.) A page buffer pool manager manages free, in-use and dirty pages, writing data to and from disk as needed.

Drill could use a similar idea: use fixed-size pages to back vectors. However, doing so requires an extensive rewrite of Drill's memory allocator, and requires changes to how Netty allocates and frees memory (since Netty is the source and/or sink of many Drill vectors.)

A further problem is that Drill would have to be aware of column widths and know how to limit vector sizes based on those widths. The required metadata does not, however, exist in Drill at present.
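A sketch of what page-backed vectors might look like. This is entirely hypothetical; nothing like it exists in Drill today:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical fixed-size page pool. Because all memory lives in whole
// pages, any freed page can satisfy any later request, and fragmentation
// of the kind described above becomes impossible.
public class PagePoolSketch {
    static final int PAGE_SIZE = 16 * 1024 * 1024; // one fixed page size

    static class PagePool {
        private final Deque<byte[]> free = new ArrayDeque<>();
        PagePool(int pages) {
            for (int i = 0; i < pages; i++) free.push(new byte[PAGE_SIZE]);
        }
        byte[] takePage() {
            if (free.isEmpty()) throw new IllegalStateException("out of pages");
            return free.pop();
        }
        void releasePage(byte[] page) { free.push(page); }
        int freePages() { return free.size(); }
    }

    // A "vector" backed by a list of pages rather than one contiguous buffer.
    static List<byte[]> allocateVector(PagePool pool, long bytes) {
        List<byte[]> pages = new ArrayList<>();
        for (long b = 0; b < bytes; b += PAGE_SIZE) pages.add(pool.takePage());
        return pages;
    }
}
```

The cost of this design is the one noted above: every consumer of a vector must cope with data that spans page boundaries instead of assuming one contiguous buffer.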

Use Java Heap

Yet another solution is to rely on the Java heap, and the Java garbage collector, for memory management. In the abstract, this is a bad idea, as Drill's frequent, large allocations will lead to excessive GC events. One must ask, however, whether the costs of avoiding GC outweigh the benefits of doing so. This is not an easy question to answer.

Force Flush of Free List

It may be possible to force a flush of all blocks on the free list back to the global pool. However, unlike the Java GC, the global pool has no way to move allocated blocks in order to gather and coalesce free blocks. It is thus not clear whether releasing a collection of 16 MB chunks back to the system will result in the ability to allocate larger chunks.

Increase the Chunk Size

If Drill allocates blocks larger than the 16 MB blocks that Netty tracks, perhaps Drill should increase the chunk size to the largest size that Drill will allocate. The problem is that this size is, essentially, unbounded. Drill works with record counts, not with column widths. We saw in an example above that columns with 50K of text would require allocations of over 3 GB. At that level, memory fragmentation, even in a large direct memory pool, will become a severe problem.

Practical Solutions

Of the above, only two appear practical:

  1. Switch to use heap memory.
  2. Design a buffer pool system for Drill.