Skip to content
Paul Rogers edited this page Jan 23, 2017 · 15 revisions

Memory management in Drill is a complex topic. Here we describe just enough to explain a serious design flaw that impairs our ability to implement effective resource management for memory. We come at the topic from three directions:

  • Explanation of Drill's memory allocator
  • Explanation of Drill batch sizes
  • Explanation of how the above two are in direct conflict.

Memory Manager

Drill is a columnar system, with data for each column stored in a value vector. Each vector is backed by a direct memory allocation. While Java provides robust management of heap memory, each application must invent its own manager for direct memory. Drill's manager is based on that provided by Netty, since Drill uses Netty to send (and thus release) and receive (and thus allocate) value vectors.

Memory management is a very complex topic; this discussion does not attempt to repeat that story. Rather, we focus on a number of critical aspects.

Java provides the Unsafe class that, via devious means, and application can use to allocate blocks of memory directly from the OS. Java provides no management at all: the application is responsible for freeing the memory when no longer needed (unlike heap memory, which is automatically garbage collected.)

This writer does not know how Java handles memory management below the Unsafe level. Is memory obtained from a call to the C malloc routine? Does Java provide its own memory manager above malloc? Presumably, the Unsafe memory manager is inefficient because Netty layers its own memory manager on top of Unsafe.

The Netty manager is implemented in the io.netty.buffer.PooledByteBufAllocator class. This class, in Drill, provides a per-thread, lock-free memory allocator. Basically, each minor fragment has its own pooled allocator instance. The "pooled" in the name is the key: the allocator maintains a free list of direct memory "chunks". In Drill, each chunk is 16 MB in size.

Allocation proceeds as follows:

  • Round the allocation up to the next power of 2. (A 5K request, say, is rounded up to 8K.)
  • Scan the pool's free list for an available chunk.
  • If a free chunk is available, and has sufficient capacity, carve off the desired allocation.
  • If the request is 16MB in size, the request uses the entire chunk. If smaller, the request slices up a chunk to provide the allocation.
  • If no free chunks exist or the request is larger than 16MB, request the memory directly from Unsafe.

Freeing proceeds as follows:

  • If the request is 16MB or larger, release memory back to the pool in 16 MB chunks. (Need to verify this...)
  • If the request is smaller than 16 MB, find the existing chunk in the memory pool and mark the used portion as now free.

Notice the asymmetry: any size request can be made. But, releasing memory carves it up into 16 MB chunks. Large requests come from Unsafe, but releases of those blocks go into the pooled allocator as 16 MB chunks.

Record Batch Management

Drill works with records that typically contain multiple fields. Upon reading, Drill "rotates" the data from row to columnar format. Thus, a record with, say, 10 fields becomes a group of 10 value vectors. Vectors only make sense when used to hold a group of records. In Drill, that group (or, more specifically, the vectors that make up the group) are called a "record batch." (Not really, a "record batch" is the implementation of an operator that works on the record batch, but we shall ignore that confusion here and simply use the term to refer to the "bundle" of vectors.)

The above description immediately raises the question: how many records should appear in each batch? Should the number be fixed, or should it depend on the size of each record (or column)? The answer in Drill is, "it depends."

It seems Drill was originally designed so that all record batches hold 64K records. This number is an obvious choice: it is the maximum number addressable by a two-byte selection vector (a so-called "SV2.") This choice works well for records that consist of small numbers of numeric columns. For example, a record of 10 floats, needs only:

10 columns * 8 bytes per double * 64K rows = 5 MB

This is quite a small allocation for a big data system.

Later, it was realized (it seems) that not all rows are (or can be restricted to) such a favorable layout. Consider analytics over documents. Each row may be 10s of K in size. Document databases, which favor denormalized data, are often of this size. Now the math becomes:

1 column * 50K characters * 64K rows = 3.3 GB

It was realized that this size is just a bit too big for comfort. But, the 64K row goal is still desirable. So, various operators tried for compromises: choose 64K rows, or as much data as will fit into some target record batch size. This gave rise to two additional issues.

First, Drill provides no metadata information in order to predict (or track) column widths. Thus, Drill is blind to the size of record batches during execution. Apparently some operators found work-arounds (need to research.) For example, the "managed" external sort measures the size of the entire record batch, divides this by the record count and obtains a crude estimate of row width. (But, not information about individual columns.)

Second, the target batch size is a per-operator decision. The Flatten operator chose 512 MB. The text scanner appears to have chosen 128 MB. Other operators don't enforce size. The result is that there is no standard, no guideline for how large record batches can be (or even if size is a concern.)

Vector Serialization

Some operators serialize value vectors. Indeed, serialization is one of the key benefits of the vector format.

When creating a vector, Drill may (need to verify) rely on Netty's concept of a "composite" ByteBuf: a logical buffer that is backed by a collection of separate physical buffers. Or, Drill may allocate a contiguous buffer.

When serializing a vector, Drill (actually Netty) writes the bytes out as one continuous stream of bytes. Drill prepends type and length information.

When deserializing, Drill:

  • Obtains type information
  • Obtains the data length
  • Allocates a (single contiguous) buffer to hold the data
  • Reads data into the newly created buffer
  • Wraps the data in a value vector depending on type type information

Notice that deserialization is always contiguous, this will be important in the next section.

Memory Fragmentation of Death

We can now combine the above two descriptions to explain the fatal flaw in Drill's memory management design.

Consider a summary of the facts presented above:

  • Allocations of any size are permitted: those above 16 MB come directly from Unsafe.
  • Release of memory always goes into the pooled allocator with large allocations sliced into 16 MB chunks.
  • Drill targets 64K record batches, and has no effective knowledge of column width.
  • Deserialization reads an entire value vector into one contiguous buffer.

Now, consider what happens when input records have large columns, say 8K in size. The scanner attempts to create input batches of size 128 MB. Since our columns are 1K in size, this means 16K rows. (Or, the target record count is 16K and batch size works out to 128MB. Need to research to determine which is the case.)

The 128MB is a single column. This batch is fed into the external sort. If the file is too large for memory (say a 18 GB file sorted in 3 GB of memory), the sort will spill. For efficiency, let's say we want large runs, of maybe 32 MB in size. During the run we continually receive batches, spill them, and free the underlying vectors. All these vectors go into Netty's pooled memory as 16 MB free chunks.

When reading each batch back from the spill file, we invoke the Drill deserializer. As explained above, Drill creates a single vector for each spilled vector. In this case, each vector is 32 MB in size.

On each such allocation, Drill asks for an allocation. Since the request is above 16 MB, the request goes directly to Unsafe.

Now, here is the issue. We set the external sort's memory at 3GB, which is most of the memory given to the Drillbit. The fragment's own pooled allocator already owns the 3GB. When we request Unsafe to give us more (for the 32 MB allocation), there is no more to give.

The result is that the allocation fails with an out-of-memory (OOM) error. And, does so despite the fact that a vast pool of memory exists in the pooled allocator -- but divided up into 16 MB chunks unusable to satisfy a 32 MB allocation.

Restating the Problem

The above should give you a good understanding of how we discovered the issues. At this point, it may be helpful to restate the problem clearly.

  • The original External Sort assumed all Varchar columns are 50 characters or less, even when they are actually larger. The original sort will run out of memory at random times with increasing probability as the width of columns increases above 50 bytes simply because the original sort does not know how bit the data actually is and so cannot properly plan spilling.
  • We revised the sort (to create the "managed" version). This version does plan spilling based on actual data size. This solves the above issue. As a result, users can now sort data with much larger widths, say up to 8K per column.
  • In most parts of Drill, record batches are sized by rows, not bytes. With the sort able to handle long columns, we now ask other parts of Drill to also handle long columns. Since batches are sized by rows, we end up allocating very large vectors to hold the fixed number of 8K columns.
  • The Drill (really Netty) memory allocator tracks memory in 16 MB chunks. With wide columns, we end up with buffers of very large sizes. That results in direct allocations from the system, bypassing the Netty free list.
  • With many large allocations of, essentially, random size, direct memory rapidly becomes fragmented.
  • The result is that, at some point, an allocation asks for a buffer greater than 16 MB, cannot be served by the ample 16 MB chunks on the free list, but the underlying direct memory has become so fragmented the it can't serve the request either. The query fails with OOM.
  • This issue was discovered as part of the Resource Management project. Here, we wish to maximize usage of memory: assigning each operator a known-size pool in which to operate, and adjust each operator to work within that pool. The more tightly we try to manage memory, the less "slop" exists to absorb memory losses due to fragmentation. So, the more we try to manage memory, the more likely we will encounter OOM errors in production.

The problem will not occur if columns are 256 bytes or less. (In this case 64K rows * 256 bytes = 16 MB buffer.) There is no magic number at which OOM errors are guaranteed to occur. As column widths increase (and as RM attempts to better manage memory), the more likely it becomes that fragmentation will trigger OOM.

Solutions

The above is, as was said, a fundamental flaw in Drill's memory management algorithm. It simply does not work to allow Drill to allocate any size of buffer, but only keep a free list of 16 MB chunks. Something has to change.

Better Free List Management?

One could dispense with the free list. However, direct memory has no garbage collector. If it did, then the GC might be able to shift data around in memory to coalesce free blocks into one large contiguous block. Since no GC is available, Drill (or Netty) must manage memory. Since a global malloc is expensive (due to locking), a local allocator is required.

However, the local allocator does not request all its memory at once; it does so in 16 MB chunks from the global allocator. As a result, the free chunks in the pool cannot be coalesced to form a large chunk when needed. Instead, a new, large chunk must be requested from the underlying system allocator.

Better Drill "Page Buffer" Management?

On the other hand, we could restrict Drill to allocate only blocks of 16 MB. Relational databases have long done this: each defines some fixed page size. All data fits into page buffers (perhaps spanning pages when needed.) A page buffer pool manager manages free, in-use and dirty pages, writing data to and from disk as needed.

Drill could use a similar idea: use fixed-size pages to back vectors. However, doing so requires an extensive rewrite of Drill's memory allocator, and requires changes to how Netty allocates and frees memory (since Netty is the source and/or sink of many Drill vectors.)

The problem also is that Drill would have to be aware of column widths and know how to limit vector sizes based on such widths. The required meta-data does not, however, exist in Drill at present.

Use Java Heap

Yet another solution is to rely on the Java heap, and Java garbage collector, for memory management. In the abstract, this is a bad idea as Drill's frequent, large allocations will lead to excessive GC events. One must ask, however, if the costs of avoiding GC outweigh the benefits of doing so. This is not an easy question to answer.

Force Flush of Free List

It may be possible to force a flush of all blocks on the free list back to the global pool. However, unlike Java GC, ,the global pool has no way to move allocated blocks in order to gather and coalesce free blocks. It is thus not clear whether releasing a collection of 16 MB chunks back to the system will result in the ability to allocate larger chunks.

Increase the Chunk Size

If Drill allocates blocks larger than the 16 MB blocks which Netty tracks, perhaps Drill should increase the chunk size to the largest that Drill will allocate. The problem is that the size is, essentially, unbounded. Drill works with record counts, not with column widths. We saw in an example above that columns with 10K of text would require allocations of over 3GB. At that level, memory fragmentation, even in a large direct memory pool, will become a severe problem.

Practical Solutions

Of the above, only two appear practical:

  1. Switch to use heap memory.
  2. Design a buffer pool system for Drill.
Clone this wiki locally