Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
The basic premise is that the authors are exploring ways to improve the efficiency of Transformers. They replace the usual dot-product (or scaled dot-product) attention mechanism, which is O(L^2) in complexity, with a locality-sensitive hashing scheme that reduces the complexity to O(L*log(L)). This, along with some changes to the residual connections, is the basis of the Reformer model.
The latest class of Transformer models has ~0.5B parameters per layer, up to 64 layers, and handles up to 11k tokens of text in a single example (i.e., really long text), as well as multi-modal inputs. These models can realistically only be trained in industrial labs and cannot be fine-tuned on a single GPU.
Memory Usage:
The authors observe that, at face value, you might expect the following math for memory usage:
- 0.5B floats per layer -> 2GB of memory.
- 0.5B floats for the embeddings per batch.
- 17GB for the full text corpus.
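To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch (assuming 4-byte float32 values; only the 0.5B figures come from the list above):

```python
# Back-of-the-envelope memory math, assuming 4-byte float32 values.
bytes_per_float = 4

# 0.5B parameters per layer -> ~2GB of weights.
layer_weights_gb = 0.5e9 * bytes_per_float / 1e9
print(f"Weights per layer: ~{layer_weights_gb:.1f} GB")          # ~2.0 GB

# 0.5B floats of embeddings per batch -> another ~2GB.
batch_embeddings_gb = 0.5e9 * bytes_per_float / 1e9
print(f"Embeddings per batch: ~{batch_embeddings_gb:.1f} GB")    # ~2.0 GB
```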
With this math, a Transformer layer should fit onto a modern GPU/TPU. However, this doesn't account for:
- Activations: storing O(N) activations for back-prop, especially in the FFN layers.
- Dense Attention: attention layers are O(L^2) in runtime and memory complexity.
This makes the memory explode.
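A rough sketch of why it explodes; the layer count echoes the numbers above, while the sequence length, model width, and head count are illustrative assumptions:

```python
# Rough estimate of the memory that the naive math above ignores.
bytes_per_float = 4      # float32
n_layers = 64            # deep model, as mentioned above
seq_len = 64_000         # very long sequence (illustrative)
d_model = 1024           # assumed model width
n_heads = 8              # assumed head count

# O(N): one (seq_len x d_model) activation tensor kept per layer for backprop.
stored_activations_gb = n_layers * seq_len * d_model * bytes_per_float / 1e9
print(f"Stored activations: ~{stored_activations_gb:.0f} GB")   # ~17 GB

# O(L^2): one (seq_len x seq_len) attention matrix per head, per example.
attention_scores_gb = n_heads * seq_len**2 * bytes_per_float / 1e9
print(f"Attention matrices: ~{attention_scores_gb:.0f} GB")     # ~131 GB
```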
Their model, the Reformer, implements the following ideas:
- Reversible layers: Following a paper by Uber that introduced the idea of reversible layers, they reduce the space complexity from O(N) to O(1) by storing only a single set of activations: the activations (and their derivatives) for the top layer of the stack, from which earlier activations are recomputed during backprop. See the sketch after this list.
- Splitting activations inside the FFN (processing the feed-forward computation in chunks).
- Replacing the dot-product attention mechanism (the biggest contribution): replace the usual O(L^2) attention with a "locality-sensitive hashing" attention that is O(L*log(L)) in space complexity.
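A minimal sketch of the reversible-residual coupling (illustrative Python, not the authors' code; `attn` and `ffn` stand in for the attention and feed-forward sub-layers):

```python
# Reversible block: the inputs (x1, x2) can be reconstructed exactly from the
# outputs (y1, y2), so per-layer activations don't have to be cached.
def reversible_forward(x1, x2, attn, ffn):
    y1 = x1 + attn(x2)   # couple the first half to attention over the second
    y2 = x2 + ffn(y1)    # couple the second half to the feed-forward output
    return y1, y2

def recover_inputs(y1, y2, attn, ffn):
    # Run during backprop: recompute the inputs instead of storing them.
    x2 = y2 - ffn(y1)
    x1 = y1 - attn(x2)
    return x1, x2
```

Because every layer's inputs can be recovered from its outputs, only the topmost layer's activations need to be kept, which is where the O(N) -> O(1) saving comes from.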
Assuming Q, K, V are (batch_size, max_seq_length, dim_model), the Q*K_transpose term is (batch_size, length, length), which can be huge in the case of very long sequences.
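A quick shape check with tiny, assumed dimensions, just to show where the (batch_size, length, length) tensor comes from:

```python
import numpy as np

batch_size, length, dim_model = 2, 8, 4   # tiny illustrative sizes
Q = np.random.randn(batch_size, length, dim_model)
K = np.random.randn(batch_size, length, dim_model)

# Q @ K^T is quadratic in the sequence length: (batch_size, length, length).
scores = Q @ K.transpose(0, 2, 1)
print(scores.shape)   # (2, 8, 8)
```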
How Locality-Sensitive Hashing (LSH) Works: (Btw, there is a fantastic explanation of this in the paper)
The basic idea presented is this: Q*K_transpose is large, but we are really interested in softmax(Q*K_transpose), which pushes the smallest products q_i*k_j towards 0 and the largest products q_i*k_j towards 1. Thus, the task is finding the largest q_i*k_j, which is a nearest-neighbors problem. The math details are in the paper, but this is done with a hashing function where nearby vectors get the same hash with high probability. There are issues with the unevenness of the hash buckets, etc., that are discussed in more detail in the paper, including creating more uniformity via multiple rounds of hashing.
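A minimal sketch of the angular (random-rotation) hashing idea: project each vector onto a few random directions and take the argmax over the projections and their negations as the bucket id, so vectors pointing in similar directions tend to share a bucket. The sizes and seeds below are assumptions for illustration.

```python
import numpy as np

def lsh_hash(vectors, n_buckets, seed=0):
    """Angular LSH: bucket id = argmax over random rotations of the vector.

    vectors: (n, d) array; n_buckets must be even. Nearby vectors get the
    same bucket with high probability.
    """
    d = vectors.shape[-1]
    rng = np.random.default_rng(seed)            # fixed seed = same hash function
    rotation = rng.standard_normal((d, n_buckets // 2))
    projected = vectors @ rotation               # (n, n_buckets/2)
    extended = np.concatenate([projected, -projected], axis=-1)
    return np.argmax(extended, axis=-1)          # (n,) bucket ids

rng = np.random.default_rng(42)
x = rng.standard_normal((4, 16))
x_nearby = x + 0.01 * rng.standard_normal((4, 16))   # small perturbation
print(lsh_hash(x, n_buckets=8))          # bucket id per vector
print(lsh_hash(x_nearby, n_buckets=8))   # usually identical to the line above
```

With buckets like these, each query only attends to keys that landed in the same bucket, which is what brings the cost down from O(L^2).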
(Figure from the original paper.)