LONGNET, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences🤔
Proposes dilated attention. LONGNET has the following advantages:
it has linear computational complexity and a logarithmic dependency between tokens;
it can serve as a distributed trainer for extremely long sequences;
its dilated attention is a drop-in replacement for standard attention and integrates seamlessly with existing Transformer-based optimizations.
Introduction
Prior work
RNN-style models were primarily used to increase the sequence length. However, their sequential nature limits parallelization during training, which is essential for long-sequence modeling.
More recently, state space models [GGR22, SWL23, FDS+23, PMN+23] have become appealing for sequence modeling. They can operate as a CNN during training and transform into an efficient RNN at test time. While they perform well on long-range benchmarks [TDA+21], their performance at regular lengths is not as good as that of Transformers, limited mainly by model expressivity [FPB+23].
Another strand of work scales the sequence length by decreasing the complexity of Transformers, e.g., by applying sliding windows or convolution modules over the attention. Nevertheless, this sacrifices the ability to recall early tokens, forgetting the prompts at the very beginning of the sequence. Sparse attention instead reduces the computation by sparsifying the attention matrix, preserving the possibility of recalling long-distant information.
Our solution is LONGNET, which replaces the attention of vanilla Transformers with a novel component named dilated attention.
Its design principle: attention allocation decreases exponentially as the distance between tokens grows.
In the implementation, LONGNET can be transformed into a dense Transformer, which seamlessly supports the off-the-shelf optimization for Transformers (e.g., kernel fusion, quantization, and distributed training).
Taking advantage of the linear complexity, LONGNET can parallelize the training across nodes, breaking the constraint of both computation and memory with a distributed algorithm.
LongNet
Preliminary
figure
Dilated Attention
figure
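A minimal single-head sketch of the mechanism, under my own naming and shapes (the paper does not prescribe an API; the causal mask and per-head offsets are omitted): split the sequence into segments of length w, keep every r-th row inside each segment, run standard attention within each sparsified segment, and scatter the outputs back to their original positions.

```python
import torch

def dilated_attention(q, k, v, w, r, offset=0):
    """Toy single-head dilated attention (illustrative, not the official code).

    q, k, v: (N, d) tensors; w: segment length; r: dilation ratio;
    offset: which of the r interleaved rows this head keeps.
    Assumes N % w == 0 and w % r == 0; causal masking is omitted.
    """
    n, d = q.shape
    out = torch.zeros_like(q)
    idx = torch.arange(offset, n, r)              # rows kept after dilation
    # Group the kept rows by segment: (num_segments, w // r, d).
    qs = q[idx].reshape(-1, w // r, d)
    ks = k[idx].reshape(-1, w // r, d)
    vs = v[idx].reshape(-1, w // r, d)
    # Plain softmax attention inside each sparsified segment.
    attn = torch.softmax(qs @ ks.transpose(-1, -2) / d ** 0.5, dim=-1)
    out[idx] = (attn @ vs).reshape(-1, d)         # scatter back to original positions
    return out
```

Rows dropped by one (w, r) configuration are covered by other configurations and by shifting the offset across heads, which is how the full sequence remains attended.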
Complexity Calculation
Equations and explanation: the per-segment attention costs sum to O(Nd), down from O(N²d) for vanilla attention; a sketch of the calculation follows.
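A hedged reconstruction of the calculation, using segment length $w_i$ and dilation ratio $r_i$; the constants follow my reading of the paper, so treat this as a sketch rather than a verbatim copy of the equations.

```latex
% Vanilla attention over a length-N sequence with head dimension d:
\mathrm{FLOPs}_{\text{vanilla}} = 2 N^{2} d

% One dilated attention splits the sequence into N / w_i segments and keeps
% every r_i-th row in each segment, so each segment attends over w_i / r_i tokens:
\mathrm{FLOPs}(w_i, r_i) = 2 \cdot \frac{N}{w_i} \left( \frac{w_i}{r_i} \right)^{2} d
                         = \frac{2 N w_i d}{r_i^{2}}

% Summing the k mixtures with geometrically growing (w_i, r_i) gives a
% geometric series, so the total cost is linear in N:
\mathrm{FLOPs} = 2 N d \sum_{i=1}^{k} \frac{w_i}{r_i^{2}} = O(N d)
```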
LONGNET as a Distributed Trainer: Scaling up to 1B Tokens
Although the computation complexity of dilated attention has been greatly reduced to O(Nd), it is infeasible to scale the sequence length to the million-token level on a single GPU device due to computation and memory constraints.
If a segment length is shorter than the sequence length supported on a single device, attention is computed locally.
If a segment length exceeds what a single device supports, the key-value pairs are first collected across devices with an all-gather operation.
Finally, cross-attention is computed between the local queries and the global key-value pairs obtained from the all-gather (sketched below).
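A rough sketch of the key/value all-gather step above, with my own function names (torch.distributed collectives only; the real trainer also overlaps communication with computation and only gathers the already-sparsified keys and values, which keeps the communication cost independent of the sequence length):

```python
import torch
import torch.distributed as dist

def gather_global_kv(k_local, v_local):
    """All-gather each device's (sparsified) key/value chunk so every device
    can run cross-attention between its local queries and the global
    key-value pairs. Shapes: (tokens_per_device, d). Sketch only."""
    world_size = dist.get_world_size()
    k_chunks = [torch.empty_like(k_local) for _ in range(world_size)]
    v_chunks = [torch.empty_like(v_local) for _ in range(world_size)]
    dist.all_gather(k_chunks, k_local)
    dist.all_gather(v_chunks, v_local)
    return torch.cat(k_chunks, dim=0), torch.cat(v_chunks, dim=0)

# Per device, conceptually:
#   if segment_length <= tokens_per_device:      # step 1: purely local attention
#       out = dilated_attention(q_local, k_local, v_local, w, r)
#   else:                                        # steps 2-3: gather, then cross-attend
#       k_glob, v_glob = gather_global_kv(k_local, v_local)
#       out = cross_attention(q_local, k_glob, v_glob)   # cross_attention is hypothetical
```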
Starting from 8K, we gradually scale the sequence length until the limit of GPU memory.
We reduce the batch size accordingly to keep the number of tokens per batch at 1 billion.
Each model of different sequence lengths has up to 3 segment lengths, which are 2,048, the number of tokens per device, and the full sequence length.
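Put differently (my own arithmetic, not a table from the paper): with the tokens per batch pinned at 1B, the batch size shrinks in inverse proportion to the sequence length L, down to a single sequence once L itself reaches 1B tokens.

```latex
\text{batch size} \times L = 10^{9}\ \text{tokens}
\;\Rightarrow\;
\text{batch size} = \frac{10^{9}}{L},
\qquad L = 10^{9} \Rightarrow \text{batch size} = 1
```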
Experiments on Language Modeling
Setup
The backbone architecture is MAGNETO [WMH+22] with XPOS [SDP+22] relative position encoding, except that we replace the standard attention with our dilated attention
The model is pre-trained on The Stack dataset, a large source-code corpus.
The data is preprocessed with the tiktoken tokenizer using the cl100k_base encoding.
All experiments are conducted based on the torchscale [MWH+22] codebase.
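For reference, the cl100k_base encoding mentioned above can be loaded in one line with the tiktoken library (a generic usage example, not the paper's actual preprocessing script):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("def hello():\n    print('hi')")  # source code -> token ids
text = enc.decode(token_ids)                             # round-trip back to text
```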
What is MAGNETO?
Foundation Transformers for multimodal models, released by Microsoft in October 2022.
For LONGNET, we use segment lengths of w = {2048,4096,8192,16384,32768}, and the dilated ratios are r = {1,2,4,6,12}
All of our implementations of attention variants are based on FlashAttention for training efficiency
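To illustrate how the configured (w, r) pairs might be combined, here is a toy mixture reusing the dilated_attention sketch from above. This is my own simplification: the paper combines the different dilated attentions with dynamic weights derived from the attention softmax denominators, whereas this sketch just averages, and it only uses pairs that satisfy the divisibility assumptions of the toy code.

```python
import torch

def mixed_dilated_attention(q, k, v, pairs=((2048, 1), (4096, 2), (8192, 4))):
    """Toy mixture over several (segment length, dilation ratio) pairs.
    Plain average instead of the paper's dynamic softmax-denominator weights."""
    outs = [dilated_attention(q, k, v, w, r) for w, r in pairs]
    return torch.stack(outs).mean(dim=0)
```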
Scaling up Model Size
To verify whether LONGNET follows a similar scaling law, we train a series of models with different model sizes, from 125 million (40B tokens) to 2.7 billion parameters (300B tokens).
The results show that LONGNET still follows the power law. This implies that a dense Transformer is not a prerequisite for scaling language models; LONGNET obtains both scalability and efficiency.
Conclusion
The core of LONGNET is dilated attention, which reduces the computation complexity from quadratic to linear.
LONGNET can serve as a distributed trainer that parallelizes the training of a sequence across multiple GPU devices.