In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example
The result is a sparsely-activated model—with an outrageous number of parameters—but a constant computational cost
Only a subset of the experts is activated, so it is a sparse model; the parameter count is huge, but since only a part of it is used per example, the computational cost stays constant
Issues
widespread adoption has been hindered by complexity, communication costs, and training instability.
Contributions
design models based off T5-Base and T5-Large (Raffel et al., 2019) to obtain up to 7x increases in pre-training speed with the same computational resources.
Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus”, and achieve a 4x speedup over the T5-XXL model.
Trainable at the trillion-parameter scale
Modeling is tested on T5
Introduction
a sparsely-activated expert model: the Switch Transformer. In our case the sparsity comes from activating a subset of the neural network weights for each incoming example.
The MoE paradigm is adopted to get an efficient sparse algorithm
To have an efficient sparse algorithm, we start with the Mixture-of-Expert (MoE) paradigm (Jacobs et al., 1991; Jordan and Jacobs, 1994; Shazeer et al., 2017), and simplify it to yield training stability and computational benefits
Our contributions are the following:
The Switch Transformer architecture, which simplifies and improves over Mixture of Experts.
Scaling properties and a benchmark against the strongly tuned T5 model (Raffel et al., 2019) where we measure 7x+ pre-training speedups while still using the same FLOPS per token. We further show the improvements hold even with limited computational resources, using as few as two experts.
Successful distillation of sparse pre-trained and specialized fine-tuned models into small dense models. We reduce the model size by up to 99% while preserving 30% of the quality gains of the large sparse teacher.
Improved pre-training and fine-tuning techniques: (1) selective precision training that enables training with lower bfloat16 precision (2) an initialization scheme that allows for scaling to a larger number of experts and (3) increased expert regularization that improves sparse model fine-tuning and multi-task training.
A measurement of the pre-training benefits on multilingual data where we find a universal improvement across all 101 languages and with 91% of languages benefiting from 4x+ speedups over the mT5 baseline (Xue et al., 2020).
An increase in the scale of neural language models achieved by efficiently combining data, model, and expert-parallelism to create models with up to a trillion parameters. These models improve the pre-training speed of a strongly tuned T5-XXL baseline by 4x.
Switch Transformer
The guiding design principle for Switch Transformers is to maximize the parameter count of a Transformer model (Vaswani et al., 2017) in a simple and computationally efficient way.
investigate a fourth axis:
increase the parameter count while keeping the floating point operations (FLOPs) per example constant.
2.1 Simplifying Sparse Routing
Mixture of Expert Routing
Mixture-of-Experts (MoE) layer which takes as an input a token representation x and then routes this to the best determined top-k experts
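A minimal sketch (not the paper's code) of how such a top-k gate can be computed for one token; `top_k_gate` and `router_weights` are illustrative names, the renormalization over the selected experts is one common choice, and the noise term used by Shazeer et al. (2017) is omitted:

```python
import numpy as np

def top_k_gate(x, router_weights, k=2):
    """Top-k MoE gate for one token representation x (illustrative sketch).

    x:              [d_model] token representation
    router_weights: [d_model, num_experts] learned router matrix
    Returns the indices of the selected experts and their gate values.
    """
    logits = x @ router_weights                  # [num_experts]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over all experts
    top_k = np.argsort(probs)[-k:][::-1]         # best-determined top-k experts
    gates = probs[top_k] / probs[top_k].sum()    # renormalize over the chosen experts
    return top_k, gates
```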
Switch Routing: Rethinking Mixture-of-Experts.
routing to k > 1 experts was necessary in order to have non-trivial gradients to the routing functions. The authors intuited that learning to route would not work without the ability to compare at least two experts
Makes sense: you would obviously need at least two candidates to compare before routing could be learned
Contrary to these ideas, we instead use a simplified strategy where we route to only a single expert. We show this simplification preserves model quality, reduces routing computation and performs better. This k = 1 routing strategy is later referred to as a Switch layer
So routing to just one expert is the simplified strategy: it preserves model quality, reduces routing computation, and even performs better, and this k = 1 routing strategy is called a Switch layer. Curious how that works.
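A minimal sketch of the k = 1 routing, under the assumption that each expert is a callable feed-forward block; the function and variable names are mine, not the paper's. The output is scaled by the router probability of the chosen expert, which is what gives the router a gradient even though only one expert is used.

```python
import numpy as np

def switch_layer_token(x, router_weights, experts):
    """k = 1 Switch routing for a single token (illustrative sketch).

    x:              [d_model] token representation
    router_weights: [d_model, num_experts]
    experts:        list of callables, each mapping [d_model] -> [d_model]
    """
    logits = x @ router_weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # router probabilities p_i(x)
    i = int(np.argmax(probs))             # route to the single best expert
    # Gate the expert output by its router probability so the router still gets a gradient.
    return probs[i] * experts[i](x)
```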
The benefits for the Switch layer are three-fold:
(1) The router computation is reduced as we are only routing a token to a single expert.
(2) The batch size (expert capacity) of each expert can be at least halved since each token is only being routed to a single expert.
(3) The routing implementation is simplified and communication costs are reduced.
All of our tensor shapes are statically determined at compilation time, but our computation is dynamic due to the routing decisions at training and inference. Because of this, one important technical consideration is how to set the expert capacity.
A capacity factor greater than 1.0 creates additional buffer to accommodate for when tokens are not perfectly balanced across experts.
If too many tokens are routed to an expert (referred to later as dropped tokens), computation is skipped and the token representation is passed directly to the next layer through the residual connection.
If too many tokens pile onto one expert, token drops happen and the token representation just passes on to the next layer through the residual connection
Interesting concept
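A sketch of how the capacity bound and token dropping could look for one batch; the expert-capacity formula, (tokens per batch / number of experts) × capacity factor, is from the paper, while the function itself and its priority-by-position behavior are my simplification.

```python
import numpy as np

def dispatch_with_capacity(expert_ids, num_experts, capacity_factor=1.0):
    """Mark which tokens are dropped once their expert's buffer is full (illustrative).

    expert_ids: [num_tokens] expert chosen by the router for each token in the batch
    Dropped tokens skip the expert and are passed on through the residual connection.
    """
    num_tokens = len(expert_ids)
    # Expert capacity = (tokens per batch / number of experts) * capacity factor.
    capacity = int(np.ceil(num_tokens / num_experts * capacity_factor))
    counts = np.zeros(num_experts, dtype=int)
    dropped = np.zeros(num_tokens, dtype=bool)
    for t, e in enumerate(expert_ids):    # earlier positions get priority in this sketch
        if counts[e] < capacity:
            counts[e] += 1                # token fits into this expert's buffer
        else:
            dropped[t] = True             # over capacity: computation is skipped
    return dropped, capacity
```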
Increasing the expert capacity is not without drawbacks, however, since high values will result in wasted computation and memory.
Important to raise it only moderately
Empirically we find ensuring lower rates of dropped tokens are important for the scaling of sparse expert-models.
Still, the experiments show that keeping the rate of dropped tokens low is very important for scaling up sparse expert models
Throughout our experiments we didn’t notice any dependency on the number of experts for the number of tokens dropped (typically < 1%)
Experimentally, no dependency was found between the number of experts and the number of dropped tokens
Using the auxiliary load balancing loss (next section) with a high enough coefficient ensured good load balancing.
Using the auxiliary load balancing loss is said to help a bit more
What is that exactly? Interesting
A Differentiable Load Balancing Loss
To encourage a balanced load across experts we add an auxiliary loss
Switch Transformers simplifies the original design in Shazeer et al. (2017) which had separate load-balancing and importance-weighting losses.
For each Switch layer, this auxiliary loss is added to the total model loss during training.
An auxiliary loss is added to encourage balanced routing (so that tokens don't get dropped)
The interesting part is that the loss is just the fraction of tokens routed to each expert multiplied by the router probability assigned to that expert; the ideal case is a uniform distribution, and looking at the loss expression itself, it is minimized exactly when both vectors take the value 1/N
Worth thinking this through a bit more
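For reference, the loss over N experts is α · N · Σ_i f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability assigned to expert i; a small numpy sketch (names are mine, not the paper's code):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_ids, num_experts, alpha=1e-2):
    """Auxiliary load-balancing loss: alpha * N * sum_i f_i * P_i (illustrative sketch).

    router_probs: [num_tokens, num_experts] softmax outputs of the router
    expert_ids:   [num_tokens] argmax expert chosen for each token
    """
    f = np.bincount(expert_ids, minlength=num_experts) / len(expert_ids)  # fraction dispatched
    P = router_probs.mean(axis=0)                                         # mean router probability
    # Both f and P sum to 1, so the dot product is minimized under uniform routing (1/N each).
    return alpha * num_experts * float(np.sum(f * P))
```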
2.3 Putting It All Together: The Switch Transformer
A head-to-head comparison of the Switch Transformer and the MoE Transformer is presented in Table 1
Note that the MoE model going from capacity factor 2.0 to 1.25 actually slows down (840 to 790) in the above experiment setup, which is unexpected.
Note that speed measurements are both a function of the algorithm and the implementation details.
Switch Transformer reduces the necessary computation relative to MoE (algorithm), but the final speed differences are impacted by low-level optimizations (implementation).
highlight three key findings from Table 1:
(1) Switch Transformers outperform both carefully tuned dense models and MoE Transformers on a speed-quality basis. For a fixed amount of computation and wall-clock time, Switch Transformers achieve the best result.
(2) The Switch Transformer has a smaller computational footprint than the MoE counterpart. If we increase its size to match the training speed of the MoE Transformer, we find this outperforms all MoE and Dense models on a per step basis as well.