In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example
The result is a sparsely-activated model—with an outrageous number of parameters—but a constant computational cost
Only a subset of the experts is activated, so it is a sparse model; the parameter count is huge, but since only a part of it is used per example, the computational cost stays constant
Issues
widespread adoption has been hindered by complexity, communication costs, and training instability.
Contributions
design models based off T5-Base and T5-Large (Raffel et al., 2019) to obtain up to 7x increases in pre-training speed with the same computational resources.
Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus”, and achieve a 4x speedup over the T5-XXL model.
Trainable at the trillion-parameter scale
Modeling is tested on T5
Introduction
a sparsely-activated expert model: the Switch Transformer. In our case the sparsity comes from activating a subset of the neural network weights for each incoming example.
The MoE paradigm is adopted to get an efficient sparse algorithm
To have an efficient sparse algorithm, we start with the Mixture-of-Expert (MoE) paradigm (Jacobs et al., 1991; Jordan and Jacobs, 1994; Shazeer et al., 2017), and simplify it to yield training stability and computational benefits
Our contributions are the following:
The Switch Transformer architecture, which simplifies and improves over Mixture of Experts.
Scaling properties and a benchmark against the strongly tuned T5 model (Raffel et al., 2019) where we measure 7x+ pre-training speedups while still using the same FLOPS per token. We further show the improvements hold even with limited computational resources, using as few as two experts.
Successful distillation of sparse pre-trained and specialized fine-tuned models into small dense models. We reduce the model size by up to 99% while preserving 30% of the quality gains of the large sparse teacher.
Improved pre-training and fine-tuning techniques: (1) selective precision training that enables training with lower bfloat16 precision (2) an initialization scheme that allows for scaling to a larger number of experts and (3) increased expert regularization that improves sparse model fine-tuning and multi-task training.
A measurement of the pre-training benefits on multilingual data where we find a universal improvement across all 101 languages and with 91% of languages benefiting from 4x+ speedups over the mT5 baseline (Xue et al., 2020).
An increase in the scale of neural language models achieved by efficiently combining data, model, and expert-parallelism to create models with up to a trillion parameters. These models improve the pre-training speed of a strongly tuned T5-XXL baseline by 4x.
Switch Transformer
The guiding design principle for Switch Transformers is to maximize the parameter count of a Transformer model (Vaswani et al., 2017) in a simple and computationally efficient way.
investigate a fourth axis:
increase the parameter count while keeping the floating point operations (FLOPs) per example constant.
2.1 Simplifying Sparse Routing
Mixture of Expert Routing
Mixture-of-Experts (MoE) layer which takes as an input a token representation x and then routes this to the best determined top-k experts
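A minimal sketch (not the paper's code) of how such a top-k gate can be computed for one token; `top_k_gate` and `router_weights` are illustrative names, the renormalization over the selected experts is one common choice, and the noise term used by Shazeer et al. (2017) is omitted:

```python
import numpy as np

def top_k_gate(x, router_weights, k=2):
    """Top-k MoE gate for one token representation x (illustrative sketch).

    x:              [d_model] token representation
    router_weights: [d_model, num_experts] learned router matrix
    Returns the indices of the selected experts and their gate values.
    """
    logits = x @ router_weights                  # [num_experts]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over all experts
    top_k = np.argsort(probs)[-k:][::-1]         # best-determined top-k experts
    gates = probs[top_k] / probs[top_k].sum()    # renormalize over the chosen experts
    return top_k, gates
```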
Switch Routing: Rethinking Mixture-of-Experts.
routing to k > 1 experts was necessary in order to have non-trivial gradients to the routing functions. The authors intuited that learning to route would not work without the ability to compare at least two experts
Makes sense: you would obviously need at least two candidates to compare before routing could be learned
Contrary to these ideas, we instead use a simplified strategy where we route to only a single expert. We show this simplification preserves model quality, reduces routing computation and performs better. This k = 1 routing strategy is later referred to as a Switch layer
So routing to just one expert is the simplified strategy: it preserves model quality, reduces routing computation, and even performs better, and this k = 1 routing strategy is called a Switch layer. Curious how that works.
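A minimal sketch of the k = 1 routing, under the assumption that each expert is a callable feed-forward block; the function and variable names are mine, not the paper's. The output is scaled by the router probability of the chosen expert, which is what gives the router a gradient even though only one expert is used.

```python
import numpy as np

def switch_layer_token(x, router_weights, experts):
    """k = 1 Switch routing for a single token (illustrative sketch).

    x:              [d_model] token representation
    router_weights: [d_model, num_experts]
    experts:        list of callables, each mapping [d_model] -> [d_model]
    """
    logits = x @ router_weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # router probabilities p_i(x)
    i = int(np.argmax(probs))             # route to the single best expert
    # Gate the expert output by its router probability so the router still gets a gradient.
    return probs[i] * experts[i](x)
```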
The benefits for the Switch layer are three-fold:
(1) The router computation is reduced as we are only routing a token to a single expert.
(2) The batch size (expert capacity) of each expert can be at least halved since each token is only being routed to a single expert.
(3) The routing implementation is simplified and communication costs are reduced.
All of our tensor shapes are statically determined at compilation time, but our computation is dynamic due to the routing decisions at training and inference. Because of this, one important technical consideration is how to set the expert capacity.
A capacity factor greater than 1.0 creates additional buffer to accommodate for when tokens are not perfectly balanced across experts.
If too many tokens are routed to an expert (referred to later as dropped tokens), computation is skipped and the token representation is passed directly to the next layer through the residual connection.
If too many tokens pile onto one expert, token drops happen and the token representation just passes on to the next layer through the residual connection
Interesting concept
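A sketch of how the capacity bound and token dropping could look for one batch; the expert-capacity formula, (tokens per batch / number of experts) × capacity factor, is from the paper, while the function itself and its priority-by-position behavior are my simplification.

```python
import numpy as np

def dispatch_with_capacity(expert_ids, num_experts, capacity_factor=1.0):
    """Mark which tokens are dropped once their expert's buffer is full (illustrative).

    expert_ids: [num_tokens] expert chosen by the router for each token in the batch
    Dropped tokens skip the expert and are passed on through the residual connection.
    """
    num_tokens = len(expert_ids)
    # Expert capacity = (tokens per batch / number of experts) * capacity factor.
    capacity = int(np.ceil(num_tokens / num_experts * capacity_factor))
    counts = np.zeros(num_experts, dtype=int)
    dropped = np.zeros(num_tokens, dtype=bool)
    for t, e in enumerate(expert_ids):    # earlier positions get priority in this sketch
        if counts[e] < capacity:
            counts[e] += 1                # token fits into this expert's buffer
        else:
            dropped[t] = True             # over capacity: computation is skipped
    return dropped, capacity
```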
Increasing the expert capacity is not without drawbacks, however, since high values will result in wasted computation and memory.
Important to raise it only moderately
Empirically we find ensuring lower rates of dropped tokens are important for the scaling of sparse expert-models.
Still, the experiments show that keeping the rate of dropped tokens low is very important for scaling up sparse expert models
Throughout our experiments we didn’t notice any dependency on the number of experts for the number of tokens dropped (typically < 1%)
Experimentally, no dependency was found between the number of experts and the number of dropped tokens
Using the auxiliary load balancing loss (next section) with a high enough coefficient ensured good load balancing.
Using the auxiliary load balancing loss is said to help a bit more
What is that exactly? Interesting
A Differentiable Load Balancing Loss
To encourage a balanced load across experts we add an auxiliary loss
Switch Transformers simplifies the original design in Shazeer et al. (2017) which had separate load-balancing and importance-weighting losses.
For each Switch layer, this auxiliary loss is added to the total model loss during training.
An auxiliary loss is added to encourage balanced routing (so that tokens don't get dropped)
The interesting part is that the loss is just the fraction of tokens routed to each expert multiplied by the router probability assigned to that expert; the ideal case is a uniform distribution, and looking at the loss expression itself, it is minimized exactly when both vectors take the value 1/N
Worth thinking this through a bit more
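For reference, the loss over N experts is α · N · Σ_i f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability assigned to expert i; a small numpy sketch (names are mine, not the paper's code):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_ids, num_experts, alpha=1e-2):
    """Auxiliary load-balancing loss: alpha * N * sum_i f_i * P_i (illustrative sketch).

    router_probs: [num_tokens, num_experts] softmax outputs of the router
    expert_ids:   [num_tokens] argmax expert chosen for each token
    """
    f = np.bincount(expert_ids, minlength=num_experts) / len(expert_ids)  # fraction dispatched
    P = router_probs.mean(axis=0)                                         # mean router probability
    # Both f and P sum to 1, so the dot product is minimized under uniform routing (1/N each).
    return alpha * num_experts * float(np.sum(f * P))
```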
2.3 Putting It All Together: The Switch Transformer
A head-to-head comparison of the Switch Transformer and the MoE Transformer is presented in Table 1
Note that the MoE model going from capacity factor 2.0 to 1.25 actually slows down (840 to 790) in the above experiment setup, which is unexpected.
Note that speed measurements are both a function of the algorithm and the implementation details.
Switch Transformer reduces the necessary computation relative to MoE (algorithm), but the final speed differences are impacted by low-level optimizations (implementation).
highlight three key findings from Table 1:
(1) Switch Transformers outperform both carefully tuned dense models and MoE Transformers on a speed-quality basis. For a fixed amount of computation and wall-clock time, Switch Transformers achieve the best result.
(2) The Switch Transformer has a smaller computational footprint than the MoE counterpart. If we increase its size to match the training speed of the MoE Transformer, we find this outperforms all MoE and Dense models on a per step basis as well.