PaLM 2 Technical Report #29

eagle705 commented Jun 1, 2023

Note

Author

  • Google

Abstract

  • new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM
  • PaLM 2 is a Transformer-based model trained using a mixture of objectives similar to UL2

Introduction

  • PaLM 2 incorporates the following diverse set of research advances:
    • Compute-optimal scaling:
      • data and model size should be scaled roughly 1:1 to achieve the best performance for a given amount of training compute (as opposed to past trends, which scaled the model 3× faster than the dataset).
    • Improved dataset mixtures:
      • designed a more multilingual and diverse pre-training mixture, which extends across hundreds of languages and domains (e.g., programming languages, mathematics, and parallel multilingual documents)
    • Architectural and objective improvements:
      • Given the strong results of UL2 (Tay et al., 2023), we use a tuned mixture of different pre-training objectives in this model to train the model to understand different aspects of language
  • The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. Our evaluation results show that PaLM 2 models significantly outperform PaLM on a variety of tasks.
  • PaLM 2 includes control tokens to enable inference-time control over toxicity, modifying only a fraction of pre-training as compared to prior work (Korbak et al., 2023).
  • Special `‘canary’ token sequences` were injected into PaLM 2 pre-training data to enable `improved measures of memorization across languages` (Carlini et al., 2019, 2021); a rough sketch of this kind of canary injection follows below.
  • PaLM 2 has lower average rates of verbatim memorization than PaLM, and for tail languages we observe that memorization rates increase above English only when data is repeated several times across documents.
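A minimal, hypothetical sketch of canary injection for memorization measurement, in the spirit of Carlini et al.; the function names, token-id corpus representation, and injection scheme are my assumptions, not the report's actual recipe.

```python
import random

def make_canary(vocab_size: int, length: int = 50, seed: int = 0) -> list[int]:
    """A random token-id sequence that is vanishingly unlikely to occur naturally."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(length)]

def inject_canary(documents: list[list[int]], canary: list[int],
                  n_copies: int, seed: int = 0) -> None:
    """Append the canary to n_copies randomly chosen documents (in place).

    Varying n_copies lets one later study memorization as a function of how often
    a sequence is repeated across documents, as in the tail-language note above.
    """
    rng = random.Random(seed)
    for doc in rng.sample(documents, k=min(n_copies, len(documents))):
        doc.extend(canary)
```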

Scaling law experiments

  • Two earlier studies, Kaplan et al. at OpenAI ("Scaling Laws for Neural Language Models") and Hoffmann et al. at DeepMind (Chinchilla), ran similar analyses but arrived at different values (e.g., different optimal ratios; per Chinchilla, N and D should grow in equal proportions).
    • Scaling Transformer language models has become a popular way to achieve state-of-the-art performance. Kaplan et al. (2020) studied the relationship between scaling the amount of training data (D) and model size (N), and reached the empirical conclusion that it follows a power law, with N needing to grow faster than D. Hoffmann et al. (2022) built upon this observation with a similar study that tuned smaller models’ hyperparameters better. Their results corroborated Kaplan et al. (2020)’s power law conclusion; however, they arrived at different results regarding the optimal ratios, showing that N and D should instead grow in equal proportions.
  • This work reportedly arrives at a result similar to the Chinchilla study (see the short derivation after this list).
    • We arrive at a similar conclusion as Hoffmann et al. (2022), i.e., D and N should grow in equal proportions. We then explore the effect of scaling laws on downstream metrics.
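A quick sanity check of what "equal proportions" implies (my derivation, not from the report), combining it with the FLOPs ≈ 6ND heuristic used in the next subsection:

$$
C \approx 6ND,\ \frac{N}{D}\ \text{fixed} \quad\Rightarrow\quad N_{\text{opt}} \propto C^{1/2},\qquad D_{\text{opt}} \propto C^{1/2},
$$

so a 10× compute budget corresponds to roughly a √10 ≈ 3.2× larger model trained on ≈3.2× more tokens.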

Scaling laws

  • For each compute budget, we use the heuristic FLOPs ≈ 6ND (Kaplan et al., 2020) to determine how many tokens to train each model for.
    • Should look into how the FLOPs count is derived (see the sketch below).
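A minimal sketch of that heuristic, assuming the usual reading of roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass; the helper names are mine, not the paper's.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """FLOPs ≈ 6ND: ~2 FLOPs/param/token forward + ~4 FLOPs/param/token backward."""
    return 6.0 * n_params * n_tokens

def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """How many training tokens a given FLOP budget buys for a model of size n_params."""
    return compute_flops / (6.0 * n_params)

# e.g. a 1e22-FLOP budget with an 8B-parameter model buys roughly 2.1e11 (~210B) tokens
print(f"{tokens_for_budget(1e22, 8e9):.2e}")
```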
  • We perform quadratic fits for each isoFLOP band, then plot the resulting optimal N and optimal D against FLOPs (see the sketch below).
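A hedged sketch of that procedure with synthetic numbers; the data and helper names are illustrative only and not taken from the report.

```python
import numpy as np

def optimal_n_for_band(model_sizes, losses):
    """Quadratic fit of final loss vs. log(N) within one isoFLOP band; the vertex
    -b/(2a) of a*x^2 + b*x + c gives the loss-minimizing model size."""
    x = np.log(np.asarray(model_sizes, dtype=float))
    a, b, _c = np.polyfit(x, np.asarray(losses, dtype=float), deg=2)
    return float(np.exp(-b / (2.0 * a)))

def fit_power_law(flop_budgets, optima):
    """Fit optimum ≈ k * C^alpha in log-log space across budgets; returns (alpha, k)."""
    alpha, log_k = np.polyfit(np.log(flop_budgets), np.log(optima), deg=1)
    return float(alpha), float(np.exp(log_k))

# Synthetic example: one isoFLOP band (C = 1e22) with three model sizes and made-up losses.
n_opt = optimal_n_for_band([1e9, 4e9, 16e9], [2.30, 2.10, 2.18])
d_opt = 1e22 / (6.0 * n_opt)  # optimal tokens for that band via FLOPs ≈ 6ND
print(f"optimal N ≈ {n_opt:.2e}, optimal D ≈ {d_opt:.2e}")
```

Repeating `optimal_n_for_band` across several compute budgets and passing the results to `fit_power_law` yields the scaling exponents for optimal N and D versus FLOPs.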