PaLM 2 is a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor, PaLM.
It is a Transformer-based model trained using a mixture of objectives similar to UL2.
Introduction
PaLM 2 incorporates the following diverse set of research advances:
Compute-optimal scaling:
data and model size should be scaled roughly 1:1 to achieve the best performance for a given amount of training compute (as opposed to past trends, which scaled the model 3× faster than the dataset).
Improved dataset mixtures:
designed a more multilingual and diverse pre-training mixture, which extends across hundreds of languages and domains (e.g., programming languages, mathematics, and parallel multilingual documents)
Architectural and objective improvements:
Given the strong results of UL2 (Tay et al., 2023), a tuned mixture of different pre-training objectives is used to train the model to understand different aspects of language.
The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. Our evaluation results show that PaLM 2 models significantly outperform PaLM on a variety of tasks.
- PaLM 2 includes control tokens to enable inference-time control over toxicity, modifying only a fraction of pre-training as compared to prior work (Korbak et al., 2023). Special `'canary' token sequences` were injected into the PaLM 2 pre-training data to enable `improved measures of memorization across languages` (Carlini et al., 2019, 2021).
- PaLM 2 has lower average rates of verbatim memorization than PaLM, and for tail languages we observe that memorization rates increase above English only when data is repeated several times across documents.
Scaling law experiments
Two earlier studies, Kaplan et al. (OpenAI, "Scaling Laws for Neural Language Models") and Hoffmann et al. (DeepMind, Chinchilla), took similar approaches but arrived at different values, e.g., for the optimal ratios: according to Chinchilla, N and D should grow at the same rate.
Scaling Transformer language models has become a popular way to achieve state-of-the-art performance. Kaplan et al. (2020) studied the relationship between scaling the amount of training data (D) and model size (N), and reached the empirical conclusion that it follows a power law, with N needing to grow faster than D. Hoffmann et al. (2022) built upon this observation with a similar study that tuned smaller models’ hyperparameters better. Their results corroborated Kaplan et al. (2020)’s power law conclusion; however, they arrived at different results regarding the optimal ratios, showing that N and D should instead grow in equal proportions.
This work reports a result similar to the Chinchilla study.
We arrive at a similar conclusion as Hoffmann et al. (2022), i.e., D and N should grow in equal proportions. We then explore the effect of scaling laws on downstream metrics.
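As a quick sanity check on what "equal proportions" implies (my own back-of-the-envelope derivation, not a result quoted from the report): combining the FLOPs heuristic used below, C ≈ 6ND, with the compute-optimal condition N ∝ D gives both quantities growing roughly as the square root of compute.

```latex
% Own derivation under the stated assumptions, not taken from the paper.
% Assume training compute C \approx 6ND and compute-optimal N = kD for a constant k.
C \approx 6ND = 6kD^{2}
  \;\Rightarrow\; D_{\mathrm{opt}} \propto C^{1/2},
  \qquad N_{\mathrm{opt}} = k\,D_{\mathrm{opt}} \propto C^{1/2}
```

Under this reading, doubling both model size and data requires roughly 4× the training compute.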
Scaling laws
For each compute budget, we use the heuristic FLOPs ≈ 6ND (Kaplan et al., 2020) to determine how many tokens to train each model for.
Worth checking how the FLOPs are actually computed.
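As a concrete reading of that heuristic (a minimal sketch; the helper name and example numbers are mine, not from the report): for a fixed compute budget C in FLOPs and a candidate model size N, the implied number of training tokens is D ≈ C / (6N).

```python
def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Training tokens D implied by the heuristic FLOPs ~= 6 * N * D.

    The factor 6 is the usual rough cost of one forward+backward pass
    per parameter per token (Kaplan et al., 2020).
    """
    return compute_flops / (6.0 * n_params)


# Illustrative numbers only: a 1e22-FLOP budget and an 8B-parameter model.
budget = 1e22
n = 8e9
print(f"D ~= {tokens_for_budget(budget, n):.3e} tokens")  # ~2.08e11 tokens
```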
For each isoFLOPS band, a quadratic fit of final loss against model size is performed; the minimum of each fit gives the compute-optimal N (and, via the FLOPs heuristic, the corresponding D) for that budget.
Plotting these optimal Ns and optimal Ds against FLOPs then shows how the compute-optimal model and data sizes scale with the compute budget.
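A sketch of that fitting procedure as I read it (synthetic data, numpy only; none of this code or its numbers come from the report): within each isoFLOPS band, fit a quadratic of final loss against log model size, take its minimum as the compute-optimal N, then fit a line to log N* versus log C to get the power-law exponent.

```python
import numpy as np

def optimal_n_per_band(bands):
    """For each isoFLOPS band, fit loss ~ quadratic(log10 N) and return
    (FLOPs, argmin N) pairs. `bands` maps a FLOP budget to a list of
    (n_params, final_loss) points trained at that budget."""
    out = []
    for flops, points in bands.items():
        log_n = np.log10([n for n, _ in points])
        loss = np.array([l for _, l in points])
        a, b, c = np.polyfit(log_n, loss, deg=2)   # loss ≈ a*x^2 + b*x + c
        x_min = -b / (2 * a)                       # vertex of the parabola
        out.append((flops, 10 ** x_min))
    return out

def power_law_exponent(pairs):
    """Fit log10(N_opt) = alpha * log10(C) + const and return alpha."""
    log_c = np.log10([c for c, _ in pairs])
    log_n = np.log10([n for _, n in pairs])
    alpha, _ = np.polyfit(log_c, log_n, deg=1)
    return alpha

# Tiny synthetic example: three isoFLOPS bands, three model sizes each.
bands = {
    1e19: [(1e8, 3.10), (3e8, 2.95), (1e9, 3.05)],
    1e20: [(3e8, 2.80), (1e9, 2.60), (3e9, 2.70)],
    1e21: [(1e9, 2.45), (3e9, 2.25), (1e10, 2.35)],
}
pairs = optimal_n_per_band(bands)
print("alpha ~=", round(power_law_exponent(pairs), 2))  # ~0.5 for these made-up points
```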