PaLM 2 Technical Report #29

eagle705 commented Jun 1, 2023

Note

Author

  • Google

Abstract

  • new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM
  • PaLM 2 is a Transformer-based model trained using a mixture of objectives similar to UL2

Introduction

  • PaLM 2 incorporates the following diverse set of research advances:
    • Compute-optimal scaling:
      • data and model size should be scaled roughly 1:1 to achieve the best performance for a given amount of training compute (as opposed to past trends, which scaled the model 3× faster than the dataset).
    • Improved dataset mixtures:
      • designed a more multilingual and diverse pre-training mixture, which extends across hundreds of languages and domains (e.g., programming languages, mathematics, and parallel multilingual documents)
    • Architectural and objective improvements:
      • Given the strong results of UL2 (Tay et al., 2023), we use a tuned mixture of different pre-training objectives in this model to train the model to understand different aspects of language
  • The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. Our evaluation results show that PaLM 2 models significantly outperform PaLM on a variety of tasks.
  • PaLM 2 includes control tokens to enable inference-time control over toxicity, modifying only a fraction of pre-training as compared to prior work (Korbak et al., 2023).
  • Special `‘canary’ token sequences` were injected into PaLM 2 pre-training data to enable `improved measures of memorization across languages` (Carlini et al., 2019, 2021); a rough sketch of this kind of canary injection follows below.
  • PaLM 2 has lower average rates of verbatim memorization than PaLM, and for tail languages we observe that memorization rates increase above English only when data is repeated several times across documents.
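A minimal, hypothetical sketch of canary injection for memorization measurement, in the spirit of Carlini et al.; the function names, token-id corpus representation, and injection scheme are my assumptions, not the report's actual recipe.

```python
import random

def make_canary(vocab_size: int, length: int = 50, seed: int = 0) -> list[int]:
    """A random token-id sequence that is vanishingly unlikely to occur naturally."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(length)]

def inject_canary(documents: list[list[int]], canary: list[int],
                  n_copies: int, seed: int = 0) -> None:
    """Append the canary to n_copies randomly chosen documents (in place).

    Varying n_copies lets one later study memorization as a function of how often
    a sequence is repeated across documents, as in the tail-language note above.
    """
    rng = random.Random(seed)
    for doc in rng.sample(documents, k=min(n_copies, len(documents))):
        doc.extend(canary)
```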

Scaling law experiments

  • Two earlier studies, Kaplan et al. at OpenAI ("Scaling Laws for Neural Language Models") and Hoffmann et al. at DeepMind (Chinchilla), ran similar analyses but arrived at different values (e.g., different optimal ratios; per Chinchilla, N and D should grow in equal proportions).
    • Scaling Transformer language models has become a popular way to achieve state-of-the-art performance. Kaplan et al. (2020) studied the relationship between scaling the amount of training data (D) and model size (N), and reached the empirical conclusion that it follows a power law, with N needing to grow faster than D. Hoffmann et al. (2022) built upon this observation with a similar study that tuned smaller models’ hyperparameters better. Their results corroborated Kaplan et al. (2020)’s power law conclusion; however, they arrived at different results regarding the optimal ratios, showing that N and D should instead grow in equal proportions.
  • This work reportedly arrives at a result similar to the Chinchilla study (see the short derivation after this list).
    • We arrive at a similar conclusion as Hoffmann et al. (2022), i.e., D and N should grow in equal proportions. We then explore the effect of scaling laws on downstream metrics.
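A quick sanity check of what "equal proportions" implies (my derivation, not from the report), combining it with the FLOPs ≈ 6ND heuristic used in the next subsection:

$$
C \approx 6ND,\ \frac{N}{D}\ \text{fixed} \quad\Rightarrow\quad N_{\text{opt}} \propto C^{1/2},\qquad D_{\text{opt}} \propto C^{1/2},
$$

so a 10× compute budget corresponds to roughly a √10 ≈ 3.2× larger model trained on ≈3.2× more tokens.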

Scaling laws

  • For each compute budget, we use the heuristic FLOPs ≈ 6ND (Kaplan et al., 2020) to determine how many tokens to train each model for.
    • Should look into how the FLOPs count is derived (see the sketch below).
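A minimal sketch of that heuristic, assuming the usual reading of roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass; the helper names are mine, not the paper's.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """FLOPs ≈ 6ND: ~2 FLOPs/param/token forward + ~4 FLOPs/param/token backward."""
    return 6.0 * n_params * n_tokens

def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """How many training tokens a given FLOP budget buys for a model of size n_params."""
    return compute_flops / (6.0 * n_params)

# e.g. a 1e22-FLOP budget with an 8B-parameter model buys roughly 2.1e11 (~210B) tokens
print(f"{tokens_for_budget(1e22, 8e9):.2e}")
```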
  • We perform quadratic fits for each isoFLOP band, then plot the resulting optimal N and optimal D against FLOPs (see the sketch below).
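A hedged sketch of that procedure with synthetic numbers; the data and helper names are illustrative only and not taken from the report.

```python
import numpy as np

def optimal_n_for_band(model_sizes, losses):
    """Quadratic fit of final loss vs. log(N) within one isoFLOP band; the vertex
    -b/(2a) of a*x^2 + b*x + c gives the loss-minimizing model size."""
    x = np.log(np.asarray(model_sizes, dtype=float))
    a, b, _c = np.polyfit(x, np.asarray(losses, dtype=float), deg=2)
    return float(np.exp(-b / (2.0 * a)))

def fit_power_law(flop_budgets, optima):
    """Fit optimum ≈ k * C^alpha in log-log space across budgets; returns (alpha, k)."""
    alpha, log_k = np.polyfit(np.log(flop_budgets), np.log(optima), deg=1)
    return float(alpha), float(np.exp(log_k))

# Synthetic example: one isoFLOP band (C = 1e22) with three model sizes and made-up losses.
n_opt = optimal_n_for_band([1e9, 4e9, 16e9], [2.30, 2.10, 2.18])
d_opt = 1e22 / (6.0 * n_opt)  # optimal tokens for that band via FLOPs ≈ 6ND
print(f"optimal N ≈ {n_opt:.2e}, optimal D ≈ {d_opt:.2e}")
```

Repeating `optimal_n_for_band` across several compute budgets and passing the results to `fit_power_law` yields the scaling exponents for optimal N and D versus FLOPs.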