NEFTune: Noisy Embeddings Improve Instruction Finetuning #33

Open
eagle705 opened this issue Nov 8, 2023 · 0 comments
eagle705 commented Nov 8, 2023

Impressions

  • https://github.com/neelsjain/NEFTune
  • What happens if the noise is added to layers other than the embeddings?
  • Generation quality improves, but leaderboard performance stays about the same..?!
  • The hypothesis is that the noise acts as a regularizer that prevents overfitting to the tuning set
(images: example 1 / example 2)

Author

  • Neel Jain1∗, Ping-yeh Chiang1∗, Yuxin Wen1∗, John Kirchenbauer1, Hong-Min Chu1, Gowthami Somepalli1, Brian R. Bartoldson2, Bhavya Kailkhura2, Avi Schwarzschild1, Aniruddha Saha1, Micah Goldblum3, Jonas Geiping1, Tom Goldstein1
    • 1 University of Maryland, 2 Lawrence Livermore National Laboratory, 3 New York University

Abstract

  • NEFTune adds noise to the embedding vectors during training
  • Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings
  • Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement
  • Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune

Introduction

  • propose to add random noise to the embedding vectors of the training data during the forward pass of fine-tuning
  • this simple trick can improve the outcome of instruction fine-tuning, often by a large margin, with no additional compute or data overhead
  • Noisy Embedding Instruction Fine Tuning (NEFTune)
  • When a raw LLM like LLaMA-2-7B is finetuned with noisy embeddings, its performance on AlpacaEval improves from 29.8% to 64.7% (Figure 1)

NEFTUNE: NOISY EMBEDDING INSTRUCTION FINETUNING

  • [1] Each step of NEFTune begins by sampling an instruction from the dataset, and converting its tokens to embedding vectors
  • [2] NEFTune then departs from standard training by adding a random noise vector to the embeddings
    • The noise is generated by sampling iid uniform entries, each in the range [−1, 1]
    • then scaling the entire noise vector by a factor of α/√(Ld), where L is the sequence length, d is the embedding dimension, and α is a tunable parameter
    • The scaling rule is taken from the adversarial ML literature (FreeLB: Enhanced Adversarial Training for Natural Language Understanding); see the sketch below
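A minimal PyTorch sketch of steps [1]–[2] above (the function name is mine, and using the full padded sequence length for L is a simplification; the official repo scales per example by the actual, non-padded length):

```python
import torch

def add_neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune noise to token embeddings (training time only).

    embeddings: (batch, seq_len, dim) output of the embedding layer
    alpha: tunable noise scale (the paper uses 5 for 7B models, 15 for 70B)
    """
    _, seq_len, dim = embeddings.shape
    # iid uniform noise in [-1, 1], scaled by alpha / sqrt(L * d)
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0)
    return embeddings + noise * (alpha / (seq_len * dim) ** 0.5)
```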

EXPERIMENTAL SET-UP

MODELS

  • conduct the majority of our experiments using 7B parameter LLMs.
  • LLaMA-1 (trained on 1T tokens), LLaMA-2 (2T tokens), and OPT-6.7B (180B tokens)
  • Additionally, we improve RLHF models by finetuning the highly refined LLaMA-2-Chat (7B) model.

INSTRUCTION FINETUNING DATASETS

  • Alpaca (Taori et al., 2023): was constructed using the Self-Instruct
  • Evol-Instruct (Xu et al., 2023): contains 70k single-turn instructions (more complex than Alpaca)
  • Open-Platypus (Lee et al., 2023): a dataset amalgamated from 11 open-source datasets, curated specifically to improve LLM performance in STEM and logical domains
    • contains 25k questions, where ≈ 10% are LLM-generated and the remainder human-written
  • ShareGPT (Chiang et al., 2023): is a dataset of 70K voluntarily-shared ChatGPT conversations (ShareGPT, 2023)
    • Although ShareGPT is multi-turn, we use the dataset version from Vicuna-v1.1 and split the multi-turn conversations closer to a single-turn format

System prompt & HyperParams

  • System prompt
    • Additionally, we finetune all models with the Alpaca system prompt, except for ShareGPT, where we use the Vicuna system prompt.
  • HyperParams
    • 7B Models
      • alpha = 5
      • bf16
      • 5e-5 lr, Adam Optim
      • 3 epochs
      • 128 global batch size (4 GPUs × per-device bs 4 × 8 gradient accumulation steps = 128)
      • AlpacaEval using ChatGPT as the evaluator
      • sequence lengths of 512 tokens (mainly for memory and speed)
    • 70B Models
      • use alpha = 15
  • alpha is a tunable parameter (see the config sketch below)
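The 7B settings above, expressed as a Hugging Face TrainingArguments sketch (this mapping is mine, not the authors' training script; the neftune_noise_alpha field assumes a transformers version with built-in NEFTune support, otherwise the noise has to be added manually as sketched earlier):

```python
from transformers import TrainingArguments

# Hypothetical mapping of the paper's 7B hyperparameters.
args = TrainingArguments(
    output_dir="llama2-7b-neftune",   # placeholder output path
    per_device_train_batch_size=4,    # bs 4 per GPU
    gradient_accumulation_steps=8,    # 4 GPUs x 4 x 8 = 128 global batch
    learning_rate=5e-5,               # Adam-family optimizer (default AdamW)
    num_train_epochs=3,
    bf16=True,
    neftune_noise_alpha=5.0,          # alpha = 5 for 7B (15 for 70B)
)
```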

EVALUATION

  • AlpacaEval
    • AlpacaEval is an automatic model-based evaluation that compares Text-Davinci-003 generations to the model generations over 805 instructions with the Win Rate reported.
      • The 805 test prompts are scraped from Vicuna, koala, Anthropic’s hh-rlhf, and other sources
    • The Win Rate is the rate at which the model in question is preferred to Text-Davinci-003, as determined by a model evaluator (GPT-4)
    • (In this work) they use both GPT-4 and ChatGPT as evaluators, with ChatGPT as a precursor test to decide which models to evaluate on GPT-4, due to the cost and API restrictions of GPT-4
  • Hugging Face OpenLLM Leaderboard
  • verbalized multiclass classification datasets ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), and TruthfulQA (Lin et al., 2022).

RESULTS

  • NEFTune Improves Text Quality
  • NEFTune Can Improve Chat Models
    • Table 2 (GPT-4 evaluation) shows that chat models also improve
      • They improve even though RLHF has already been applied
  • Effect on Capabilities
    • evaluate on the OpenLLM Leaderboard tasks, using the LM-Eval Harness (Gao et al., 2021) implementation of MMLU, ARC, HellaSwag, and TruthfulQA
    • No real improvement in these capability scores; they are roughly maintained. So text generation quality goes up, but benchmark performance does not..
  • NEFTune Works with QLORA (Qlora: Efficient finetuning of quantized llms, 2023)
    • NEFTune also helps improve performance when finetuning with QLoRA
  • A Qualitative Example
(qualitative example images: English / Korean)

ANALYSIS

  • We hypothesize that by adding noise to the embeddings at train time, the model overfits less to the specifics of the instruction-tuning dataset, such as formatting details, exact wording, and text length.
  • Longer, more verbose, completions are preferred by both human and machine evaluators on most datasets (Dubois et al., 2023), but we find that the increased verbosity is only the most visible side-effect from the reduced overfitting to the instruction distribution; increased verbosity alone cannot explain the measured gains in performance
    • i.e., the visible effect is that responses become more verbose

OVERFITTING

  • examine the training loss of both models on the Alpaca dataset (both are evaluated without noise) and the “testing” loss on the Evol-Instruct dataset
  • Compared to the baseline, the train loss goes up but the test loss goes down -> why did they use a different dataset for the test loss? Was there no train/test split within each dataset..?
  • In contrast, generating with greedy decoding on the training prompts gives lower ROUGE-L and BLEU (numbers that would normally be read as worse performance) -> this seems intended as evidence that the model overfits less
  • In contrast, NEFTune models overfit less without reduction in performance on the test set, and do not “lock-in” to the exact wording of the instruction data, as seen in the ROUGE-L metric.
    • Interpretation: no performance drop on the test set, while the model does not get locked into the training data's exact wording (see the ROUGE-L sketch below)
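A sketch of the overfitting probe described above, assuming a hypothetical model_generate(prompt) helper that does greedy decoding with the finetuned model (the rouge-score package is one way to get ROUGE-L; the authors' exact scoring setup is not specified here):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def train_set_rouge_l(model_generate, train_examples):
    """Average ROUGE-L F1 between greedy generations on the training
    prompts and the ground-truth training responses. A high score means
    the model reproduces the training responses closely (overfitting);
    NEFTune models score lower here without losing test performance.
    """
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = []
    for prompt, response in train_examples:
        generation = model_generate(prompt)  # hypothetical greedy-decoding helper
        scores.append(scorer.score(response, generation)["rougeL"].fmeasure)
    return sum(scores) / max(len(scores), 1)
```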

LENGTH VERSUS TOKEN DIVERSITY

  • The fixed length cutoffs were 50 for models trained on Alpaca, 100 for Evol-Instruct, 150 for ShareGPT, and 150 for OpenPlatypus.
  • We choose the chunk lengths so that at least half of the generations were longer than the cutoff, and sequences of insufficient length were dropped. The diversity scores we compute are a summary measure of 2-, 3-, and 4-gram repetition rates called log-diversity (from "On the Reliability of Watermarks for Large Language Models")
  • we see that NEFT models generate longer outputs than their counterparts. However, we also see that the 2-gram repetition rates as well as overall token log-diversity for models trained with and without NEFT are nearly identical, providing evidence that the longer responses do not come at the expense of repetition, and instead provide additional details
    • NEFT generates longer text, but its 2-gram repetition rate and log-diversity are nearly the same as without NEFT, so the longer generations are not just repetition; they add additional details (see the diversity sketch after this list)
  • Longer outputs do not necessarily mean a higher win rate
  • Gaussian noise induces even longer outputs, but does not come with improved performance
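A rough sketch of the repetition / diversity measurement (the per-n repetition rate follows the usual unique-n-gram definition; collapsing the 2-/3-/4-gram rates into a single "log-diversity" number follows the common diversity formulation and is my assumption about the exact transform the paper uses):

```python
import math

def ngram_repetition(tokens, n):
    """Fraction of n-grams in `tokens` that are duplicates of another n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def log_diversity(tokens, ns=(2, 3, 4)):
    """Log of the product of unique-n-gram ratios over n = 2, 3, 4
    (closer to 0 = less repetitive; exact paper formula may differ)."""
    diversity = 1.0
    for n in ns:
        diversity *= 1.0 - ngram_repetition(tokens, n)
    return math.log(max(diversity, 1e-12))
```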

LENGTH IS (NOT) ALL YOU NEED


CONCLUSIONS AND LIMITATIONS

  • The success of NEFTune points to the often ignored importance of algorithms and regularizers for LLM training
  • Unlike the computer vision community, which has studied regularization and overfitting for years, the LLM community tends to use standardized training loops that are designed for optimizer stability and not generalization
  • Given the consistent gains of NEFTune, and the tendency to overfit on small instruction datasets, it seems that regularization deserves to be revisited in the LLM setting.
  • Our study has several limitations
    • adopt AlpacaEval as our central measure of instruction-following ability for LLMs, which is subject to the biases of a single judge (GPT-4).
    • Additionally, due to limited compute resources, we were not able to validate the success of NEFTune on larger 70B variants across multiple datasets,
    • and we had to rely on fixed hyper-parameters for most NEFTune runs rather than sweeping.
    • Finally, despite our empirical studies, we do not have a conclusive understanding of why NEFTune works.

Appendix

  • alpha seems to be set in the 5–15 range
@eagle705 eagle705 self-assigned this Nov 8, 2023