Neel Jain1∗, Ping-yeh Chiang1∗, Yuxin Wen1∗, John Kirchenbauer1, Hong-Min Chu1, Gowthami Somepalli1, Brian R. Bartoldson2, Bhavya Kailkhura2, Avi Schwarzschild1, Aniruddha Saha1, Micah Goldblum3, Jonas Geiping1, Tom Goldstein1
1 University of Maryland, 2 Lawrence Livermore National Laboratory, 3 New York University
Abstract
NEFTune adds noise to the embedding vectors during training
Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings
Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement
Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune
Introduction
propose to add random noise to the embedding vectors of the training data during the forward pass of fine-tuning
this simple trick can improve the outcome of instruction fine-tuning, often by a large margin, with no additional compute or data overhead
Noisy Embedding Instruction Fine Tuning (NEFTune)
When a raw LLM like LLaMA-2-7B is finetuned with noisy embeddings, its performance on AlpacaEval improves from 29.8% to 64.7% (Figure 1)
NEFTUNE: NOISY EMBEDDING INSTRUCTION FINETUNING
[1] Each step of NEFTune begins by sampling an instruction from the dataset, and converting its tokens to embedding vectors
[2] NEFTune then departs from standard training by adding a random noise vector to the embeddings
The noise is generated by sampling iid uniform entries, each in the range [−1, 1]
then scaling the entire noise vector by a factor of α/√Ld, where L is the sequence length, d is the embedding dimension, and α is a tunable parameter
The scaling rule is taken from the adversarial ML literature (FreeLB: Enhanced Adversarial Training for Natural Language Understanding)
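The noise step described above can be sketched in a few lines. This is a minimal NumPy illustration of the paper's description (iid uniform entries in [−1, 1], scaled by α/√(Ld)), not the authors' implementation; the function and argument names are mine.

```python
import numpy as np

def add_neftune_noise(embeddings: np.ndarray, alpha: float, rng=None) -> np.ndarray:
    """Add NEFTune-style noise to one sequence of token embeddings.

    embeddings: array of shape (L, d) - sequence length L, embedding dim d.
    alpha: the tunable noise scale from the paper (e.g. 5, 10, or 15).
    """
    rng = rng if rng is not None else np.random.default_rng()
    L, d = embeddings.shape
    # iid uniform entries in [-1, 1], scaled by alpha / sqrt(L * d)
    noise = rng.uniform(-1.0, 1.0, size=(L, d))
    return embeddings + (alpha / np.sqrt(L * d)) * noise
```

In actual finetuning this would be applied to the output of the embedding layer during the forward pass, at training time only.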
EXPERIMENTAL SET-UP
MODELS
conduct the majority of our experiments using 7B parameter LLMs.
LLaMA-1 (tokens: 1T), LLaMA-2 (tokens: 2T), and OPT-6.7B (tokens: 180B)
Additionally, we improve RLHF models by finetuning the highly refined LLaMA-2-Chat (7B) model.
INSTRUCTION FINETUNING DATASETS
Alpaca (Taori et al., 2023): was constructed using the Self-Instruct
Evol-Instruct (Xu et al., 2023): contains 70k single-turn instructions (more complex than Alpaca)
Open-Platypus (Lee et al., 2023): a dataset amalgamated from 11 open-source datasets, curated specifically to improve LLM performance in STEM and logical domains
contains 25k questions, of which ≈10% are LLM-generated and the remainder human-written.
ShareGPT (Chiang et al., 2023): is a dataset of 70K voluntarily-shared ChatGPT conversations (ShareGPT, 2023)
Although ShareGPT is multi-turn, we use the dataset version from Vicuna-v1.1 and split the multi-turn conversations closer to a single-turn format
System prompt & HyperParams
System prompt
Additionally, we finetune all models with the Alpaca system prompt, except for ShareGPT, where we use the Vicuna system prompt.
sequence lengths of 512 tokens (mainly for memory and speed)
70B Models
use alpha = 15
alpha is a tunable parameter
EVALUATION
AlpacaEval
AlpacaEval is an automatic model-based evaluation that compares Text-Davinci-003 generations to the model generations over 805 instructions with the Win Rate reported.
The 805 test prompts are scraped from Vicuna, Koala, Anthropic's hh-rlhf, and other sources
The Win Rate is the rate at which the model in question is preferred over Text-Davinci-003, as determined by a model evaluator (GPT-4)
(In this work) both GPT-4 and ChatGPT are used as evaluators. ChatGPT serves as a precursor test to decide which models to evaluate with GPT-4, due to the cost and API restrictions of GPT-4
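As described, the Win Rate is just the fraction of the 805 prompts on which the judge prefers the candidate model's generation over Text-Davinci-003's. A trivial sketch (judge decisions mocked as booleans; names are mine):

```python
def win_rate(judgments: list[bool]) -> float:
    """judgments[i] is True if the judge preferred the candidate model's
    output over Text-Davinci-003's on prompt i. Returns percent wins."""
    return 100.0 * sum(judgments) / len(judgments)
```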
Hugging Face OpenLLM Leaderboard
verbalized multiclass classification datasets ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), and TruthfulQA (Lin et al., 2022).
RESULTS
NEFTune Improves Text Quality
NEFTune Can Improve Chat Models
Table 2 (GPT-4) shows that chat models also improve
They benefit even though they have already been through RLHF
Effect on Capabilities
evaluate on the OpenLLM Leaderboard tasks, using the LM-Eval Harness (Gao et al., 2021) implementation of MMLU, ARC, HellaSwag, and TruthfulQA
No real gain on these capability benchmarks; scores are roughly maintained. Text generation quality goes up, but measured capabilities do not improve
NEFTune Works with QLoRA (QLoRA: Efficient Finetuning of Quantized LLMs, 2023)
NEFTune also helps improve QLoRA finetuning performance
A Qualitative Example
(Example generations shown in English and in Korean.)
ANALYSIS
We hypothesize that by adding noise to the embeddings at train time, the model overfits less to the specifics of the instruction-tuning dataset, such as formatting details, exact wording, and text length.
Longer, more verbose completions are preferred by both human and machine evaluators on most datasets (Dubois et al., 2023), but we find that increased verbosity is only the most visible side-effect of the reduced overfitting to the instruction distribution; increased verbosity alone cannot explain the measured gains in performance
The visible effect: completions get longer
OVERFITTING
examine the training loss of both models on the Alpaca dataset (both are evaluated without noise) and the “testing” loss on the Evol-Instruct dataset
Compared to standard training, the training loss is higher but the test loss is lower -> why was a different dataset used for testing? Perhaps there was no train/test split within each dataset?
On the other hand, generating with greedy decoding on the training prompts gives lower ROUGE-L and BLEU scores (numbers that would normally be read as worse performance) -> the point seems to be that the model overfits less
In contrast, NEFTune models overfit less without reduction in performance on the test set, and do not “lock-in” to the exact wording of the instruction data, as seen in the ROUGE-L metric.
Interpretation: no performance drop on the test set, while the model does not get locked into the training data
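ROUGE-L, used above to measure how closely greedy generations track the training responses, scores the longest common subsequence (LCS) between a generation and its reference. A self-contained sketch (token lists in, F1 out; this is the standard metric, not the paper's exact evaluation code):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: list[str], reference: list[str]) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

A high ROUGE-L against the training responses means the model is reproducing training wording nearly verbatim; the lower scores for NEFTune models support the reduced-overfitting reading.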
LENGTH VERSUS TOKEN DIVERSITY
The fixed length cutoffs were 50 for models trained on Alpaca, 100 for Evol-Instruct, 150 for ShareGPT, and 150 for OpenPlatypus.
We chose the chunk lengths so that at least half of the generations were longer than the cutoff, and sequences of insufficient length were dropped. The diversity scores we compute are a summary measure of 2-, 3-, and 4-gram repetition rates called log-diversity (from "On the Reliability of Watermarks for Large Language Models")
we see that NEFT models generate longer outputs than their counterparts. However, we also see that the 2-gram repetition rates as well as overall token log-diversity for models trained with and without NEFT are nearly identical, providing evidence that the longer responses do not come at the expense of repetition, and instead provide additional details
NEFT generates longer text, but its 2-gram repetition rate and log-diversity are nearly identical to the model trained without NEFT, so the longer generations add additional details rather than repeating themselves
Longer outputs do not by themselves guarantee a higher win rate
Gaussian noise induces even longer outputs, but does not come with improved performance
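The repetition metrics discussed above can be sketched as follows. The n-gram repetition rate is standard; the log-diversity aggregation shown here (negative log of the product of the 2-, 3-, and 4-gram repetition rates) is my assumption of a plausible form, not necessarily the watermark-reliability paper's exact formula.

```python
import math

def ngram_repetition_rate(tokens: list[str], n: int) -> float:
    """Fraction of n-grams in `tokens` that duplicate another n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def log_diversity(tokens: list[str], ns=(2, 3, 4), eps=1e-12) -> float:
    """Assumed aggregation: higher value = less repetitive text."""
    rep_product = 1.0
    for n in ns:
        rep_product *= max(ngram_repetition_rate(tokens, n), eps)
    return -math.log(rep_product)
```

Under this sketch, a highly repetitive generation gets a low score and a non-repetitive one a high score, which is the comparison the paper makes between NEFT and non-NEFT outputs.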
LENGTH IS (NOT) ALL YOU NEED
CONCLUSIONS AND LIMITATIONS
The success of NEFTune points to the often ignored importance of algorithms and regularizers for LLM training
Unlike the computer vision community, which has studied regularization and overfitting for years, the LLM community tends to use standardized training loops that are designed for optimizer stability and not generalization
Given the consistent gains of NEFTune, and the tendency to overfit on small instruction datasets, it seems that regularization deserves to be revisited in the LLM setting.
Our study has several limitations
adopt AlpacaEval as our central measure of instruction-following ability for LLMs, which is subject to the biases of a single judge (GPT-4).
Additionally, due to limited compute resources, we were not able to validate the success of NEFTune on larger 70B variants across multiple datasets,
and we had to rely on fixed hyper-parameters for most NEFTune runs rather than sweeping.
Finally, despite our empirical studies, we do not have a conclusive understanding of why NEFTune works.
Appendix
alpha appears to be set in the range 5-15
Takeaways