Neel Jain1∗, Ping-yeh Chiang1∗, Yuxin Wen1∗, John Kirchenbauer1, Hong-Min Chu1, Gowthami Somepalli1, Brian R. Bartoldson2, Bhavya Kailkhura2, Avi Schwarzschild1, Aniruddha Saha1, Micah Goldblum3, Jonas Geiping1, Tom Goldstein1
1 University of Maryland, 2 Lawrence Livermore National Laboratory, 3 New York University
Abstract
NEFTune adds noise to the embedding vectors during training
Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings
Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement
Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune
Introduction
propose to add random noise to the embedding vectors of the training data during the forward pass of fine-tuning
this simple trick can improve the outcome of instruction fine-tuning, often by a large margin, with no additional compute or data overhead
Noisy Embedding Instruction Fine Tuning (NEFTune)
When a raw LLM like LLaMA-2-7B is finetuned with noisy embeddings, its performance on AlpacaEval improves from 29.8% to 64.7% (Figure 1)
NEFTUNE: NOISY EMBEDDING INSTRUCTION FINETUNING
[1] Each step of NEFTune begins by sampling an instruction from the dataset, and converting its tokens to embedding vectors
[2] NEFTune then departs from standard training by adding a random noise vector to the embeddings
The noise is generated by sampling iid uniform entries, each in the range [−1, 1]
then scaling the entire noise vector by a factor of α/√Ld, where L is the sequence length, d is the embedding dimension, and α is a tunable parameter
The scaling rule is taken from the adversarial ML literature (FreeLB: Enhanced Adversarial Training for Natural Language Understanding)
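The noise step described above can be sketched in a few lines. This is a minimal NumPy illustration of the paper's description (iid uniform entries in [−1, 1], scaled by α/√(Ld)), not the authors' implementation; the function and argument names are mine.

```python
import numpy as np

def add_neftune_noise(embeddings: np.ndarray, alpha: float, rng=None) -> np.ndarray:
    """Add NEFTune-style noise to one sequence of token embeddings.

    embeddings: array of shape (L, d) - sequence length L, embedding dim d.
    alpha: the tunable noise scale from the paper (e.g. 5, 10, or 15).
    """
    rng = rng if rng is not None else np.random.default_rng()
    L, d = embeddings.shape
    # iid uniform entries in [-1, 1], scaled by alpha / sqrt(L * d)
    noise = rng.uniform(-1.0, 1.0, size=(L, d))
    return embeddings + (alpha / np.sqrt(L * d)) * noise
```

In actual finetuning this would be applied to the output of the embedding layer during the forward pass, at training time only.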
EXPERIMENTAL SET-UP
MODELS
conduct the majority of our experiments using 7B parameter LLMs.
LLaMA-1 (tokens: 1T), LLaMA-2 (tokens: 2T), and OPT-6.7B (tokens: 180B)
Additionally, we improve RLHF models by finetuning the highly refined LLaMA-2-Chat (7B) model.
INSTRUCTION FINETUNING DATASETS
Alpaca (Taori et al., 2023): was constructed using the Self-Instruct
Evol-Instruct (Xu et al., 2023): contains 70k single-turn instructions (more complex than Alpaca)
Open-Platypus (Lee et al., 2023): a dataset amalgamated from 11 open-source datasets, curated specifically to improve LLM performance in STEM and logical domains
contains 25k questions, of which ≈10% are LLM-generated and the remainder human-written.
ShareGPT (Chiang et al., 2023): is a dataset of 70K voluntarily-shared ChatGPT conversations (ShareGPT, 2023)
Although ShareGPT is multi-turn, we use the dataset version from Vicuna-v1.1 and split the multi-turn conversations closer to a single-turn format
System prompt & HyperParams
System prompt
Additionally, we finetune all models with the Alpaca system prompt, except for ShareGPT, where we use the Vicuna system prompt.
sequence lengths of 512 tokens (mainly for memory and speed)
70B Models
use alpha = 15
alpha is a tunable parameter
EVALUATION
AlpacaEval
AlpacaEval is an automatic model-based evaluation that compares Text-Davinci-003 generations to the model generations over 805 instructions with the Win Rate reported.
The 805 test prompts are scraped from Vicuna, Koala, Anthropic's hh-rlhf, and other sources
The Win Rate is the rate at which the model in question is preferred over Text-Davinci-003, as determined by a model evaluator (GPT-4)
(In this work) both GPT-4 and ChatGPT are used as evaluators. ChatGPT serves as a precursor test to decide which models to evaluate with GPT-4, due to the cost and API restrictions of GPT-4
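As described, the Win Rate is just the fraction of the 805 prompts on which the judge prefers the candidate model's generation over Text-Davinci-003's. A trivial sketch (judge decisions mocked as booleans; names are mine):

```python
def win_rate(judgments: list[bool]) -> float:
    """judgments[i] is True if the judge preferred the candidate model's
    output over Text-Davinci-003's on prompt i. Returns percent wins."""
    return 100.0 * sum(judgments) / len(judgments)
```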
Hugging Face OpenLLM Leaderboard
verbalized multiclass classification datasets ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), and TruthfulQA (Lin et al., 2022).
RESULTS
NEFTune Improves Text Quality
NEFTune Can Improve Chat Models
Table 2 (GPT-4) shows that chat models also improve
They benefit even though they have already been through RLHF
Effect on Capabilities
evaluate on the OpenLLM Leaderboard tasks, using the LM-Eval Harness (Gao et al., 2021) implementation of MMLU, ARC, HellaSwag, and TruthfulQA
No real gain on these capability benchmarks; scores are roughly maintained. Text generation quality goes up, but measured capabilities do not improve
NEFTune Works with QLoRA (QLoRA: Efficient Finetuning of Quantized LLMs, 2023)
NEFTune also helps improve QLoRA finetuning performance
A Qualitative Example
(Example generations shown in English and in Korean.)
ANALYSIS
We hypothesize that by adding noise to the embeddings at train time, the model overfits less to the specifics of the instruction-tuning dataset, such as formatting details, exact wording, and text length.
Longer, more verbose completions are preferred by both human and machine evaluators on most datasets (Dubois et al., 2023), but we find that increased verbosity is only the most visible side-effect of the reduced overfitting to the instruction distribution; increased verbosity alone cannot explain the measured gains in performance
The visible effect: completions get longer
OVERFITTING
examine the training loss of both models on the Alpaca dataset (both are evaluated without noise) and the “testing” loss on the Evol-Instruct dataset
Compared to standard training, the training loss is higher but the test loss is lower -> why was a different dataset used for testing? Perhaps there was no train/test split within each dataset?
On the other hand, generating with greedy decoding on the training prompts gives lower ROUGE-L and BLEU scores (numbers that would normally be read as worse performance) -> the point seems to be that the model overfits less
In contrast, NEFTune models overfit less without reduction in performance on the test set, and do not “lock-in” to the exact wording of the instruction data, as seen in the ROUGE-L metric.
Interpretation: no performance drop on the test set, while the model does not get locked into the training data
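ROUGE-L, used above to measure how closely greedy generations track the training responses, scores the longest common subsequence (LCS) between a generation and its reference. A self-contained sketch (token lists in, F1 out; this is the standard metric, not the paper's exact evaluation code):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: list[str], reference: list[str]) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

A high ROUGE-L against the training responses means the model is reproducing training wording nearly verbatim; the lower scores for NEFTune models support the reduced-overfitting reading.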
LENGTH VERSUS TOKEN DIVERSITY
The fixed length cutoffs were 50 for models trained on Alpaca, 100 for Evol-Instruct, 150 for ShareGPT, and 150 for OpenPlatypus.
We chose the chunk lengths so that at least half of the generations were longer than the cutoff, and sequences of insufficient length were dropped. The diversity scores we compute are a summary measure of 2-, 3-, and 4-gram repetition rates called log-diversity (from "On the Reliability of Watermarks for Large Language Models")
we see that NEFT models generate longer outputs than their counterparts. However, we also see that the 2-gram repetition rates as well as overall token log-diversity for models trained with and without NEFT are nearly identical, providing evidence that the longer responses do not come at the expense of repetition, and instead provide additional details
NEFT generates longer text, but its 2-gram repetition rate and log-diversity are nearly identical to the model trained without NEFT, so the longer generations add additional details rather than repeating themselves
Longer outputs do not by themselves guarantee a higher win rate
Gaussian noise induces even longer outputs, but does not come with improved performance
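The repetition metrics discussed above can be sketched as follows. The n-gram repetition rate is standard; the log-diversity aggregation shown here (negative log of the product of the 2-, 3-, and 4-gram repetition rates) is my assumption of a plausible form, not necessarily the watermark-reliability paper's exact formula.

```python
import math

def ngram_repetition_rate(tokens: list[str], n: int) -> float:
    """Fraction of n-grams in `tokens` that duplicate another n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def log_diversity(tokens: list[str], ns=(2, 3, 4), eps=1e-12) -> float:
    """Assumed aggregation: higher value = less repetitive text."""
    rep_product = 1.0
    for n in ns:
        rep_product *= max(ngram_repetition_rate(tokens, n), eps)
    return -math.log(rep_product)
```

Under this sketch, a highly repetitive generation gets a low score and a non-repetitive one a high score, which is the comparison the paper makes between NEFT and non-NEFT outputs.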
LENGTH IS (NOT) ALL YOU NEED
CONCLUSIONS AND LIMITATIONS
The success of NEFTune points to the often ignored importance of algorithms and regularizers for LLM training
Unlike the computer vision community, which has studied regularization and overfitting for years, the LLM community tends to use standardized training loops that are designed for optimizer stability and not generalization
Given the consistent gains of NEFTune, and the tendency to overfit on small instruction datasets, it seems that regularization deserves to be revisited in the LLM setting.
Our study has several limitations
adopt AlpacaEval as our central measure of instruction-following ability for LLMs, which is subject to the biases of a single judge (GPT-4).
Additionally, due to limited compute resources, we were not able to validate the success of NEFTune on larger 70B variants across multiple datasets,
and we had to rely on fixed hyper-parameters for most NEFTune runs rather than sweeping.
Finally, despite our empirical studies, we do not have a conclusive understanding of why NEFTune works.
Appendix
alpha appears to be set in the range 5-15
Takeaways