A methodology similar to Self-Instruct, but it differs in that the answers are real-world data, data curation is done on a 5-point scale, and the data is improved iteratively as the model improves
The modeling side takes some effort, since it uses a backward model plus several finetuned models
They also use system prompts and the like, but there seem to be side effects
I do like the part where quality is rated on a 5-point scale
Author
Xian Li, Ping Yu, Chunting Zhou (LIMA author), Timo Schick,
Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis
Meta AI
Abstract
present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions
Our approach, named instruction backtranslation,
starts with a language model finetuned on a small amount of seed data, and a given web corpus
The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation)
selecting high quality examples from among these candidates (self-curation)
This data is then used to finetune a stronger model
Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.
Introduction
Recent work highlights the importance of human-annotation data quality Zhou et al. [2023](Lima: Less is more for alignment), Köpf et al. [2023]. However, annotating instruction following datasets with such quality is hard to scale.
we instead leverage large amounts of unlabelled data to create a high quality instruction tuning dataset by developing an iterative self-training algorithm
Our approach, named instruction backtranslation, is inspired by the classic backtranslation method from machine translation
Our method starts with a seed instruction following model and a web corpus
The model is first used to self-augment its training set: for each web document, it creates an instruction following training example by predicting a prompt (instruction) that would be correctly answered by (a portion of) that document.
Directly training on such data (similarly to Köksal et al. [2023]) gives poor results in our experiments, both because of the mixed quality of human written web text, and noise in the generated instructions.
To remedy this, we show that the same seed model can be used to self-curate the set of newly created augmentation data by predicting their quality, and can then be self-trained on only the highest quality (instruction, output) pairs
The procedure is then iterated, using the improved model to better curate the instruction data, and re-training to produce a better model.
Our resulting model, Humpback, outperforms all other existing non-distilled models on the Alpaca leaderboard
Method
The unlabelled data is a large, diverse set of human-written documents which includes writing about all manner of topics humans are interested in – but crucially is not paired with instructions
Two key assumptions
A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions.
For some user instruction, a suitable document exists within a subset of this very large human-written text (isn't this assuming it per document, though?)
A second key assumption is that we can predict instructions for these candidate gold answers that can be used as high quality example pairs to train an instruction following model.
We can predict instructions for these candidate gold answers
In short, the two assumptions are: there will exist documents usable as answers to someone's instruction, and we can also predict what instruction such a document would answer!
Instruction backtranslation thus performs two core steps (a code sketch follows the list):
Self-augment: Generate instructions for unlabelled data, i.e. the web corpus, to produce candidate training data of (instruction, output) pairs for instruction tuning.
Self-curate: Self-select high quality demonstration examples as training data to finetune the base model to follow instructions. This approach is done iteratively where a better intermediate instruction-following model can improve on selecting data for finetuning in the next iteration.
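A minimal sketch of the whole loop under these two steps, assuming hypothetical `finetune`, `generate_instruction`, and `rate_quality` helpers that stand in for the paper's models (not their actual code):

```python
def instruction_backtranslation(seed_pairs, web_segments, iterations=2, k=5):
    # Backward model M_yx := p(x|y), trained on (output, instruction) seed pairs.
    backward_model = finetune([(y, x) for (x, y) in seed_pairs])

    # Self-augment: predict a candidate instruction x_hat for each unlabelled segment y.
    candidates = [(generate_instruction(backward_model, y), y) for y in web_segments]

    model = finetune(seed_pairs)  # M_0, finetuned on seed data only
    for t in range(1, iterations + 1):
        # Self-curate: keep only pairs the current model rates >= k on the 5-point scale.
        curated = [(x, y) for (x, y) in candidates if rate_quality(model, x, y) >= k]
        # Re-finetune on seed data plus the curated augmented data to obtain M_t.
        model = finetune(seed_pairs + curated)
    return model
```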
2.1 Initialization
Seed data
human-annotated (instruction, output) examples that will be used to fine-tune language models to give initial predictions in both directions: predicting an output given an instruction, and an instruction given an output.
So that the model can both generate an output for an instruction and generate an instruction for an output!
Q) The output is just a plain document, so how does the model generate an instruction from it?
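One way the same seed pairs could supervise both directions; the prompt templates here are my own assumption, not the paper's exact formatting.

```python
def forward_example(instruction: str, output: str) -> dict:
    # Forward model M_0 / M_t: given the instruction, learn to produce the output.
    return {"input": f"Instruction: {instruction}\nResponse:", "target": output}

def backward_example(instruction: str, output: str) -> dict:
    # Backward model M_yx: given the output (a document), learn to produce the instruction.
    return {"input": f"Text: {output}\nWrite an instruction this text would answer:",
            "target": instruction}
```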
Unlabelled data
use a web corpus as a source of unlabelled data.
For each document, we perform preprocessing to extract self-contained segments {yi}, which are portions of text following an HTML header.
We further run deduplication, length filtering, and remove potential low quality segments with several heuristics such as the proportion of capitalized letters in the header.
Basic deduplication and other preprocessing are applied
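A rough sketch of this kind of segment filtering; the thresholds below are illustrative assumptions, not the paper's actual values.

```python
import hashlib

def keep_segment(header: str, text: str, seen_hashes: set,
                 min_chars: int = 64, max_chars: int = 4096,
                 max_caps_ratio: float = 0.5) -> bool:
    # Deduplication on an exact hash of the segment text.
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    # Length filtering.
    if not (min_chars <= len(text) <= max_chars):
        return False
    # Heuristic: drop segments whose HTML header is mostly capitalized letters.
    letters = [c for c in header if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > max_caps_ratio:
        return False
    return True
```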
2.2 Self-Augmentation (generating instructions)
We finetune the base language model with (output, instruction) pairs {(y_i, x_i)} from the seed data to obtain a backward model M_yx := p(x|y)
Like question generation: the model is tuned to generate the instruction when given the output
For each unlabelled example y_i, we run inference on the backward model to generate a candidate instruction xˆ_i from which we derive the candidate augmented paired data A := {(xˆ_i , y_i )}.
The difference from Self-Instruct is probably just that the context is not generated but taken from the real-world unlabelled dataset?
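A sketch of what self-augmentation inference could look like with a Hugging Face-style backward model; the model path and prompt template are assumptions for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/backward-model")   # hypothetical path
backward = AutoModelForCausalLM.from_pretrained("path/to/backward-model")

def predict_instruction(segment: str, max_new_tokens: int = 64) -> str:
    prompt = f"{segment}\n\nWrite an instruction that the text above would answer:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = backward.generate(**inputs, max_new_tokens=max_new_tokens,
                            do_sample=True, top_p=0.9, temperature=0.7)
    # Decode only the newly generated tokens (the candidate instruction x_hat).
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# web_segments: the filtered segments {y_i} from the preprocessing step above.
# augmented = [(predict_instruction(y), y) for y in web_segments]   # A := {(x_hat_i, y_i)}
```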
2.3 Self-Curation (selecting high-quality examples)
We start with a seed instruction model M_0 finetuned on (instruction, output) seed examples only. We then use M_0 to score each augmented example {(xˆ_i, y_i)} to derive a quality score a_i.
This is done using prompting, instructing the trained model to rate the quality of a candidate pair on a 5-point scale.
Below is an instruction from a user and a candidate answer. Evaluate whether the answer is a good example of how an AI Assistant should respond to the user's instruction. Assign a score using the following 5-point scale.
1: The answer is incomplete, vague, off-topic, controversial, or not exactly what the user asked for. For example, some content seems missing, a numbered list does not start from the beginning, or the opening sentence repeats the user's question. Or it is written from another person's perspective with their personal experience, or looks like it is taken from a blog post or a forum. Or it contains promotional text, navigation text, or other irrelevant information.
2: The answer addresses most of the user's request, but instead of a direct, exact solution to the user's question it only provides a high-level methodology.
3: The answer is helpful but is not written by an AI Assistant. It covers all of the user's basic needs and is complete and self-contained, but it is written from another person's perspective rather than the AI Assistant's. The content looks like an excerpt from a blog post, a web page, or web search results, e.g. it contains personal experiences or opinions, mentions a comment section, or shares on social media.
4: The answer is written from the AI Assistant's perspective with a clear focus, and provides a complete, clear, and comprehensive response to the instruction. It helps with the user's question or instruction without missing or irrelevant information, is well organized, self-contained, and written in a helpful tone. There is slight room for improvement, e.g. being more concise and focused.
5: This is a perfect answer from an AI Assistant. It has a clear focus on being a helpful AI Assistant, and the response looks intentionally written to address the user's question or instruction without any irrelevant sentences. The answer provides high-quality content, demonstrates expert knowledge in the area, and is very well written, logical, easy to follow, engaging, and insightful.
First, briefly explain the reasoning used to derive the score, then write "Score: <rating>" on the last line.
<generated instruction>
<output>
We can then select a subset of the augmented examples with score a_i ≥ k to form a curated set A^(1)_k.
Scores are measured on the 5-point scale, and the subset meeting the threshold k (a_i ≥ k) is kept as the curated set A^(1)_k
k can take the values 1, 2, 3, 4, 5
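A sketch of this curation step: parse the "Score: <rating>" line from the rating output and keep only pairs with a_i ≥ k; `rate_with_prompt` is a hypothetical helper that runs the 5-point rating prompt above with the current model M_t.

```python
import re

def parse_score(rating_text: str):
    # The rating prompt asks for "Score: <rating>" on the last line.
    match = re.search(r"[Ss]core:\s*([1-5](?:\.\d+)?)", rating_text)
    return float(match.group(1)) if match else None

def curate(candidates, model, k=5):
    curated = []
    for instruction, output in candidates:
        score = parse_score(rate_with_prompt(model, instruction, output))
        if score is not None and score >= k:
            curated.append((instruction, output))
    return curated   # A^(t)_k
```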
Iterative self-curation
We further propose an iterative training method to produce higher quality predictions. On iteration t we use the curated augmentation data A^(t−1)_k from the previous iteration, along with the seed data as training data to finetune an improved model M_t
The t-th tuned model M_t is created from the data curated in iteration t−1
This model is then used to rescore the quality of the dataset, producing A^(t)_k
When combining both seed data and augmented data for finetuning, we use tagging to distinguish these two data sources
Specifically, we append an additional sentence to examples (called “system prompt"). We use S_a := “Answer in the style of an AI Assistant." for seed data, and S_w := “Answer with knowledge from web search." for augmented data.
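A sketch of that source tagging; where exactly the system prompt sentence is attached within the example is my assumption, not the paper's exact formatting.

```python
S_A = "Answer in the style of an AI Assistant."   # S_a, for seed data
S_W = "Answer with knowledge from web search."    # S_w, for augmented data

def tag_example(instruction: str, output: str, source: str) -> dict:
    system = S_A if source == "seed" else S_W
    # Append the system prompt sentence to the instruction side of the example.
    return {"instruction": f"{instruction}\n{system}", "output": output}
```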
Experiments
3.1 Experimental Setup
Seed data
Each example is an (instruction, output) pair {(xi, yi)}, chosen from the first turn of the conversation tree.
only sample English language responses that are high quality, based on their human annotated rank (rank 0)
Only the best of the samples were used
Base model & finetuning
use the pretrained LLaMA model [Touvron et al., 2023] with 7B, 33B and 65B parameters as the base models for finetuning
we only optimize the loss on the output tokens, not the input tokens, thus deviating from the standard language modeling loss.
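A sketch of this output-only loss, using the common Hugging Face convention where label -100 is ignored by the cross-entropy loss (my illustration, not the paper's code).

```python
def build_labels(input_token_ids: list, output_token_ids: list) -> dict:
    input_ids = input_token_ids + output_token_ids
    # No loss on the instruction/input tokens, loss only on the output tokens.
    labels = [-100] * len(input_token_ids) + output_token_ids
    return {"input_ids": input_ids, "labels": labels}
```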
use the same hyperparameters as existing supervised finetuning (SFT) methods (LIMA) [Zhou et al., 2023, Touvron et al., 2023] for most models: learning rate 1e-5 which linearly decays to 9e-6 at the end of training, weight decay 0.1, batch size 32 (examples) and dropout 0.1
For finetuning with less than 3000 examples we use batch size 8 (more details in Table 18). We refer to our trained Llama-based instruction backtranslation model as Humpback1. For generation, we use nucleus sampling Holtzman et al. [2019] with temperature T = 0.7, p = 0.9.
What's the difference between SFT and this finetuning..? Is it the self-augmentation part? The iterative self-curation part?
Unlabelled data
use the English portion of the Clueweb corpus as the source of unlabelled data [Overwijk et al., 2022]. Among those, we sampled 502k segments
Baselines
text-davinci-003
LIMA
1000 manually selected instruction examples from a mixture of community question & answering (e.g. StackOverflow, WikiHow, etc.) and human expert-written instruction and responses.
Guanaco
LLaMA models finetuned with 9000 examples from the OpenAssistant dataset. The difference from the 3200 seed examples used in this paper is that Guanaco includes (instruction, output) pairs from all turns while we only used the first-turn of the conversations.
Evaluation
evaluate on test prompts from several sources: Vicuna [Chiang et al., 2023] (80 prompts), Self-instruct [Zhang and Yang, 2023] (252 prompts), Open Assistant [Köpf et al., 2023] (188 prompts), Koala [Geng et al., 2023] (156 prompts), HH_RLHF [Bai et al., 2022a] (129 prompts), LIMA [Zhou et al., 2023] (300 prompts), crowdsourced from authors (64 prompts)
In total there are 1130 unique prompts, providing a good coverage on a variety of task categories, e.g. writing, coding, mathematical reasoning, information seeking, advice, roleplay, safety, etc.
We sample 250 prompts from them excluding those in the AlpacaEval test set as a dev set and another 250 prompts to perform generation quality evaluation. We ran both automatic evaluation using AlpacaEval [Li et al., 2023], which computes the win rate against baseline models based on GPT-4 judgements, as well as human preference evaluation.
Evaluation uses AlpacaEval, which relies on GPT-4
3.2 Seed and Augmentation Data Statistics
Data statistics
- We can see that augmented data tends to have longer outputs compared to the seed data, and self-curated higher quality training data (A^(2)_4 and A^(2)_5) has both shorter instructions and outputs among all augmented data, closer to the length of the original seed instruction data.
- As curation improves, the augmented data gets closer in length to the seed data?
- **Generated Instructions**
- Furthermore, the augmented data increases the task diversity especially in the long tail.
3.3 Scaling Analysis
Data quality vs. data quantity
compared finetuning on augmented data of different quality. Specifically, we compared finetuning on augmented data without quality-based selection (w/o curation), and on the self-selected data in the A^(2)_4 (score ≥ 4) and A^(2)_5 (score ≥ 4.5) categories
Data scaling efficiency
compare the performance of various instruction-following models as we alter the amount of instruction following finetune data they use. We measure the win rate of each model against text-davinci-003 when finetuning 7B LLaMa with the given finetune dataset.
Performance comparison by amount of data
an estimate of this efficiency using the data scaling coefficient α, which is calculated by fitting empirical data with w = α log N + C, where w is the win rate measuring generation quality of the model finetuned on N examples.
The backtranslation data is compared against other instruction datasets
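A sketch of estimating the scaling coefficient α by a least-squares fit of w = α log N + C; the (N, w) points below are made up for illustration.

```python
import numpy as np

N = np.array([100, 800, 3200, 12800, 51200])   # number of finetuning examples (illustrative)
w = np.array([0.30, 0.42, 0.51, 0.58, 0.63])   # win rate vs. text-davinci-003 (illustrative)

alpha, C = np.polyfit(np.log(N), w, deg=1)     # slope alpha measures data scaling efficiency
print(f"alpha = {alpha:.3f}, C = {C:.3f}")
```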
Jointly scaling of data and model
Model scale and data scale are examined jointly; from around 40k examples it starts to saturate a bit
The data here is A_5
3.4 Generation Quality
AlpacaEval
use the automatic evaluation (using GPT-4) from AlpacaEval to evaluate generation quality on 805 prompts from the Alpaca Leaderboard. AlpacaEval compares the pairwise win rate against the reference model text-davinci-003
Non-distilled: LLaMa models trained without relying on any external model (e.g. ChatGPT, GPT-4, etc.) for any form of supervision.
Distilled: models trained with a more powerful external model in the loop, e.g. using data distilled from an external model.
Proprietary: models trained with proprietary data and techniques.
Human Evaluation
comparing our method to a given baseline model, and ask the human evaluator to choose from three options:
output from the first model is significantly better than the second model;
output from the second model is significantly better than the first model;
there is no significant difference between the two outputs.
We randomize the order the models are presented in to avoid position bias.
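A tiny sketch of that randomization (my illustration): shuffle which model's output is shown first and keep the key aside to de-randomize the judgements later.

```python
import random

def make_comparison(prompt: str, output_a: str, output_b: str) -> dict:
    pair = [("model_A", output_a), ("model_B", output_b)]
    random.shuffle(pair)   # hide which system produced which output
    (first_id, first), (second_id, second) = pair
    return {"prompt": prompt, "first": first, "second": second,
            "order": (first_id, second_id)}
```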
3.5 NLP Benchmarks
3.6 Ablations
3.6.1 Data selection quality
self-curation performance is improved in the second iteration (using M1 vs. M0) in terms of selecting high quality data (Precision/Recall). Further, this also corresponds to better instruction following when finetuning on the selected data, as shown by the Win Rate. A key observation is that although the intermediate models do not have very high precision, training on the selected data still improves instruction following. This helps explain the effectiveness of our method.
It does improve after running iterations
3.6.2 Joint training
Curation is needed; without it the result is worse than seed-only, and joint training on both makes it even better
System prompts
We found adding system prompts to distinguish augmented data from seed data is helpful
Interestingly, using a combined system prompt {Sa, Sw} at inference time, which concatenates the one for the seed data with the one for augmented data, is better than either no system prompt or using the seed data prompt, despite that the concatenation was not seen during training.
The concatenation was never used during training, but inference with the concatenated prompts was better; this seems to have worked somewhat like prefix tuning..
3.7 Further Analysis
Improving the seed data helps
Limitations
Bias and similar issues can arise since the source is a web corpus
Safety is similar; they compared responses using different system prompts and found that using the seed data's system prompt S_a tends to yield safer responses
Conclusion
proposed a scalable approach to finetune large language models to follow instructions. Our method leverages large amounts of unlabeled data by developing an iterative self-training algorithm that we dub instruction backtranslation
On the Alpaca leaderboard, our finetuned models outperform all other non-distilled instruction-following models, while using fewer human annotated examples
Future work should scale this method further by considering larger unlabeled corpora, which our analysis suggests should yield further gains