Question about dataset requirements for fine-tuning #57

Open

davidho27941 opened this issue Mar 26, 2024 · 1 comment

davidho27941 commented Mar 26, 2024

Hi

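I'd like to ask whether the data needs to be put into a particular format before fine-tuning. I've seen several different approaches online, for example: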

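1. Store the data in alpaca format as JSONL and pass it directly to SFTTrainer as the dataset.
2. Record each conversation in the form <s>[INST] {instruction} [/INST] {response} </s> and pass it directly to SFTTrainer.
3. Run the dataset from (1) or (2) through the tokenizer to obtain attention_mask and input_ids, and only then pass it to SFTTrainer.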
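To make the difference between approaches (1) and (2) concrete, here is a minimal sketch that reads one alpaca-style JSONL record and renders it into the [INST] template quoted above. The field names (instruction, input, output) follow the usual alpaca convention; the helper name alpaca_to_inst and the way the optional input field is appended to the instruction are assumptions made for illustration, not something specified in this thread.

```python
import json

# Template copied from approach (2) in the question.
TEMPLATE = "<s>[INST] {instruction} [/INST] {response} </s>"


def alpaca_to_inst(record):
    """Render one alpaca-style record as a single training string.

    Hypothetical helper: joining `input` to `instruction` with a blank
    line is an assumption, not a fixed rule of either format.
    """
    instruction = record["instruction"]
    extra = record.get("input", "")
    if extra:
        instruction = f"{instruction}\n\n{extra}"
    return TEMPLATE.format(instruction=instruction, response=record["output"])


# One line of a JSONL file, as in approach (1).
jsonl_line = json.dumps(
    {"instruction": "Translate to English.", "input": "你好", "output": "Hello"},
    ensure_ascii=False,
)
record = json.loads(jsonl_line)
text = alpaca_to_inst(record)
print(text)
```

Seen this way, (1) and (2) carry the same information; they differ in whether the prompt template is applied before or during training.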

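Which of these approaches is preferable? I'm also curious whether attention_mask is necessary during fine-tuning. Hugging Face's SFTTrainer currently has no parameter for specifying the name of the mask field, so I'm not sure whether it would be used if I provided it, or whether this information is required at all.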
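For context on what attention_mask represents: in general it marks which positions in a padded batch are real tokens (1) and which are padding (0), and Hugging Face tokenizers return it alongside input_ids. The sketch below reproduces that idea in plain Python with toy token ids; pad_batch and pad_id=0 are illustrative assumptions, and it does not show how SFTTrainer handles the field internally.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad toy token-id lists to a common length and build the mask.

    Hypothetical helper for illustration; a real tokenizer does this when
    padding is enabled.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8, 9]])
print(batch)
```

The mask only carries information when sequences in a batch have different lengths and are padded, which is one reason the answer can depend on the packing and batching choices made elsewhere.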

Thank you for taking the time to read this; any advice would be appreciated.

Collaborator

adamlin120 commented May 16, 2024

If you are writing your own training script, I'd suggest just going with option 1. It's simple and effective.

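This question can be answered at different levels of depth. It depends on whether you want to 1) train on the user input, 2) use flash attention, 3) use packing, and so on. So I'd suggest getting familiar with axolotl directly, haha; it prepares these model inputs for you.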
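Two of the choices mentioned above, training on the user input and packing, can be sketched with toy token ids. This is a simplified illustration, not axolotl's or SFTTrainer's implementation: build_labels and pack are hypothetical helpers, and the packing shown here simply concatenates examples and drops the leftover tail. The value -100 is the default ignore_index of PyTorch's CrossEntropyLoss, which is why it is commonly used to exclude prompt tokens from the loss.

```python
IGNORE_INDEX = -100  # skipped by PyTorch's CrossEntropyLoss by default


def build_labels(prompt_ids, response_ids, train_on_input=False):
    """Return (input_ids, labels) for one example.

    With train_on_input=False, prompt positions get IGNORE_INDEX so only
    the response tokens contribute to the loss.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    if train_on_input:
        labels = list(input_ids)
    else:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def pack(examples, block_size):
    """Concatenate tokenized examples and cut fixed-size blocks.

    Simplification: the final partial block is discarded.
    """
    stream = [tok for example in examples for tok in example]
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


ids, labels = build_labels([1, 2, 3], [4, 5])
blocks = pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(ids, labels)
print(blocks)
```

With packing, every position in a block is a real token, so padding largely disappears; with per-example batches, padding and the attention mask come back into play.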
