Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
When using the chat_template dataset type with a dataset in sharegpt format, human turns should be masked (label -100) and gpt turns should NOT be masked when roles_to_train: ["gpt"] is set.
How it should be (turns broken out one per line for readability):
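A toy sketch of the rule the labels should follow (made-up token ids; an illustration, not axolotl's implementation):

# Toy sketch: tokens of trained roles keep their ids, everything else is -100.
def expected_labels(turn_token_ids, role, roles_to_train=("gpt",)):
    if role in roles_to_train:
        return list(turn_token_ids)
    return [-100] * len(turn_token_ids)

print(expected_labels([306, 626, 263, 1404], "human"))  # [-100, -100, -100, -100]
print(expected_labels([22172, 29991, 2], "gpt"))        # [22172, 29991, 2]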
Current behaviour
Turns are chaotically masked out mid-conversation, or the whole conversation is masked out, depending on which tokenizer is used.
How it is - LlamaTokenizer: masking goes wrong from turn 6 onward. Also note the extra space tokens (-100, 29473).

How it is - LlamaTokenizerFast/AutoTokenizer: the whole thing but the very last </s> is masked out!

Steps to reproduce
Use the following dataset (it has TWO samples because load_datasets in the cli is off by one and won't start with just one sample):
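The exact file isn't reproduced here, but a stand-in with the same shape can be generated like this (conversation contents are illustrative placeholders, not the data from this report):

# Writes a minimal two-sample sharegpt-format dataset (placeholder content).
import json

sample = {
    "conversations": [
        {"from": "human", "value": "What is 2 + 2?"},
        {"from": "gpt", "value": "4."},
        {"from": "human", "value": "And 3 + 3?"},
        {"from": "gpt", "value": "6."},
    ]
}

with open("data.jsonl", "w") as f:
    for _ in range(2):  # two samples, per the off-by-one note above
        f.write(json.dumps(sample) + "\n")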
Run python3 -m axolotl.cli.preprocess config.yaml --debug
Check out turn 6: it should not be masked (it's a gpt turn), yet it is. There are also extra space tokens when using the slow tokenizer.
Config yaml
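A minimal config along these lines reproduces the setup (key names follow axolotl's chat_template docs as best I recall; the model and paths are placeholders, not the exact values from this report):

base_model: mistralai/Mistral-7B-v0.1  # placeholder; the bug tracks the tokenizer, not the model
chat_template: chatml                  # also reproduces with mistral_v2v3
datasets:
  - path: data.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    roles_to_train: ["gpt"]
dataset_prepared_path: ./prepared_data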
Possible solution
I have no possible solution, but I have some debugging tips.
If you modify src/axolotl/prompt_strategies/chat_template.py and change LOG.setLevel(logging.INFO) to LOG.setLevel(logging.DEBUG), you can see the decisions the chat_template logic is making.
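For reference, the change sits near the top of the file and looks roughly like this (the logger name is recalled from memory, so treat it as a sketch):

import logging

LOG = logging.getLogger("axolotl")
LOG.setLevel(logging.DEBUG)  # was logging.INFO; surfaces per-turn masking decisions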
Depending on which tokenizer is used, you'll get two very different outputs. Here's output from LlamaTokenizer:
Debug log
Note how at turn 6 it has Turn indices: start=-1, end=-1. After that turn, every single operation is wrong.

Here is another log from when LlamaTokenizerFast/AutoTokenizer is used:
Debug log

[tokenize_prompt:276] Processing turn 0: role=human, content=MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: False
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:276] Processing turn 1: role=gpt, content=NOT MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: True
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:328] EOS token set for training at index -1
[tokenize_prompt:276] Processing turn 2: role=human, content=MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: False
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:276] Processing turn 3: role=gpt, content=NOT MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: True
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:328] EOS token set for training at index -1
[tokenize_prompt:276] Processing turn 4: role=human, content=MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: False
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:276] Processing turn 5: role=gpt, content=NOT MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: True
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:328] EOS token set for training at index -1
[tokenize_prompt:276] Processing turn 6: role=human, content=MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: False
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:276] Processing turn 7: role=gpt, content=NOT MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: True
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:328] EOS token set for training at index -1
[tokenize_prompt:276] Processing turn 8: role=human, content=MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: False
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:276] Processing turn 9: role=gpt, content=NOT MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: True
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:328] EOS token set for training at index -1
[tokenize_prompt:276] Processing turn 10: role=human, content=MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: False
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:276] Processing turn 11: role=gpt, content=NOT MASKED, train_turn=None, train_detail=None
[tokenize_prompt:290] Should train: True
[tokenize_prompt:296] Turn indices: start=-1, end=-1
[tokenize_prompt:328] EOS token set for training at index -1
[tokenize_prompt:339] Final labels: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 2]
Here, it couldn't find a single turn, and masked everything out but the final token.
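You don't need the debug log to see this; counting unmasked labels in the preprocessed output shows the same thing. A quick spot check, assuming dataset_prepared_path points at ./prepared_data and the saved object is a plain datasets.Dataset (both are assumptions, adjust for your setup):

# Hypothetical spot check: how many label positions are actually trained on?
from datasets import load_from_disk

ds = load_from_disk("./prepared_data")  # assumption: your dataset_prepared_path
labels = ds[0]["labels"]
trained = sum(1 for label in labels if label != -100)
print(f"{trained} of {len(labels)} labels are trained on")
# With the Fast tokenizer output above, only the final </s> is counted.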
I verified that this happens regardless of whether chat_template: mistral_v2v3 or chat_template: chatml is set. It also happens with the tokenizer's default template string.

Which Operating Systems are you using?
Python Version
Python 3.11.10
axolotl branch-commit
d356740
Acknowledgements