Not able to understand how my data should be formatted #1949
-
I am fine-tuning the Llama-3-70B model using a dataset in the alpaca format: https://huggingface.co/datasets/KamalConvai/nsfw_0 When I preprocess the dataset using the command that axolotl provides, it shows:
You can see the line `Saving the dataset (1/1 shards): : 0 examples [00:00, ? examples/s]` — the examples are not being read. You can check my dataset in the Hugging Face repo; it follows the alpaca format exactly. I even ran the same code with https://huggingface.co/datasets/tatsu-lab/alpaca and it works fine. What am I doing wrong?
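As a sanity check, records in the standard alpaca format (as in tatsu-lab/alpaca) are objects with `instruction`, `input`, and `output` fields. A quick sketch to verify a dataset against that schema (the sample record below is made up for illustration, not taken from the dataset above):

```python
# The standard alpaca schema: each record has these three keys.
ALPACA_KEYS = {"instruction", "input", "output"}

def check_alpaca_records(records):
    """Return indices of records that do not match the alpaca schema."""
    bad = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or not ALPACA_KEYS.issubset(rec):
            bad.append(i)
    return bad

# Hypothetical example record, not from the actual dataset:
sample = [
    {"instruction": "Summarize the text.",
     "input": "Some text.",
     "output": "A summary."},
]
print(check_alpaca_records(sample))  # an empty list means every record matches
```

If this reports no bad records but preprocessing still yields 0 examples, the problem is likely elsewhere in the config rather than in the data layout.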
-
Hello @KamalUtla, could you provide more info, such as what your axolotl config looks like? Could you perhaps have used a low sequence length, causing the examples to be dropped?
Thanks for the config @KamalUtla. I took a quick run with it and found that since your sequences were longer than 1024 (your `sequence_len`), they got dropped at the `Dropping Long Sequences` stage. My recommendation would be to increase `sequence_len` to a higher value.
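For reference, `sequence_len` is set in the axolotl YAML config. A minimal sketch of the relevant fragment (the value 4096 is illustrative; the right value depends on your data and GPU memory, and the rest of the config is omitted):

```yaml
# Fragment of an axolotl config (illustrative, not a complete config)
base_model: meta-llama/Meta-Llama-3-70B
sequence_len: 4096   # raised from 1024 so longer examples are not dropped
```

After raising it, rerun preprocessing and check that the saved example count is no longer 0.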