Replies: 1 comment · 2 replies
-
Hey, chunking was recently added in the past few months.
-
I'm a noob who recently upgraded from training LoRAs in oobabooga's text-generation-webui. I've mostly migrated my pipeline to axolotl now and I'm getting trained models out, but they seem very off and don't really respond the way I'd expect based on my older models.
Now, in ooba there are parameters you can set so that it takes an arbitrarily long text file and auto-generates a bunch of N-length chunks that overlap by M tokens. I don't see anything like that in the axolotl config, so I'm wondering if that's related to the issues I'm noticing.
I just can't really find any information whatsoever about how axolotl is chunking the data. Is it truncating it? Blindly splitting it by sequence length? Something else?
To be clear, I'm using a local completion dataset in my yml, and the dataset consists of about 100 txt files ranging from 4,000 to 200,000 tokens each. What is the correct way to handle this?
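For reference, here's roughly what I mean by the ooba-style chunking, as a minimal sketch. The function name and parameters are mine for illustration, not the actual API of either tool:

```python
# Sketch of overlapping sliding-window chunking: split a long token
# sequence into windows of chunk_len tokens, each overlapping the
# previous window by `overlap` tokens. Names are illustrative only.

def chunk_tokens(tokens, chunk_len, overlap):
    """Yield chunks of chunk_len tokens, consecutive chunks sharing `overlap` tokens."""
    if overlap >= chunk_len:
        raise ValueError("overlap must be smaller than chunk_len")
    step = chunk_len - overlap  # how far the window advances each time
    for start in range(0, len(tokens), step):
        yield tokens[start:start + chunk_len]
        if start + chunk_len >= len(tokens):
            break  # last window already covered the end of the sequence

# Example: 10 tokens, windows of 4 overlapping by 2 (so the window slides by 2)
chunks = list(chunk_tokens(list(range(10)), chunk_len=4, overlap=2))
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap is what keeps context from being severed at every chunk boundary, which is why I'm asking whether axolotl does anything equivalent or just hard-cuts at sequence_len.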