Improved training raw string chunking logic #3476
base: main
Conversation
Can you give one example of chunks generated before and after this change for the same dataset? As a simple sanity check.
Sure, here is an example: an input file and then two train_dataset_sample outputs, one from the current main branch and the other with the PR applied. I've run these with the default settings, except for setting a 30-character small-chunks filter (which I don't think did anything). The issues being fixed are most pronounced on poetry, but would be the same on prose. Looking at these again, it looks like the main branch version is also dropping the last line before each hard cut. train_dataset_sample_main_branch.txt
Thanks, that was helpful. The changes overall seem to be in the right direction. One thing I have noticed is that, keeping the default settings and using a raw text file as input, the chunks consistently start with a newline.
I think a lot of that is because the source file has four `\n` characters at many of the hard splits, while the hard split is defined as three. It might make sense to trim these after the split command, along the lines of the sketch below.
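For illustration only, a minimal sketch of that trim (this is not the PR's actual code; the `HARD_CUT` constant and `hard_split` helper are hypothetical names):

```python
# Hedged sketch of the suggested trim: split the raw text on a
# three-newline hard cut, then strip any surplus newlines (a fourth
# or fifth \n) left at the edges of each block.
HARD_CUT = "\n\n\n"

def hard_split(raw_text: str) -> list[str]:
    blocks = raw_text.split(HARD_CUT)
    # strip("\n") removes the extra newlines left over when the source
    # contained four or more consecutive \n at a cut point
    return [b.strip("\n") for b in blocks if b.strip("\n")]
```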
This happened on a dataset of my own, which was a random arXiv paper that I copied and pasted. It doesn't have any \n\n\n\n. I noted that this behavior stopped happening after setting "Prefer Newline Cut Length" to 0 in this PR's branch, so I don't know if those newlines at the beginning of each chunk are intended or not when this parameter is set.
…t-generation-webui into raw_string_processing (merge commit; conflicts: modules/training.py)
Yeah, you are right: it looks like it was struggling with the fact that encode('\n') actually returns three tokens. I have updated it now to deal with the newline tokens being a list, and cuts are looking a lot cleaner on newlines. I also added logic to strip newlines off the start and end of a hard split, if they are present. A rough sketch of the token-level idea is below.
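A hedged sketch of what cutting on a multi-token newline can look like (assuming a Hugging Face-style tokenizer; the helper names are hypothetical, and this is not the PR's actual implementation):

```python
# Hedged sketch: some tokenizers return more than one token for "\n"
# (special tokens plus the newline piece), so a clean cut has to match
# the whole token sequence rather than a single ID.
def newline_token_ids(tokenizer) -> list[int]:
    ids = tokenizer.encode("\n")
    # Drop any special tokens (e.g. BOS) the tokenizer adds around the text.
    return [i for i in ids if i not in tokenizer.all_special_ids]

def last_newline_cut(ids: list[int], nl: list[int], max_len: int) -> int:
    """Index just past the last full newline sequence within max_len tokens,
    falling back to a hard cut at max_len if no newline is found."""
    n = len(nl)
    for i in range(min(max_len, len(ids)) - n, -1, -1):
        if ids[i:i + n] == nl:
            return i + n
    return max_len
```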
Updated example training dataset
It would be great if someone could also check the sentence chunking from my PR in comparison.
I have been trying to understand the chunking method better (I didn't write it, it was contributed in a PR), and I feel like the current training implementation is more complex than it should be. More importantly, it seems to use parameters that are not used anywhere else:
For raw text files, I would prefer to simply encode the entire file, making sure to add a BOS at the beginning and an EOS at the very end, and then do a simple loop over the encoded IDs, taking fixed-length blocks. A minimal sketch of that scheme follows.
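For concreteness, a minimal sketch of that simpler scheme (assuming a Hugging Face tokenizer; the function name and `cutoff_len` parameter here are illustrative, not actual project code):

```python
# Hedged sketch: encode the whole file once, bracket it with BOS/EOS,
# then slice the token IDs into fixed-length blocks of cutoff_len.
def simple_chunks(tokenizer, raw_text: str, cutoff_len: int) -> list[list[int]]:
    ids = tokenizer.encode(raw_text, add_special_tokens=False)
    ids = [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id]
    return [ids[i:i + cutoff_len] for i in range(0, len(ids), cutoff_len)]
```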
There may be a place for simplified training, but there is definitely some need to control how a raw text string is converted into blocks for training purposes. I claim no great expertise either, but I would think it risks confusing the model if it is trained on too many samples that consist of the end of one story and the start of another one. We definitely got noticeably worse results training with datasets that contained a lot of padding tokens.

I believe that overlap and hard cuts are necessary to produce sensibly structured blocks (a toy sketch of overlap follows below); you could argue about the other parameters, I guess, but they definitely allow tidier string sections to be created. Honestly, there are more features that would be useful in raw dataset creation, like the sentence-based cutoffs that @FartyPants wrote, multi-file handling, sample shuffling, and the ability to pack small sections together to avoid padding. Maybe do these basic tidy-up changes now, and a longer-term plan could be to make dataset preparation available as an extension point?
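For readers unfamiliar with the term, a toy sketch of what overlap means in this context (parameter names are hypothetical, not the project's actual ones):

```python
# Hedged sketch of overlap: each block starts (cutoff_len - overlap_len)
# tokens after the previous one, so consecutive blocks share overlap_len
# tokens and no context is lost at block boundaries.
def overlapping_chunks(ids: list[int], cutoff_len: int, overlap_len: int) -> list[list[int]]:
    step = cutoff_len - overlap_len
    assert step > 0, "overlap must be smaller than the block length"
    return [ids[i:i + cutoff_len] for i in range(0, len(ids), step)]
```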
I decided to move the whole enhanced training into its own Training PRO extension.
Some small fixes to the processing of raw text inputs: