Improved training raw string chunking logic #3476
base: main
Conversation
Can you give one example of chunks generated before and after this change for the same dataset? As a simple sanity check.
Sure, here is an example: an input file and then two train_dataset_sample outputs, one from the current main branch and the other with the PR applied. I've run these with the default settings, except for setting a 30-character small-chunks filter (which I don't think did anything). The issues being fixed are most pronounced on poetry, but would be the same on prose. Looking at these again, it looks like the main branch version is also dropping the last line before each hard cut. train_dataset_sample_main_branch.txt
Thanks, that was helpful. The changes overall seem to be in the right direction. One thing I have noticed is that, keeping the default settings and using a raw text file as input, the chunks consistently start with a newline.
I think a lot of that is because the source file has four `\n` characters at many of the hard splits, while the hard split is defined as three. It might make sense to trim these after the split command, along the lines of the sketch below.
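For illustration only, a minimal sketch of that trim (this is not the PR's actual code; the `HARD_CUT` constant and `hard_split` helper are hypothetical names):

```python
# Hedged sketch of the suggested trim: split the raw text on a
# three-newline hard cut, then strip any surplus newlines (a fourth
# or fifth \n) left at the edges of each block.
HARD_CUT = "\n\n\n"

def hard_split(raw_text: str) -> list[str]:
    blocks = raw_text.split(HARD_CUT)
    # strip("\n") removes the extra newlines left over when the source
    # contained four or more consecutive \n at a cut point
    return [b.strip("\n") for b in blocks if b.strip("\n")]
```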
This happened on a dataset of my own, which was a random arXiv paper that I copied and pasted. It doesn't have any \n\n\n\n. I noted that this behavior stopped happening after setting "Prefer Newline Cut Length" to 0 in this PR's branch, so I don't know if those newlines at the beginning of each chunk are intended or not when this parameter is set.
…t-generation-webui into raw_string_processing (merge commit; conflicts: modules/training.py)
Yeah, you are right: it looks like it was struggling with the fact that encode('\n') actually returns three tokens. I have updated it now to deal with the newline tokens being a list, and cuts are looking a lot cleaner on newlines. I also added logic to strip newlines off the start and end of a hard split, if they are present. A rough sketch of the token-level idea is below.
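A hedged sketch of what cutting on a multi-token newline can look like (assuming a Hugging Face-style tokenizer; the helper names are hypothetical, and this is not the PR's actual implementation):

```python
# Hedged sketch: some tokenizers return more than one token for "\n"
# (special tokens plus the newline piece), so a clean cut has to match
# the whole token sequence rather than a single ID.
def newline_token_ids(tokenizer) -> list[int]:
    ids = tokenizer.encode("\n")
    # Drop any special tokens (e.g. BOS) the tokenizer adds around the text.
    return [i for i in ids if i not in tokenizer.all_special_ids]

def last_newline_cut(ids: list[int], nl: list[int], max_len: int) -> int:
    """Index just past the last full newline sequence within max_len tokens,
    falling back to a hard cut at max_len if no newline is found."""
    n = len(nl)
    for i in range(min(max_len, len(ids)) - n, -1, -1):
        if ids[i:i + n] == nl:
            return i + n
    return max_len
```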
Updated example training dataset
It would be great if someone could also check the sentence chunking from my PR in comparison.
I have been trying to understand the chunking method better (I didn't write it, it was contributed in a PR), and I feel like the current training implementation is more complex than it should be. More importantly, it seems to use parameters that are not used anywhere else:
For raw text files, I would prefer to simply encode the entire file, making sure to add a BOS at the beginning and an EOS at the very end, and then do a simple loop over the encoded IDs, taking fixed-length blocks. A minimal sketch of that scheme follows.
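For concreteness, a minimal sketch of that simpler scheme (assuming a Hugging Face tokenizer; the function name and `cutoff_len` parameter here are illustrative, not actual project code):

```python
# Hedged sketch: encode the whole file once, bracket it with BOS/EOS,
# then slice the token IDs into fixed-length blocks of cutoff_len.
def simple_chunks(tokenizer, raw_text: str, cutoff_len: int) -> list[list[int]]:
    ids = tokenizer.encode(raw_text, add_special_tokens=False)
    ids = [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id]
    return [ids[i:i + cutoff_len] for i in range(0, len(ids), cutoff_len)]
```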
There may be a place for simplified training, but there is definitely some need to control how a raw text string is converted into blocks for training purposes. I claim no great expertise either, but I would think it risks confusing the model if it is trained on too many samples that consist of the end of one story and the start of another one. We definitely got noticeably worse results training with datasets that contained a lot of padding tokens.

I believe that overlap and hard cuts are necessary to produce sensibly structured blocks (a toy sketch of overlap follows below); you could argue about the other parameters, I guess, but they definitely allow tidier string sections to be created. Honestly, there are more features that would be useful in raw dataset creation, like the sentence-based cutoffs that @FartyPants wrote, multi-file handling, sample shuffling, and the ability to pack small sections together to avoid padding. Maybe do these basic tidy-up changes now, and a longer-term plan could be to make dataset preparation available as an extension point?
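For readers unfamiliar with the term, a toy sketch of what overlap means in this context (parameter names are hypothetical, not the project's actual ones):

```python
# Hedged sketch of overlap: each block starts (cutoff_len - overlap_len)
# tokens after the previous one, so consecutive blocks share overlap_len
# tokens and no context is lost at block boundaries.
def overlapping_chunks(ids: list[int], cutoff_len: int, overlap_len: int) -> list[list[int]]:
    step = cutoff_len - overlap_len
    assert step > 0, "overlap must be smaller than the block length"
    return [ids[i:i + cutoff_len] for i in range(0, len(ids), step)]
```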
I decided to move the whole enhanced training into its own Training PRO extension.
Some small fixes to the processing of raw text inputs: