stop dropping samples every batch #15

Open
wants to merge 2 commits into
base: main

Conversation

Lewington-pitsos
Contributor

Each tokenization process now keeps track of leftover samples and adds them back in once they accumulate to be long enough.
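For readers skimming the thread, here is a rough sketch of the idea, not the code in this PR. It assumes each batch arrives as one flat list of token ids and that chunks must be exactly `chunk_len` tokens long. The names `chunk_and_drop`, `chunk_with_carry`, and `chunk_len` are hypothetical and exist only for illustration.

```python
from typing import Iterable, Iterator, List


def chunk_and_drop(
    token_batches: Iterable[List[int]], chunk_len: int
) -> Iterator[List[int]]:
    """Baseline: the partial chunk at the end of every batch is discarded."""
    for tokens in token_batches:
        n_full = len(tokens) // chunk_len
        for i in range(n_full):
            yield tokens[i * chunk_len : (i + 1) * chunk_len]


def chunk_with_carry(
    token_batches: Iterable[List[int]], chunk_len: int
) -> Iterator[List[int]]:
    """Carry each batch's remainder forward and prepend it to the next batch."""
    leftover: List[int] = []  # tokens that did not fill a whole chunk yet
    for tokens in token_batches:
        buffer = leftover + tokens
        n_full = len(buffer) // chunk_len
        for i in range(n_full):
            yield buffer[i * chunk_len : (i + 1) * chunk_len]
        # Keep the tail for the next batch instead of dropping it.
        leftover = buffer[n_full * chunk_len :]
    # Anything still in `leftover` here is shorter than chunk_len, so at most
    # one partial chunk is lost per process rather than one per batch.


# Three toy batches of 7, 7 and 6 tokens, chunked into length-4 sequences.
batches = [list(range(0, 7)), list(range(7, 14)), list(range(14, 20))]
dropped = list(chunk_and_drop(batches, chunk_len=4))
carried = list(chunk_with_carry(batches, chunk_len=4))
```

In this toy case the baseline keeps 3 chunks and the carry version keeps 5. One design consequence is that a chunk can now span a batch boundary, so the leftover buffer has to be tracked separately in each tokenization process.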

@norabelrose
Member

Sorry, I'm sort of inclined not to merge this, just for code simplicity reasons. Is there a reason you think it's really important never to drop samples?

@Lewington-pitsos
Contributor Author

The upshots, I suppose, are greater reproducibility and squeezing as much juice out of our dataset as we can (we currently end up dropping tens of thousands of samples for a reasonably sized run). It also gets more important if the user is forced to use a smaller batch size for some reason.

To me that is a very good trade in exchange for 20-30 lines of middlingly complicated code which don't touch anything else.

@dribnet

dribnet commented Aug 31, 2024

Could you instead train for more than one epoch with different tokenization parameters to catch most of the leftovers?
