Better default values for StreamingDataset args #479
Conversation
-    of this size, and samples within each block are shuffled. Defaults to ``1 << 18``.
+    shuffle_block_size (int, optional): Unit of shuffle. A canonical node's samples are split
+        into blocks of this size, and samples within each block are shuffled. If ``None``, its
+        value is calculated as ``max(4_000_000 // num_canonical_nodes, 1 << 18)``. Defaults to
+        ``None``.
Any specific reason for `4_000_000`? Just wondering.
This was based on the shuffle quality experiments, where a shuffle strength of `4_000_000` (i.e., each batch is drawn from `4_000_000` samples) gave good shuffle quality without requiring too many downloads.
How about `max(1 << 18, 1 << 22 // NCN)` lol?
That works too, it's essentially the same, right? Gonna keep as-is for now.
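For reference, the two constants are nearly interchangeable: `1 << 22` is 4,194,304, only about 5% above 4,000,000, and both formulations floor the block size at `1 << 18`. A quick standalone Python sketch (not from this PR) comparing the two:

```python
# Compare the merged default constant (4_000_000) against the reviewer's
# suggested power of two (1 << 22) across a few canonical-node counts (NCN).
for ncn in (1, 2, 8, 16, 64):
    merged = max(4_000_000 // ncn, 1 << 18)
    suggested = max((1 << 22) // ncn, 1 << 18)
    print(f'NCN={ncn}: merged={merged}, suggested={suggested}')
```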
LGTM. @knighton Can you take a look at it?
But note the lint/test failure.
Description of changes:
Adding better defaults to `StreamingDataset`. This will help make sure `StreamingDataset` just works for most cases out of the box, giving good-quality shuffles without degrading throughput or sacrificing resumption. This depends on the `relaxed` partitioning algo being merged as well (#476).

Testing:
Multi-stream deterministic resumption tests ran successfully.
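To illustrate the "just works out of the box" goal, here is a hypothetical usage sketch; the `remote`/`local` paths and batch size are placeholders, not taken from this PR:

```python
from streaming import StreamingDataset

# With the new defaults, predownload, partition_algo, num_canonical_nodes,
# shuffle_algo, and shuffle_block_size can all be omitted.
dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',  # placeholder remote path
    local='/tmp/my-dataset',             # placeholder local cache dir
    shuffle=True,
    batch_size=32,  # device batch size; predownload will default to 8x this
)
```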
Summary of Changes:
- `predownload` defaults to 8x the device batch size.
- `partition_algo` defaults to `relaxed`.
- `num_canonical_nodes` defaults to 64 * the number of physical nodes of the run only when using `py1s` or `py2s` shuffling. In all other cases, it defaults to the number of physical nodes of the run; `relaxed` partitioning takes care of resumption cases.
- `shuffle_algo` defaults to `py1e`. This better balances downloads and gives good shuffle quality.
- `shuffle_block_size` defaults to 4 million / `num_canonical_nodes` or 262144 == `1 << 18`, whichever is greater. This value was selected based on the shuffle quality experiments, where a shuffle strength of 4 million samples (the brown dot in the attached plots) gave comparable train loss stdev and batch composition stdev to stronger shuffles, but with fewer downloads.

Defaults will be propagated and changed as appropriate in the diffusion and llm-foundry repos in conjunction with this PR and #476.
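Putting the rules above together, a minimal sketch of how the defaults could be resolved; names like `device_batch_size` and `physical_nodes` are illustrative stand-ins, not the actual `StreamingDataset` internals:

```python
def resolve_defaults(device_batch_size: int,
                     physical_nodes: int,
                     shuffle_algo: str = 'py1e',
                     partition_algo: str = 'relaxed') -> dict:
    """Sketch of the default-resolution rules described in this PR."""
    # predownload: 8x the per-device batch size.
    predownload = 8 * device_batch_size

    # num_canonical_nodes: 64x physical nodes only for py1s/py2s shuffling;
    # otherwise the number of physical nodes (relaxed partitioning handles
    # resumption when the node count changes).
    if shuffle_algo in ('py1s', 'py2s'):
        num_canonical_nodes = 64 * physical_nodes
    else:
        num_canonical_nodes = physical_nodes

    # shuffle_block_size: 4 million samples spread across canonical nodes,
    # floored at 1 << 18 == 262,144.
    shuffle_block_size = max(4_000_000 // num_canonical_nodes, 1 << 18)

    return {
        'predownload': predownload,
        'partition_algo': partition_algo,
        'num_canonical_nodes': num_canonical_nodes,
        'shuffle_algo': shuffle_algo,
        'shuffle_block_size': shuffle_block_size,
    }
```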
Issue #, if available:
Merge Checklist:
Put an `x` without space in the boxes that apply. If you are unsure about any checklist item, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests
- [ ] I have run `pre-commit` on my change. (check out the `pre-commit` section of the prerequisites)