Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes #476

Merged (10 commits) on Oct 26, 2023

@snarayan21 snarayan21 commented Oct 20, 2023

Description of changes:

Relaxes the requirement that num_canonical_nodes be divisible by the number of physical nodes, or vice versa, which opens up many more node counts for deterministic training and resumption. Enable it by setting partition_algo='relaxed' when instantiating StreamingDataset, or by setting partition_algo: relaxed in the yaml for each dataset. The only assumptions we make are that:

  • num_canonical_nodes plays nicely with the number of physical nodes (one evenly divides the other) in the initial run only.
  • global batch size stays constant during resumption.
  • global batch size is divisible by the total number of devices during resumption.

The relaxed partition algo also preserves the earlier deterministic behavior. Because orig functionality is a subset of it, relaxed should replace orig as the default in an upcoming PR.
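The assumptions listed above can be written down as a small pre-flight check. This is only a sketch: plays_nicely and check_resumption are hypothetical helpers, not part of the streaming library, and "plays nicely" is read here as "one node count evenly divides the other".

```python
def plays_nicely(num_canonical_nodes, num_physical_nodes):
    """Assumed reading of 'plays nicely': one count evenly divides the other."""
    return (num_canonical_nodes % num_physical_nodes == 0
            or num_physical_nodes % num_canonical_nodes == 0)


def check_resumption(num_canonical_nodes, initial_physical_nodes,
                     initial_global_batch_size, resumed_global_batch_size,
                     resumed_num_devices):
    """Return a list of violated assumptions (empty if resumption looks fine)."""
    problems = []
    # Assumption 1: divisibility is only required for the initial run.
    if not plays_nicely(num_canonical_nodes, initial_physical_nodes):
        problems.append('num_canonical_nodes does not play nicely with the initial node count')
    # Assumption 2: the global batch size must not change on resumption.
    if resumed_global_batch_size != initial_global_batch_size:
        problems.append('global batch size changed on resumption')
    # Assumption 3: every device must receive the same number of samples per step.
    if resumed_global_batch_size % resumed_num_devices != 0:
        problems.append('global batch size is not divisible by the number of devices')
    return problems


# Example: NCN=2, started on 2 nodes, resuming on 3 nodes x 8 devices.
print(check_resumption(2, 2, 48, 48, 24))
```

Note that the resumed node count itself (3 in the example) never has to divide or be divided by num_canonical_nodes; only the initial node count does.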

Testing:

  • Added unit tests proving deterministic sample ordering with multiple numbers of nodes that were not possible with the orig partition algorithm
  • Conducted local CPU tests with saving and loading dataset state dict to ensure that values were being set and read in correctly
  • Conducted resumption tests on interactive instance with 4 GPUs to ensure that state dict was being set and loaded correctly
  • Conducted multi-stream elastic deterministic resumption tests going from 2 to 3 nodes with num_canonical_nodes set to 2 and 64. This would not have been possible earlier, and results are shown below. Some numerical instability on newer clusters is likely responsible for slight deviations.
  • Conducted tests going from 1 -> 2 nodes, to make sure behavior was the same as orig partitioning.
  • Conducted tests going from 2 -> 3 -> 4 nodes, to make sure the initial number of physical nodes was being persisted in the state and partitions were still deterministic across multiple resumptions.

How it works:

  • If partitioning for the first time, use the orig partition method and save the initial number of physical nodes in the state.
  • If resuming, partition over the initial number of physical nodes and the initial device batch size, then repartition the result across the new number of physical nodes and the new device batch size. If num_canonical_nodes (NCN) and the number of physical nodes (PN) play nicely, the orig partition is still used, which preserves all earlier functionality and gives nicer downloading for nice values of PN.
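The two-stage resumption described above can be sketched roughly as follows. This is a toy illustration, not the code in streaming/base/partition/relaxed.py: initial_partition is a hypothetical stand-in for the orig partition (each device simply reads its own contiguous stripe of sample IDs), repartition is a hypothetical helper, and one device per node is assumed.

```python
def initial_partition(num_samples, num_devices, device_batch_size):
    """Toy stand-in for the orig partition.

    Returns steps[step][device] -> list of sample IDs. Each device reads
    its own contiguous stripe of the sample range.
    """
    global_batch_size = num_devices * device_batch_size
    num_steps = num_samples // global_batch_size  # drop the ragged tail for simplicity
    stripe = num_steps * device_batch_size        # samples owned by each device
    steps = []
    for step in range(num_steps):
        per_device = []
        for dev in range(num_devices):
            start = dev * stripe + step * device_batch_size
            per_device.append(list(range(start, start + device_batch_size)))
        steps.append(per_device)
    return steps


def repartition(steps, new_num_devices):
    """Re-split every global batch across a new number of devices.

    The samples in each global batch (and their order) are unchanged;
    only the assignment of samples to devices differs.
    """
    resumed = []
    for per_device in steps:
        global_batch = [sample for dev in per_device for sample in dev]
        if len(global_batch) % new_num_devices != 0:
            raise ValueError('global batch size must be divisible by the number of devices')
        new_bs = len(global_batch) // new_num_devices
        resumed.append([global_batch[i * new_bs:(i + 1) * new_bs]
                        for i in range(new_num_devices)])
    return resumed


initial = initial_partition(num_samples=48, num_devices=2, device_batch_size=6)
resumed = repartition(initial, new_num_devices=3)
print(resumed[0])  # [[0, 1, 2, 3], [4, 5, 24, 25], [26, 27, 28, 29]]
```

In this toy layout the global batch at every step contains the same samples before and after repartitioning, which is the property that makes resumption deterministic. The middle device in the resumed run reads from both original stripes.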

In the worst case, a node during the resumed run will have to do twice as many shard downloads as a node from the initial run. This is because of the way a global batch has to be split across the new number of nodes. When combined with better shuffling algorithms and batching methods, this should be an acceptable tradeoff.
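One way to build intuition for this overhead is to count how many initial-run nodes' slices of a global batch each resumed node overlaps. The sketch below is a simplified model rather than the library's accounting: initial_nodes_touched is a hypothetical helper, each global batch is assumed to be split into equal contiguous slices per node, and it counts slices touched, not shards downloaded.

```python
def initial_nodes_touched(initial_nodes, new_nodes):
    """For each resumed node, count the initial nodes whose slice it overlaps.

    A global batch is treated as the interval [0, initial_nodes * new_nodes),
    so both sets of slice boundaries fall on integers.
    """
    counts = []
    for j in range(new_nodes):
        # Slice of the global batch owned by resumed node j.
        lo, hi = j * initial_nodes, (j + 1) * initial_nodes
        touched = 0
        for i in range(initial_nodes):
            # Slice of the global batch owned by initial node i.
            start, end = i * new_nodes, (i + 1) * new_nodes
            if max(lo, start) < min(hi, end):
                touched += 1
        counts.append(touched)
    return counts


print(initial_nodes_touched(2, 3))  # [1, 2, 1]
print(initial_nodes_touched(2, 4))  # [1, 1, 1, 1]
```

In this model, going from 2 to 3 nodes leaves one resumed node straddling two initial slices, while going from 2 to 4 nodes (a case where the counts divide evenly) leaves none straddling.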

Test with num_canonical_nodes=2.

  • Light blue: full 200 step run on 2 nodes.
  • Brown: first 100 steps on 2 nodes.
  • Gray: last 100 steps on 3 nodes.
Screenshot 2023-10-20 at 2 18 47 AM

Test with num_canonical_nodes=64.

  • Blue: full 200 step run on 2 nodes.
  • Red: first 100 steps on 2 nodes.
  • Yellow: last 100 steps on 3 nodes.
Screenshot 2023-10-20 at 2 19 22 AM

Test from 1->2 nodes, preserving orig partitioning behavior:

  • Blue: full 200 step run on 1 node.
  • Red: first 100 steps on 1 node.
  • Green: last 100 steps on 2 nodes.
Screenshot 2023-10-20 at 12 11 59 PM

Test from 2->3->4 nodes, with NCN=2, to test multiple resumptions:

  • Pink: full 200 step run on 2 nodes.
  • Purple: first 100 steps on 2 nodes.
  • Orange: middle 100 steps on 3 nodes.
  • Light Blue: last 100 steps on 4 nodes.
Screenshot 2023-10-20 at 4 29 34 PM

Issue #, if available:

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

@snarayan21 snarayan21 changed the title Ncn constraint relaxation Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes! Oct 26, 2023
@snarayan21 snarayan21 changed the title Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes! Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes Oct 26, 2023
@karan6181 karan6181 left a comment

LGTM. Thank You!

@snarayan21 snarayan21 merged commit 217e66e into mosaicml:main Oct 26, 2023