Adding replay into GPT-NeoX #1200
base: main
Please ignore the above commits. I accidentally pushed to upstream when modifying this branch in my fork.
Default = 0.05

Fraction of a batch dedicated to doing replay. For example, 0.1 means that in a batch of 100, 19 samples will come from the replay buffer.
For example, 0.1 means that in a batch of 100, 19 samples will come from the replay buffer.
Is this a typo? Why wouldn't it be 10 samples?
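As a sanity check on the arithmetic in question, here is a minimal sketch. It assumes the replay count is simply the batch size times the replay fraction, rounded to an integer; the helper name and that rounding rule are hypothetical and not taken from this PR.

```python
def replay_samples_per_batch(batch_size: int, replay_fraction: float) -> int:
    """Hypothetical helper: how many samples of a batch come from the replay buffer."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    # round() absorbs floating-point error in the product.
    # Note that Python rounds exact halves to the nearest even integer.
    return round(batch_size * replay_fraction)


n_replay = replay_samples_per_batch(100, 0.1)  # samples drawn from the replay buffer
n_new = 100 - n_replay                         # samples drawn from the new data
```

Under that assumption, a fraction of 0.1 and a batch of 100 gives 10 replay samples and 90 new samples, which is the reading the question above expects.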
- **replay_seed**: int

Default = 1234
From your other comments, it seems important that the replay seed differs from the general data seed. If that's correct, let's use a different default.
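One possible reason for wanting distinct seeds can be illustrated with a small sketch. It assumes both streams are shuffled by a seeded pseudo-random generator; the helper, the value used for the general data seed, and the alternative replay seed are all hypothetical and only for illustration.

```python
import random


def shuffled_order(n: int, seed: int) -> list:
    """Hypothetical helper: a seeded shuffle of the sample indices 0..n-1."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


DATA_SEED = 1234          # assumed value of the general data seed
REPLAY_SEED_SAME = 1234   # the replay_seed default shown in the diff above
REPLAY_SEED_OTHER = 4321  # hypothetical alternative default

data_order = shuffled_order(1000, DATA_SEED)
# With equal seeds, the two streams are shuffled in exactly the same order.
same_seed_matches = data_order == shuffled_order(1000, REPLAY_SEED_SAME)
# With a different seed, the replay stream gets its own independent order.
other_seed_matches = data_order == shuffled_order(1000, REPLAY_SEED_OTHER)
```

If the two datasets were shuffled this way, sharing a seed would tie the replay order to the training order, which a separate default avoids.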
This PR adds replay to GPT-NeoX. I implemented it for the paper Simple and Scalable Strategies to Continually Pre-train Large Language Models, which shows simple ways to continue pretraining efficiently by improving adaptation to new data while mitigating forgetting of previous data. Note that this PR can also serve as a basis for adding the ability to resume training from a given index in a dataset, building on how I implemented this feature for replay datasets.
How to use
I tried to make the descriptions of the replay args informative enough to serve as documentation. An example of a config using replay is also provided in `tests/config/example_replay_config.yml`.
Unsupported/untested features:
`replay_label_data` arg that would specify the prefix to the idx and data path of replay label data, then generate the specific replay label data path from the prefix, and treat it in a similar way as the training data in the block
Pending tests
Currently, the tests required are:
The tests can follow the procedure described in `tests/model/test_batch_replicability.py`. Tests 1 and 3 passed with the Summit version of NeoX, but I'll need to run them again on the replay implementation based on the current main branch of NeoX. I'll probably need someone else to test that label data support (test 2) did not break, as I'm unfamiliar with this feature of NeoX and am currently too busy to take that on.
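To illustrate the kind of property a batch replicability check looks for, here is a toy sketch that is independent of the NeoX code base. Every name, dataset size, and seed in it is hypothetical; it only assumes that batches mix new and replay samples and are a deterministic function of the step index and the two seeds.

```python
import random


def shuffled_order(n, seed):
    """Seeded shuffle of the sample indices 0..n-1."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def batch_at(step, new_order, replay_order, batch_size, n_replay):
    """Toy batch for a given step: (new sample ids, replay sample ids)."""
    n_new = batch_size - n_replay
    new_ids = [new_order[(step * n_new + i) % len(new_order)] for i in range(n_new)]
    replay_ids = [
        replay_order[(step * n_replay + i) % len(replay_order)]
        for i in range(n_replay)
    ]
    return new_ids, replay_ids


def run(first_step, last_step, data_seed=1234, replay_seed=4321):
    """Build the batches for steps first_step..last_step-1 from scratch."""
    new_order = shuffled_order(500, data_seed)
    replay_order = shuffled_order(200, replay_seed)
    return [
        batch_at(step, new_order, replay_order, batch_size=20, n_replay=1)
        for step in range(first_step, last_step)
    ]


full_run = run(0, 10)     # uninterrupted run over steps 0..9
resumed_run = run(5, 10)  # run restarted at step 5
batches_match = full_run[5:] == resumed_run
```

In this sketch, a run resumed at step 5 reproduces the same batches as the uninterrupted run, for both the new and the replay portions of each batch.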