
Make data processing optional in run_training() #220

Merged: 3 commits into instructlab:main on Oct 7, 2024

Conversation

MichaelClifford
Contributor

This PR makes running data_process.main() optional in run_training(). This change is needed because it is not always desirable to process the data inside the training function, particularly in distributed training cases where it's beneficial to process the data once prior to training and then distribute the processed data, along with the training function, to each node.

The changes here have been made so that data processing inside of run_training() is still the default behavior.

I've updated the README.md to reflect how to run data processing independent of run_training().
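The call pattern this PR enables can be sketched with a minimal, self-contained toy: TorchrunArgs, TrainingArgs, and the `process_data` default are taken from the PR's diff, but the bodies below (including the `process_data_main` stub and the `calls` log) are illustrative stand-ins, not the real instructlab.training implementation.

```python
# Sketch of the pattern this PR introduces: data processing inside
# run_training() stays the default, but can be skipped with process_data=False.
# Everything here is a toy stand-in for illustration only.
from dataclasses import dataclass

calls = []  # records which steps ran, for illustration


@dataclass
class TorchrunArgs:
    nproc_per_node: int = 1


@dataclass
class TrainingArgs:
    data_path: str = "data.jsonl"


def process_data_main(train_args: TrainingArgs) -> None:
    # Stand-in for data_process.main()
    calls.append("process_data")


def run_training(torch_args, train_args, process_data: bool = True) -> None:
    if process_data:  # default: preprocess inside the training function
        process_data_main(train_args)
    calls.append("train")


# Default behavior: processing happens inside run_training().
run_training(TorchrunArgs(), TrainingArgs())

# Distributed case: preprocess once up front, then skip it per node.
process_data_main(TrainingArgs())
run_training(TorchrunArgs(), TrainingArgs(), process_data=False)
```

Calling with `process_data=False` leaves existing callers untouched while letting distributed setups preprocess exactly once.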

@mergify mergify bot added documentation Improvements or additions to documentation ci-failure labels Sep 23, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Sep 23, 2024
Contributor

@JamesKunstle JamesKunstle left a comment


Good addition; we can probably merge over the pylint failures and fix those things in another PR, since they're for code you didn't touch.

@mergify mergify bot added the one-approval label Sep 24, 2024
@MichaelClifford
Contributor Author

Thanks for the review @JamesKunstle!

Would it be helpful to open another PR to update the pylint config and make it a little less strict? Looks like it mainly comes down to this value being too low.

max-args=5
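For reference, that knob lives under pylint's design checker; a hypothetical loosened setting in `.pylintrc` might look like the fragment below. The exact raised value is illustrative, not something decided in this PR.

```ini
[DESIGN]
# pylint's default is max-args=5; raising it would quiet the
# too-many-arguments warnings mentioned above. The value 8 is
# only an example, not a project decision.
max-args=8
```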

@Maxusmusti
Contributor

Maxusmusti commented Sep 26, 2024

Thanks for adding this @MichaelClifford! I would just rebase this on main (might fix some linting issues for you for free), and then for the rest you can check via python3.11 -m tox p -e lint,mypy,ruff (but will likely be fine after rebase). Otherwise looks good!

Contributor

mergify bot commented Sep 26, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @MichaelClifford please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@MichaelClifford
Contributor Author

Thanks for the review @Maxusmusti and @JamesKunstle! Sorry for the delay on the rebase. It should be good now, and it looks like all the tests are passing :)

Contributor

@Maxusmusti Maxusmusti left a comment


@MichaelClifford Looks good! Just a couple of quick comments, but nothing blocking, really.

README.md Outdated


If the machine's above have shared storage, users can preprocess the training dataset a single time so that it can then distributed to each machine with the following update:
Contributor


machine's -> machines
then distributed -> then be distributed

Contributor Author


Thanks 😄 done.
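The "preprocess once on shared storage, then train on every machine" workflow from the README change can be sketched as follows; the helper names, cache path, and toy JSON dataset are made up for illustration and are not the real instructlab API.

```python
# Sketch of "preprocess once, train on every machine" over shared storage.
# Helper names and file layout are illustrative, not the real instructlab API.
import json
import os
import tempfile


def preprocess(raw, out_path):
    # Done a single time; the result lands on storage every node can see.
    with open(out_path, "w") as f:
        json.dump([r.lower() for r in raw], f)


def train_on_node(node_id, data_path):
    # Each node only reads the already-processed data; no reprocessing.
    with open(data_path) as f:
        return (node_id, len(json.load(f)))


shared = os.path.join(tempfile.mkdtemp(), "processed.json")
preprocess(["A", "B", "C"], shared)  # once, up front
results = [train_on_node(n, shared) for n in range(2)]  # every node reuses it
print(results)  # prints [(0, 3), (1, 3)]
```

The point is that the expensive processing step runs once, while each node's training call only consumes the shared result.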

src/instructlab/training/data_process.py
@Maxusmusti
Contributor

@MichaelClifford Awesome, thanks! Oleg is going to give this a spin, and then we'll get this merged today ✅

README.md
```diff
@@ -28,9 +28,13 @@
 # defer import of main_ds
-def run_training(torch_args: TorchrunArgs, train_args: TrainingArgs) -> None:
+def run_training(
+    torch_args: TorchrunArgs, train_args: TrainingArgs, process_data: bool = True
```
Member


I would move the process_data arg out to live under train_args, unless there's a compelling reason to not do this. Keeping this function simple allows other consuming libraries to have a straightforward interface into our main training loop.
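The suggestion above would keep run_training() at two parameters by folding the flag into the config object. A hypothetical sketch of that alternative (this is the reviewer's proposed shape, not the code that was merged; the dataclass body is a stand-in for the real TrainingArgs):

```python
# Hypothetical alternative from the review: carry process_data on TrainingArgs
# instead of widening the run_training() signature. Toy stand-in, not the
# merged design or the real instructlab.training.TrainingArgs.
from dataclasses import dataclass


@dataclass
class TrainingArgs:
    data_path: str = "data.jsonl"
    process_data: bool = True  # new field, defaulting to today's behavior


def run_training(torch_args, train_args: TrainingArgs) -> None:
    if train_args.process_data:
        pass  # data_process.main(...) would run here
    # ... launch training ...


# The function's signature stays simple for consuming libraries:
run_training(None, TrainingArgs(process_data=False))
```

Consumers such as the CLI would then toggle preprocessing through the config they already build, rather than a new positional/keyword argument.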

Member

@RobotSail RobotSail left a comment


I tested locally and it seems to work. Please address the comment about process_data as an arg: we should either keep it as part of train_args or make a really good case for why it shouldn't be. Presently we are trying to keep our interface very simple so that it's easy for the CLI & other tools to consume.

MichaelClifford and others added 2 commits October 5, 2024 14:44
Co-authored-by: Michael Clifford <[email protected]>
Co-authored-by: Shreyanand <[email protected]>
Signed-off-by: Michael Clifford <[email protected]>
Signed-off-by: Michael Clifford <[email protected]>
Member

@RobotSail RobotSail left a comment


LGTM!

@mergify mergify bot merged commit 99e833a into instructlab:main Oct 7, 2024
14 checks passed
@mergify mergify bot removed the one-approval label Oct 7, 2024