
Make data processing optional in run_training() #220

Merged: 3 commits into instructlab:main on Oct 7, 2024

Conversation

MichaelClifford
Contributor

This PR makes running data_process.main() optional in run_training(). This change is needed because it is not always desirable to process the data inside the training function, particularly in distributed training cases where it's beneficial to process the data once prior to training and then distribute the processed data, along with the training function, to each node.

The changes here have been made so that data processing inside of run_training() is still the default behavior.

I've updated the README.md to reflect how to run data processing independent of run_training().
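The call pattern this PR enables can be sketched with a minimal, self-contained toy: TorchrunArgs, TrainingArgs, and the `process_data` default are taken from the PR's diff, but the bodies below (including the `process_data_main` stub and the `calls` log) are illustrative stand-ins, not the real instructlab.training implementation.

```python
# Sketch of the pattern this PR introduces: data processing inside
# run_training() stays the default, but can be skipped with process_data=False.
# Everything here is a toy stand-in for illustration only.
from dataclasses import dataclass

calls = []  # records which steps ran, for illustration


@dataclass
class TorchrunArgs:
    nproc_per_node: int = 1


@dataclass
class TrainingArgs:
    data_path: str = "data.jsonl"


def process_data_main(train_args: TrainingArgs) -> None:
    # Stand-in for data_process.main()
    calls.append("process_data")


def run_training(torch_args, train_args, process_data: bool = True) -> None:
    if process_data:  # default: preprocess inside the training function
        process_data_main(train_args)
    calls.append("train")


# Default behavior: processing happens inside run_training().
run_training(TorchrunArgs(), TrainingArgs())

# Distributed case: preprocess once up front, then skip it per node.
process_data_main(TrainingArgs())
run_training(TorchrunArgs(), TrainingArgs(), process_data=False)
```

Calling with `process_data=False` leaves existing callers untouched while letting distributed setups preprocess exactly once.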

@mergify mergify bot added documentation Improvements or additions to documentation ci-failure labels Sep 23, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Sep 23, 2024
Contributor

@JamesKunstle JamesKunstle left a comment


Good addition; we can probably merge over the pylint failures and fix those things in another PR, since they're for code you didn't touch.

@mergify mergify bot added the one-approval label Sep 24, 2024
@MichaelClifford
Contributor Author

Thanks for the review @JamesKunstle!

Would it be helpful to open another PR to update the pylint config and make it a little less strict? Looks like it mainly comes down to this value being too low.

max-args=5
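For reference, that knob lives under pylint's design checker; a hypothetical loosened setting in `.pylintrc` might look like the fragment below. The exact raised value is illustrative, not something decided in this PR.

```ini
[DESIGN]
# pylint's default is max-args=5; raising it would quiet the
# too-many-arguments warnings mentioned above. The value 8 is
# only an example, not a project decision.
max-args=8
```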

@Maxusmusti
Contributor

Maxusmusti commented Sep 26, 2024

Thanks for adding this @MichaelClifford! I would just rebase this on main (might fix some linting issues for you for free), and then for the rest you can check via python3.11 -m tox p -e lint,mypy,ruff (but will likely be fine after rebase). Otherwise looks good!

Contributor

mergify bot commented Sep 26, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @MichaelClifford please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@MichaelClifford
Contributor Author

Thanks for the review @Maxusmusti and @JamesKunstle! Sorry for the delay on the rebase. It should be good now, and it looks like all the tests are passing :)

Contributor

@Maxusmusti Maxusmusti left a comment


@MichaelClifford Looks good! Just a couple of quick comments, but nothing blocking, really.

README.md Outdated


If the machine's above have shared storage, users can preprocess the training dataset a single time so that it can then distributed to each machine with the following update:
Contributor


machine's -> machines
then distributed -> then be distributed

Contributor Author


Thanks 😄 done.
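The "preprocess once on shared storage, then train on every machine" workflow from the README change can be sketched as follows; the helper names, cache path, and toy JSON dataset are made up for illustration and are not the real instructlab API.

```python
# Sketch of "preprocess once, train on every machine" over shared storage.
# Helper names and file layout are illustrative, not the real instructlab API.
import json
import os
import tempfile


def preprocess(raw, out_path):
    # Done a single time; the result lands on storage every node can see.
    with open(out_path, "w") as f:
        json.dump([r.lower() for r in raw], f)


def train_on_node(node_id, data_path):
    # Each node only reads the already-processed data; no reprocessing.
    with open(data_path) as f:
        return (node_id, len(json.load(f)))


shared = os.path.join(tempfile.mkdtemp(), "processed.json")
preprocess(["A", "B", "C"], shared)  # once, up front
results = [train_on_node(n, shared) for n in range(2)]  # every node reuses it
print(results)  # prints [(0, 3), (1, 3)]
```

The point is that the expensive processing step runs once, while each node's training call only consumes the shared result.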

src/instructlab/training/data_process.py
@Maxusmusti
Contributor

@MichaelClifford Awesome, thanks! Oleg is going to give this a spin, and then we'll get this merged today ✅

README.md
```diff
@@ -28,9 +28,13 @@
 # defer import of main_ds
-def run_training(torch_args: TorchrunArgs, train_args: TrainingArgs) -> None:
+def run_training(
+    torch_args: TorchrunArgs, train_args: TrainingArgs, process_data: bool = True
```
Member


I would move the process_data arg out to live under train_args, unless there's a compelling reason to not do this. Keeping this function simple allows other consuming libraries to have a straightforward interface into our main training loop.
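The suggestion above would keep run_training() at two parameters by folding the flag into the config object. A hypothetical sketch of that alternative (this is the reviewer's proposed shape, not the code that was merged; the dataclass body is a stand-in for the real TrainingArgs):

```python
# Hypothetical alternative from the review: carry process_data on TrainingArgs
# instead of widening the run_training() signature. Toy stand-in, not the
# merged design or the real instructlab.training.TrainingArgs.
from dataclasses import dataclass


@dataclass
class TrainingArgs:
    data_path: str = "data.jsonl"
    process_data: bool = True  # new field, defaulting to today's behavior


def run_training(torch_args, train_args: TrainingArgs) -> None:
    if train_args.process_data:
        pass  # data_process.main(...) would run here
    # ... launch training ...


# The function's signature stays simple for consuming libraries:
run_training(None, TrainingArgs(process_data=False))
```

Consumers such as the CLI would then toggle preprocessing through the config they already build, rather than a new positional/keyword argument.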

Member

@RobotSail RobotSail left a comment


I tested locally and it seems to work. Please address the comment about process_data as an arg: we should either keep it as part of train_args or make a really good case for why it shouldn't be. Presently we are trying to keep our interface very simple so that it's easy for the CLI & other tools to consume.

MichaelClifford and others added 2 commits October 5, 2024 14:44
Co-authored-by: Michael Clifford <[email protected]>
Co-authored-by: Shreyanand <[email protected]>
Signed-off-by: Michael Clifford <[email protected]>
Signed-off-by: Michael Clifford <[email protected]>
Member

@RobotSail RobotSail left a comment


LGTM!

@mergify mergify bot merged commit 99e833a into instructlab:main Oct 7, 2024
14 checks passed
@mergify mergify bot removed the one-approval label Oct 7, 2024