add smoketest for basic training runs #398

JamesKunstle · 2025-01-17T00:18:47Z

This PR adds a single smoke test that runs "training" on the sample data in this repo. This iteration of the training loop uses minimal features: no LoRA, no CPU offloading, and no flash-attention (i.e., minimal external dependencies).

By default, however, it requires four GPUs. It currently takes about six minutes to finish on a 4xL40s machine because the sample dataset has been truncated to samples with a sequence length of <160 tokens.
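As a rough illustration of the truncation described above, the filter could look like the sketch below. The function name `filter_short_samples` and the `len` field holding each sample's token count are assumptions made for this example, not names taken from the repo.

```python
MAX_TOKENS = 160  # threshold from the description: keep samples with < 160 tokens


def filter_short_samples(samples, max_tokens=MAX_TOKENS, length_key="len"):
    """Keep only samples whose token count is strictly below max_tokens.

    `length_key` is a hypothetical field name for the per-sample token count.
    """
    return [s for s in samples if s[length_key] < max_tokens]


# Toy records standing in for the preprocessed sample data.
samples = [{"id": 0, "len": 42}, {"id": 1, "len": 160}, {"id": 2, "len": 159}]
short = filter_short_samples(samples)
```

The comparison is strict, so a sample with exactly 160 tokens is dropped.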

It executes:

1. Preprocess the sample data and store it in the cache.
2. Load the model and store it in the cache.
3. Start 4x accelerator training, setting up four cards.
4. Run training for 1 epoch (roughly 5 minutes on the tested hardware).
5. Save a full-state checkpoint to disk.
6. Exit the loop and clean up the cache and checkpoint.
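The sequence above can be sketched as follows. This is only an outline under stated assumptions: `preprocess_data`, `cache_model`, and `run_training` are hypothetical stand-ins that write placeholder files, whereas the real test would call the repo's preprocessing code and launch distributed training on four GPUs.

```python
import pathlib
import tempfile


def preprocess_data(cache_dir):
    """Hypothetical stand-in: write 'preprocessed' data into the cache."""
    out = cache_dir / "data.jsonl"
    out.write_text('{"len": 42}\n')
    return out


def cache_model(cache_dir):
    """Hypothetical stand-in: place the model in the cache."""
    out = cache_dir / "model"
    out.mkdir()
    return out


def run_training(data_path, model_path, checkpoint_dir, nproc=4, epochs=1):
    """Placeholder for the training launch; only writes a marker checkpoint."""
    ckpt = checkpoint_dir / "full_state"
    ckpt.mkdir(parents=True)
    (ckpt / "DONE").write_text(f"nproc={nproc} epochs={epochs}")
    return ckpt


def smoketest():
    # Temporary directories are removed on exit, which covers the cleanup step.
    with tempfile.TemporaryDirectory() as cache, tempfile.TemporaryDirectory() as ckpts:
        cache_dir, ckpt_dir = pathlib.Path(cache), pathlib.Path(ckpts)
        data = preprocess_data(cache_dir)
        model = cache_model(cache_dir)
        ckpt = run_training(data, model, ckpt_dir)
        assert ckpt.exists(), "training should leave a checkpoint on disk"
        saved = ckpt
    return saved  # this path has been deleted by the time we return


leftover = smoketest()
```

Using context-managed temporary directories means the cache and checkpoint are cleaned up even if the training step raises.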

It includes boilerplate for future feature-coverage tests that exercise options such as LoRA or CPU offloading.

The intention is that this test (and following tests like it) should be run less frequently than traditional unit tests, linting, and static code analysis (hence the @pytest.mark.slow decorator) but should still give a clear indication, later in the PR review process, that our features will run to completion without obvious problems.
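A slow, parametrized test of this kind might be laid out roughly as below. The `FEATURE_MATRIX` contents, the `features` argument, and the test name are illustrative assumptions, not the code in this PR; the not-yet-covered combinations are skipped so that only the baseline runs.

```python
import pytest

# Hypothetical feature matrix; only the baseline is exercised for now.
FEATURE_MATRIX = [
    pytest.param({"lora": False, "cpu_offload": False}, id="baseline"),
    pytest.param(
        {"lora": True, "cpu_offload": False},
        id="lora",
        marks=pytest.mark.skip(reason="feature coverage not implemented yet"),
    ),
    pytest.param(
        {"lora": False, "cpu_offload": True},
        id="cpu-offload",
        marks=pytest.mark.skip(reason="feature coverage not implemented yet"),
    ),
]


@pytest.mark.slow
@pytest.mark.parametrize("features", FEATURE_MATRIX)
def test_training_smoke(features):
    # Placeholder body: the real test would run the training loop end to end.
    assert isinstance(features, dict)
```

Because `slow` is a custom marker, it would need to be registered in the pytest configuration (the `markers` setting) to avoid unknown-marker warnings. Fast CI jobs can then deselect these tests with `pytest -m "not slow"`, and the GPU job can select them with `pytest -m slow`.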

Signed-off-by: James Kunstle <[email protected]>

mergify bot added the CI/CD, testing, ci-failure, and dependencies labels Jan 17, 2025

JamesKunstle force-pushed the test-train-entrypoints branch from f95dd01 to bcff9ca January 17, 2025 00:22

mergify bot added ci-failure and removed ci-failure labels Jan 17, 2025

JamesKunstle added 3 commits January 24, 2025 15:43

adds basic training smoketest.

54f7606

Signed-off-by: James Kunstle <[email protected]>

adds matrix testing run boilerplate; minor refactor

4c848b0

Signed-off-by: James Kunstle <[email protected]>

fix misspelled parametrize

7834684

Signed-off-by: James Kunstle <[email protected]>

JamesKunstle force-pushed the test-train-entrypoints branch from bcff9ca to 7834684 January 24, 2025 23:44
