add smoketest for basic training runs #398

Draft
wants to merge 3 commits into main

JamesKunstle (Contributor) commented Jan 17, 2025

This PR adds a single smoke test that runs "training" on the sample data we have in this repo. This iteration of the training loop uses minimal features (no LoRA, no CPU offloading, no flash-attention) and therefore has minimal external dependencies.

By default, however, it requires four GPUs. It currently takes about six minutes to finish on a 4xL40s machine because the sample dataset has been truncated to samples with a sequence length of fewer than 160 tokens.
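
As a rough illustration of the truncation described above, the sketch below keeps only samples shorter than 160 tokens. The `input_ids` field name and the `filter_short_samples` helper are assumptions made for this example and are not taken from the PR's code.

```python
from typing import Dict, Iterable, List

MAX_TOKENS = 160  # threshold quoted in the PR description


def filter_short_samples(
    samples: Iterable[Dict], max_tokens: int = MAX_TOKENS
) -> List[Dict]:
    """Keep only samples whose tokenized length is strictly below max_tokens.

    Assumes each sample stores its token ids under "input_ids"
    (a hypothetical field name, chosen for illustration).
    """
    return [s for s in samples if len(s["input_ids"]) < max_tokens]


# Toy samples with 10, 159, 160, and 500 tokens.
samples = [
    {"input_ids": list(range(10))},
    {"input_ids": list(range(159))},
    {"input_ids": list(range(160))},
    {"input_ids": list(range(500))},
]
kept = filter_short_samples(samples)
print([len(s["input_ids"]) for s in kept])  # prints [10, 159]
```

A strict `<` comparison is used here to match the "<160" wording; whether the real preprocessing uses a strict or inclusive bound is not stated in the description.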

It executes the following steps:

  1. Preprocess the sample data and store it in the cache.
  2. Load the model and store it in the cache.
  3. Start 4x accelerator training, setting up four cards.
  4. Run training for 1 epoch (roughly 5 minutes on the tested hardware).
  5. Save a full-state checkpoint to disk.
  6. Exit the loop and clean up the cache and checkpoint.

It includes boilerplate for future feature-coverage tests, such as tests for LoRA or CPU offloading.

The intention is that this test (and subsequent tests like it) should run less frequently than traditional unit tests, linting, and static code analysis (hence the @pytest.mark.slow decorator), but should still give a clear indication, later in the PR review process, that our features run to completion without obvious problems.
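
One common way to make `@pytest.mark.slow` tests opt-in is the `conftest.py` pattern from the pytest documentation, sketched below. The `--runslow` flag name follows that documentation and is an assumption here; this PR may wire up the marker differently.

```python
# conftest.py (sketch; the --runslow flag is an assumption, not from this PR)
import pytest


def pytest_addoption(parser):
    # Add an opt-in command-line flag for slow tests.
    parser.addoption(
        "--runslow", action="store_true", default=False, help="run slow tests"
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "slow: mark test as slow to run")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # flag given: run everything, including slow tests
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

With this in place, a plain `pytest` run skips tests decorated with `@pytest.mark.slow`, and `pytest --runslow` includes them.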

mergify bot added the CI/CD, testing, ci-failure, and dependencies labels on Jan 17, 2025
JamesKunstle force-pushed the test-train-entrypoints branch from f95dd01 to bcff9ca on January 17, 2025 00:22
mergify bot added and removed the ci-failure label on Jan 17, 2025
JamesKunstle force-pushed the test-train-entrypoints branch from bcff9ca to 7834684 on January 24, 2025 23:44