[DO NOT LAND] compile more modules #1938

felipemello1 · 2024-11-01T03:38:59Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Currently we compile only the transformer layers. However, we could also compile the embedding, the norm, and the output layer.

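# Compile the final norm in place, if the model has one.
if hasattr(model, "norm"):
    model.norm.compile(backend=backend)

# chunked_output is wrapped with torch.compile and rebound on the model.
if hasattr(model, "chunked_output"):
    model.chunked_output = torch.compile(model.chunked_output, backend=backend)

# Compile the token embedding in place, if the model has one.
if hasattr(model, "token_embeddings"):
    model.token_embeddings.compile(backend=backend)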

Test plan

3b with packing

tune run full_finetune_single_device --config llama3_2/3B_full_single_device optimizer_in_bwd=True enable_activation_checkpointing=True enable_activation_offloading=True optimizer._component_=torch.optim.AdamW optimizer.fused=True compile=True dataset.packed=True dataset.split=train[:5%] tokenizer.max_seq_len=2048 metric_logger=torchtune.training.metric_logging.WandBLogger metric_logger.project=profiling log_every_n_steps=1 log_peak_memory_stats=True gradient_accumulation_steps=1 max_steps_per_epoch=15 epochs=1 batch_size=5 metric_logger.name=baseline loss=torchtune.modules.loss.CEWithChunkedOutputLoss

8b with packing

11b NO packing, NO activation offloading

Conclusion:

Compiling the extra modules seems to help when the embeddings are tied. However, without packing there are more graph breaks, which slows down early training. We should fix the graph breaks and then potentially land this PR. Optionally, we can compile the extra layers only if we have tied embeddings.
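
The last option (compile the extra layers only when the embeddings are tied) needs a way to detect tying. The following is a rough sketch under stated assumptions, not torchtune's API: `has_tied_embeddings` and `make_toy` are hypothetical helpers, and the check assumes both `token_embeddings` and `output` expose a `.weight` tensor. Real models may wire weight tying differently, in which case the check would need to be adapted.

```python
import torch
from torch import nn


def has_tied_embeddings(model: nn.Module) -> bool:
    """Return True if the output projection shares storage with the embedding.

    Assumes (hypothetically) that both modules expose a `.weight` tensor.
    """
    emb_w = getattr(getattr(model, "token_embeddings", None), "weight", None)
    out_w = getattr(getattr(model, "output", None), "weight", None)
    if not isinstance(emb_w, torch.Tensor) or not isinstance(out_w, torch.Tensor):
        return False
    # Tied weights point at the same underlying memory.
    return emb_w.data_ptr() == out_w.data_ptr()


def make_toy(tied: bool) -> nn.Module:
    """Build a toy container with an embedding and an output projection."""
    model = nn.Module()
    model.token_embeddings = nn.Embedding(16, 8)
    model.output = nn.Linear(8, 16, bias=False)
    if tied:
        # Share a single parameter between the embedding and the projection.
        model.output.weight = model.token_embeddings.weight
    return model
```

A recipe could then guard the extra compilation with `if has_tied_embeddings(model): ...` and leave untied models on the existing path.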

pytorch-bot · 2024-11-01T03:39:02Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1938

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 4 Cancelled Jobs

As of commit edfa188 with merge base eab21f0:

NEW FAILURES - The following jobs have failed:

GPU tests / gpu_test (3.11, stable) (gh)
tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceRecipe::test_training_state_on_resume
Recipe Tests / recipe_test (3.11) (gh)
tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceRecipe::test_training_state_on_resume

CANCELLED JOBS - The following jobs were cancelled. Please retry:

GPU tests / gpu_test (3.10, stable) (gh)
GPU tests / gpu_test (3.9, stable) (gh)
##[error]The operation was canceled.
Recipe Tests / recipe_test (3.10) (gh)
##[error]The operation was canceled.
Recipe Tests / recipe_test (3.9) (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

felipemello1 · 2024-11-01T19:53:29Z

Compiling chunked_output will break for tied embeddings + FSDP.

Felipe Mello added 2 commits October 31, 2024 20:21

add more modules to compile

c0381b9

add comment

edfa188

felipemello1 requested a review from ebsmothers November 1, 2024 03:38

facebook-github-bot added the CLA Signed label Nov 1, 2024

felipemello1 marked this pull request as draft November 1, 2024 03:39
