Any suggestion for Llama-3.1-70B (128k seq len) deploy mesh with torchtitan? #678

Open
medivh-xp opened this issue Nov 15, 2024 · 8 comments


medivh-xp commented Nov 15, 2024

With a 128k sequence length, activation memory increases significantly.
CP8 + TP8 seems necessary (they reduce activation memory almost linearly), but there is still as much as 50 GB of activation memory.
Recomputing the MLP activations can reduce it by about 9 GB, while recomputing the attention layer or the MLP up projection seems rather costly. I noticed that the paper at https://arxiv.org/pdf/2410.06511 mentions that full checkpointing was applied to address the activation memory issue, which seems to significantly increase recomputation time.
Does TorchTitan plan to offload activations and reload them during the backward pass to reduce activation memory?

gnadathur (Contributor) commented:

cc: @XilunWu

XilunWu (Contributor) commented Nov 15, 2024

PR #592 enables CP in torchtitan. You can set context_parallel_degree (for example, 8 for CP8) in the toml file. See details in the PR description.

CP8 is enough for 128K on H100 and A100. If you still encounter OOM, you can change the activation checkpointing mode from "selective" to "full" to further reduce peak memory usage.
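
For context, "full" activation checkpointing means each transformer block keeps only its inputs and recomputes everything inside the block during backward. A minimal sketch of that pattern with PyTorch's checkpoint wrapper (not torchtitan's exact code; `model.layers` as the container of transformer blocks is an assumption here):

```python
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper,
)

def apply_full_ac(model: nn.Module) -> None:
    # Wrap every transformer block so that only the block inputs are kept for
    # backward; the activations inside each block are recomputed during backward.
    # Assumes `model.layers` is an nn.ModuleDict / nn.ModuleList of blocks.
    for name, block in model.layers.named_children():
        wrapped = checkpoint_wrapper(block, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
        model.layers.register_module(name, wrapped)
```

The memory saving comes at the cost of re-running each block's forward during backward, so expect lower MFU than with selective checkpointing.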

gnadathur (Contributor) commented:

cc: @lessw2020

medivh-xp (Author) commented Nov 18, 2024

> PR #592 enables CP in torchtitan. You can set context_parallel_degree (for example, 8 for CP8) in the toml file. See details in the PR description.
>
> CP8 is enough for 128K on H100 and A100. If you still encounter OOM, you can change the activation checkpointing mode from "selective" to "full" to further reduce peak memory usage.

@XilunWu Thank you for your reply! I noticed that in PR #467, activation memory is reduced through activation offloading. If a good balance can be struck among computation, memory, and H2D bandwidth, it seems full AC might not be necessary (I'm not sure if my understanding is correct; full-AC recomputation will significantly reduce MFU). So how should I choose between full AC and activation offloading? It seems activation offloading could, in theory, achieve a higher MFU?

tianyu-l (Contributor) commented:

@awgu can you share a bit more on the status of the activation offloading PR? E.g., is it ready to be used, and how does its performance compare with full AC on Llama models?

tianyu-l added the question and enhancement labels Nov 18, 2024
awgu (Contributor) commented Nov 18, 2024

The PR is meant as a way to add activation offloading to your model with intrusive changes. The main concern is that, for current-gen NVIDIA GPUs, the offloading may contend with inter-node collectives for PCIe bandwidth.

If you apply full activation checkpointing to each transformer block and then further apply activation offloading to the transformer block input, you accumulate no extra GPU memory per transformer block, which can help unblock long-sequence use cases.

There probably needs to be some extra work on the PR for that though.
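
For reference, PyTorch ships a generic version of this offload-and-reload pattern, torch.autograd.graph.save_on_cpu. The sketch below only illustrates the idea and is not what PR #467 implements: it offloads every tensor autograd saves (not just the block inputs) and does not overlap the copies with compute, so it runs straight into the PCIe-bandwidth concern mentioned above.

```python
import torch
import torch.nn as nn

# Toy stand-in for one block's MLP; the real model comes from torchtitan.
block = nn.Sequential(nn.Linear(8192, 28672), nn.SiLU(), nn.Linear(28672, 8192)).cuda()
x = torch.randn(4, 8192, device="cuda", requires_grad=True)

# save_on_cpu registers pack/unpack hooks: tensors autograd would normally keep on
# the GPU for backward are copied to pinned host memory during forward and copied
# back on demand during backward.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = block(x)

y.sum().backward()
```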

XilunWu (Contributor) commented Nov 19, 2024

@medivh-xp I think the general logic is:

  1. Try a larger context parallel degree to see if it unblocks your long-sequence use case (see the sketch after this list for how the degrees must multiply out). 128k works fine with the Llama3-8B model on H100 with 8 GPUs (dp_shard_degree=2 and context_parallel_degree=4). I haven't tested the Llama3-70B model, but you can easily try it out by increasing context_parallel_degree and seeing whether it works within 20 steps.
  2. If not, you can try activation checkpointing to see if this helps.
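
One constraint to keep in mind when picking degrees (my understanding of how the device mesh composes, assuming no pipeline parallelism): the parallel degrees have to multiply out to the number of GPUs.

```python
# Illustrative degree combinations only, not a recommended recipe.
def check_mesh(world_size: int, dp_shard: int, cp: int, tp: int) -> None:
    assert dp_shard * cp * tp == world_size, "parallel degrees must multiply to world size"

check_mesh(world_size=8, dp_shard=2, cp=4, tp=1)   # the Llama3-8B, 128k example above
check_mesh(world_size=64, dp_shard=1, cp=8, tp=8)  # the CP8 + TP8 layout from the original question
```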

XilunWu (Contributor) commented Nov 20, 2024

I just realized that we have a bug in torchtitan if you want to use CP without combining it with DP. The consequence is high memory usage and possibly a diverging loss.

#685 is the fix. cc @fegin @tianyu-l
