
Shouldn't memory consumption drop when using FP8? #1261

Open
JayC1208 opened this issue Oct 16, 2024 · 1 comment

Comments

@JayC1208

Hi, I am trying the example provided (https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/te_llama/tutorial_accelerate_hf_llama_with_te.html) with the Llama 2 model.

Since it is a 7B model, I assume the GPU memory usage for the weights should be around 14 GB when using FP16 (the default) and around 7 GB for FP8.
However, it still shows a memory usage of 14 GB (I used model.get_memory_footprint() and nvidia-smi to check the allocated memory).
Also, when I print the dtype of the layers' hidden states, it shows bfloat16.
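
For reference, this is roughly how I took the measurements (a minimal sketch; the model ID and the BF16 load are illustrative, and the Transformer Engine wrapping from the tutorial is omitted):

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative load of Llama 2 7B in BF16; the tutorial additionally swaps in
# Transformer Engine layers (omitted here).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
).cuda()

# Roughly 2 bytes per parameter, i.e. ~14 GB for a 7B model.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB via get_memory_footprint()")
print(f"{torch.cuda.memory_allocated() / 1e9:.1f} GB via torch.cuda.memory_allocated()")
```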

Is this normal, or is something not working on my side?
Please correct me if I have misunderstood something.

Thanks.

@timmoon10
Collaborator

timmoon10 commented Oct 17, 2024

Your memory usage is expected, but it also depends on your workflow. First let's talk about FP8 weights. One limitation of FP8 support in Hopper and Lovelace is that the Tensor Cores only support TN GEMMs for FP8 (see the cuBLAS docs). This is fine for the forward pass since that's the native format for torch.nn.Linear, but it means that we require transposes for the backward pass. Our default behavior when casting to FP8 is to use a fused cast-transpose kernel, which has better performance at the expense of having the same memory footprint as FP16/BF16.

If you are performing inference and you know you will not require any backpropagation, then you can initialize the model within a torch.no_grad context and TE will not allocate memory for the transposes. You can also do this if you are willing to accept the performance penalty from unfused transpose kernels. We are aware this can be unintuitive, so we're working on nicer ways to specify this within the FP8 recipe. Note that these considerations may change with upcoming hardware and low-precision formats.
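
As a minimal sketch of the inference-only pattern described above (the te.Linear size and the default DelayedScaling recipe are just placeholders, not the tutorial's model):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Build and run the TE model under torch.no_grad() so TE knows no backward
# pass is needed and can skip allocating FP8 transpose buffers for the weights.
with torch.no_grad():
    model = te.Linear(4096, 4096, params_dtype=torch.bfloat16).cuda().eval()

    x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)
    fp8_recipe = recipe.DelayedScaling()  # default delayed-scaling FP8 recipe
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = model(x)
```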

Next, activation tensors are usually not returned in FP8. We mostly use FP8 as GEMM inputs to take advantage of Tensor Cores, and we're generally cautious about using it for more numerically sensitive operations (e.g. activation functions, LayerNorm, GEMM outputs). Also, the internals of TE modules are somewhat messy, especially how they handle the per-tensor scaling factors needed for FP8 training. If you would like to manually access FP8 values, it might be worth looking into the experimental operation-based API (see the Quantize op from #1033), which makes use of a Float8Tensor class that mimics a plain PyTorch tensor.
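
A quick illustration of the first point (again a minimal sketch with placeholder sizes): module outputs come back in the activation dtype even when FP8 is enabled for the GEMMs.

```python
import torch
import transformer_engine.pytorch as te

layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16).cuda()
x = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True):
    out = layer(x)

print(out.dtype)  # torch.bfloat16 -- FP8 is used internally for the GEMM inputs only
```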

@timmoon10 timmoon10 added the question Further information is requested label Oct 17, 2024