No Significant Improvement Observed in Model Training Speed #409
Comments
It is likely that, because the model is too small, it is not fully utilizing the GPU, which makes the effect of Liger Kernel insignificant.
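One way to verify this is to log GPU utilization over a few steps. A minimal sketch, assuming `pynvml` is installed so that `torch.cuda.utilization()` is available (the `log_gpu_stats` helper is hypothetical):

```python
import torch

# Assumes pynvml is installed; torch.cuda.utilization() reports the
# percentage of time the GPU was busy over the last sample period.
def log_gpu_stats(step):
    util = torch.cuda.utilization()                  # SM utilization, percent
    mem_gib = torch.cuda.memory_allocated() / 2**30  # allocated memory, GiB
    print(f"step {step}: GPU util {util}%, allocated {mem_gib:.1f} GiB")

# Call log_gpu_stats(step) inside the training loop. Sustained utilization
# well below ~90% means a 1B model is not saturating an H100, so a
# kernel-level optimization like Liger will barely move wall-clock time.
```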
Thanks @ByronHsu for your response. Regarding the first point, I have configured the dataset in the YAML file to fully utilize the memory of the H100 GPU. However, I observed no significant difference in memory usage with or without Liger; the memory consumption remained unchanged. I will experiment with the second point as you suggested. Could you please explain why disabling gradient checkpointing is recommended? If it's related to Liger's underlying mechanism, I will review the relevant literature for a deeper understanding. I appreciate your third suggestion and will implement it in future trials. Thank you.
Gradient checkpointing isn't related to Liger Kernel per se, but it's a technique that trades off training speed for a reduction in memory consumption. If Liger is enabled (and reducing the memory consumption of your model's training step), then it could enable turning off grad checkpointing and thus speed up training.
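For concreteness, here is a minimal sketch of that trade-off using Hugging Face's `TrainingArguments`, which LLaMA-Factory builds on; the batch size shown is hypothetical:

```python
from transformers import TrainingArguments

# Gradient checkpointing recomputes activations during the backward pass to
# save memory, at the cost of extra compute per step. If Liger frees enough
# memory, the headroom can be spent on speed instead:
args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=False,   # off: faster steps, more memory used
    per_device_train_batch_size=8,  # hypothetical; raise until memory-bound
)
```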
Another suggestion is to hook up a profiler to see where time is being spent over the course of a few training steps. https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#using-tracing-functionality
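A minimal sketch along the lines of that recipe, where `train_step()` is a hypothetical stand-in for one forward/backward/optimizer step of your run:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

# Skip one wait and one warm-up step, then trace three steady-state steps.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./prof"),
) as prof:
    for step in range(5):
        train_step()  # hypothetical: one forward/backward/optimizer step
        prof.step()   # advance the profiler schedule

# Per-kernel breakdown; Liger's Triton kernels should show up here if active.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```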
🐛 Describe the bug
I am training the meta-llama/Llama-3.2-1B model using LLaMA-Factory with the following YAML configuration. However, I have noticed that enabling or disabling liger_kernel does not lead to any noticeable reduction in training time; the runtime metrics remain nearly identical in both cases. Are there specific parameter settings in my YAML configuration that might be preventing liger_kernel from functioning optimally? Thanks :(
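For reference, the same patching can also be triggered directly from Python with the liger-kernel API, independent of the LLaMA-Factory flag, which can help confirm whether the kernels are active at all. A minimal sketch, assuming the liger-kernel package is installed:

```python
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# Monkey-patches the Llama modeling code (RMSNorm, RoPE, SwiGLU, fused
# cross-entropy, ...) with Liger's Triton kernels; call it before the
# model is instantiated.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# If step times still do not change, the bottleneck is likely elsewhere
# (model too small to saturate the GPU, data loading, or checkpointing).
```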
Reproduce
Versions
Environment Report:
Operating System: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
Python version: 3.12.7
PyTorch version: 2.5.1+cu124
CUDA version: 12.4
Triton version: 3.1.0
Transformers version: 4.46.1