Does it support distillation for large models like Qwen2-72B and LLaMA 3.1-70B? #2117
Comments
Hey @lingq1, it should work with any model; the issue is how much memory you have available. In the tests performed, smaller models were used. I personally haven't tried the 70B. Please change the config and give it a try, and let us know if you hit any blockers.
OK, thanks.
@lingq1 You'll probably need to use the distributed recipe for larger teacher models. I've tested the KD distributed recipe on Llama3.1 70B.
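For context on why the distributed recipe is needed: a 70B-parameter teacher is roughly 140GB of weights in bf16, so it cannot sit on a single 80GB GPU and has to be sharded across ranks alongside the student. The sketch below illustrates that idea with plain PyTorch FSDP; `build_teacher` and `build_student` are hypothetical placeholder constructors, and this is not the recipe's actual implementation.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def setup_kd_models(build_teacher, build_student):
    """Shard a frozen teacher and a trainable student across all ranks.

    Assumes torch.distributed is already initialized (e.g. via torchrun) and
    that build_teacher/build_student are placeholder constructors returning
    nn.Module instances.
    """
    rank = dist.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    # The teacher is too large for one GPU, so its parameters are sharded
    # across ranks. It is frozen: forward passes only, no gradients and no
    # optimizer state.
    teacher = build_teacher().to(torch.bfloat16)
    for p in teacher.parameters():
        p.requires_grad_(False)
    teacher = FSDP(teacher, device_id=device)
    teacher.eval()

    # The student is sharded as well so that its gradients and optimizer
    # state also stay within the per-GPU memory budget.
    student = FSDP(build_student().to(torch.bfloat16), device_id=device)
    return teacher, student
```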
Awesome. Roughly how much VRAM was consumed in total, and how was the performance?
I don't remember the exact numbers. For reference, I used 8 x 80GB A100 GPUs and didn't have any issues with the default distributed recipe setup. A recorded experiment showed roughly 60% memory usage, so maybe around 50GB per GPU?
Thanks. How effective is the final distilled model, and is there anything special to consider when configuring the 70B distillation recipe?
@lindawangg Could you provide more details on the scenarios where you used the 70B model as a distillation teacher? I have fine-tuned a 7B model for a niche monitoring use case, but I'm short on GPU resources and would like to reduce GPU usage through distillation.
I haven't done any extensive studies on using the 70B model. We only did some experiments to make sure we're able to run the distributed recipe. I did find the 70B loss to be fairly high, so had to lower the
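On the loss weighting mentioned above: KD recipes of this kind typically blend the ordinary cross-entropy loss with a forward-KL term against the teacher's logits, weighted by a ratio (kd_ratio in torchtune's KD configs). If the teacher term runs high, lowering that ratio keeps the total loss dominated by the label loss. Below is a minimal PyTorch sketch of that blend, as an illustration rather than the recipe's exact code.

```python
import torch
import torch.nn.functional as F

def kd_loss_step(student_logits, teacher_logits, labels, kd_ratio=0.5):
    """Blend hard-label cross-entropy with a forward-KL distillation term.

    student_logits / teacher_logits: (batch, seq_len, vocab) tensors.
    labels: (batch, seq_len) token ids, with -100 marking ignored positions.
    kd_ratio weights the teacher term; if the distillation loss runs hot
    (as reported for the 70B teacher above), a smaller kd_ratio keeps the
    total loss dominated by the ordinary label loss.
    """
    vocab = student_logits.size(-1)

    # Standard next-token cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.reshape(-1, vocab), labels.reshape(-1), ignore_index=-100
    )

    # Forward KL(teacher || student), averaged over all token positions.
    teacher_probs = F.softmax(teacher_logits.reshape(-1, vocab), dim=-1)
    student_logp = F.log_softmax(student_logits.reshape(-1, vocab), dim=-1)
    kd = F.kl_div(student_logp, teacher_probs, reduction="batchmean")

    return (1.0 - kd_ratio) * ce + kd_ratio * kd
```

With this formulation, halving the ratio halves the teacher term's contribution without touching the learning rate or the data.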
Thanks.
Going to close this issue since it seems like the questions have been answered. @lingq1 feel free to reopen if you need any follow-up here.