
Does it support distillation for large models like Qwen2-72B and LLaMA 3.1-70B? #2117

Closed
lingq1 opened this issue Dec 6, 2024 · 10 comments
Labels: discussion

Comments

lingq1 commented Dec 6, 2024

Does it support distillation for large models like Qwen2-72B and LLaMA 3.1-70B?

felipemello1 (Contributor) commented

hey @lingq1, it should work with any model. The issue is how much memory you have available. In the tests performed, smaller models were used. I personally haven't tried the 70B. Please change the config and give it a try. Let us know if you hit any blockers.
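For reference, a minimal sketch of that workflow with the `tune` CLI. The KD recipe and config names below are assumptions, so check `tune ls` for what your torchtune version actually ships:

```bash
# List available recipes/configs and look for the knowledge-distillation ones.
tune ls | grep -i knowledge

# Copy a built-in KD config and edit the student/teacher models and checkpoint paths.
# (The qwen2 config name here is an example and may not match your install.)
tune cp qwen2/1.5_to_0.5B_KD_lora_single_device my_kd_config.yaml

# Run the single-device KD recipe with the edited config.
tune run knowledge_distillation_single_device --config my_kd_config.yaml
```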

lingq1 (Author) commented Dec 6, 2024

ok, thanks

lindawangg (Contributor) commented

@lingq1 You'll probably need to use the distributed recipe for larger teacher models. I've tested the KD distributed recipe on Llama3.1 70B.
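A rough sketch of what that launch could look like (the recipe name, config name, and teacher checkpointer field are from memory and may differ in your install; confirm against `tune ls` and the config you copy):

```bash
# Launch the distributed KD recipe across 8 GPUs, pointing the teacher at the 70B checkpoint.
# Swap in your own config and checkpoint paths.
tune run --nproc_per_node 8 knowledge_distillation_distributed \
  --config llama3_2/8B_to_1B_KD_lora_distributed \
  teacher_checkpointer.checkpoint_dir=/path/to/Meta-Llama-3.1-70B-Instruct
```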

lingq1 (Author) commented Dec 10, 2024

Awesome. Approximately how much VRAM was consumed in total, and how was the performance?

joecummings added the "discussion" label on Dec 10, 2024
lindawangg (Contributor) commented

I don't remember the exact numbers. For reference, I used 8 × 80GB A100 GPUs and didn't have any issues with the default distributed recipe setup. It used roughly 60% of memory according to a recorded experiment, so maybe around 50GB per GPU?

lingq1 (Author) commented Dec 11, 2024

Thanks. How effective is the final distilled model, and is there anything special to consider when configuring the 70B distillation recipe?

lingq1 (Author) commented Dec 13, 2024

@lindawangg Could you provide more details on the scenarios where you used the 70B model as a teacher for distillation? I have fine-tuned a 7B model for a niche monitoring use case, but I'm short on GPU resources and would like to reduce GPU usage through distillation.

lindawangg (Contributor) commented Dec 13, 2024

I haven't done any extensive studies on the 70B model. We only ran some experiments to make sure we're able to run the distributed recipe. I did find the KD loss with the 70B teacher to be fairly high, so I had to lower the kd_ratio to 0.1. I didn't see any significant improvements, though. It could be interesting to try some of the other KD losses proposed in #2094, which may yield better results than the default KL divergence loss.
(two screenshots attached in the original comment)
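For anyone reproducing this, a hedged sketch of the overrides. In the KD recipes the total loss is roughly (1 - kd_ratio) * cross-entropy + kd_ratio * KD loss, so kd_ratio=0.1 keeps the student mostly on the hard-label loss. The forward-KL class path below is an assumption; check your config for the exact `_component_`:

```bash
# Lower the KD weight and (optionally) set the KD loss; field names assumed, verify in your config.
tune run --nproc_per_node 8 knowledge_distillation_distributed \
  --config my_70b_kd_config.yaml \
  kd_ratio=0.1 \
  kd_loss._component_=torchtune.modules.loss.ForwardKLWithChunkedOutputLoss
```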

lingq1 (Author) commented Dec 16, 2024

Thanks.

ebsmothers (Contributor) commented

Going to close this issue since it seems like the questions have been answered. @lingq1 feel free to reopen if you need any followup here
