
Does it support distillation for large models like Qwen2-72B and LLaMA 3.1-70B? #2117

Closed
lingq1 opened this issue Dec 6, 2024 · 10 comments
Labels: discussion

Comments

lingq1 commented Dec 6, 2024

Does it support distillation for large models like Qwen2-72B and LLaMA 3.1-70B?

felipemello1 (Contributor) commented

hey @lingq1, it should work with any model. The issue is how much memory you have available. In the tests performed, smaller models were used. I personally haven't tried the 70B. Please change the config and give it a try. Let us know if you hit any blockers.
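For reference, a minimal sketch of that workflow with the `tune` CLI. The KD recipe and config names below are assumptions, so check `tune ls` for what your torchtune version actually ships:

```bash
# List available recipes/configs and look for the knowledge-distillation ones.
tune ls | grep -i knowledge

# Copy a built-in KD config and edit the student/teacher models and checkpoint paths.
# (The qwen2 config name here is an example and may not match your install.)
tune cp qwen2/1.5_to_0.5B_KD_lora_single_device my_kd_config.yaml

# Run the single-device KD recipe with the edited config.
tune run knowledge_distillation_single_device --config my_kd_config.yaml
```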

lingq1 (Author) commented Dec 6, 2024

ok, thanks

lindawangg (Contributor) commented

@lingq1 You'll probably need to use the distributed recipe for larger teacher models. I've tested the KD distributed recipe on Llama3.1 70B.
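A rough sketch of what that launch could look like (the recipe name, config name, and teacher checkpointer field are from memory and may differ in your install; confirm against `tune ls` and the config you copy):

```bash
# Launch the distributed KD recipe across 8 GPUs, pointing the teacher at the 70B checkpoint.
# Swap in your own config and checkpoint paths.
tune run --nproc_per_node 8 knowledge_distillation_distributed \
  --config llama3_2/8B_to_1B_KD_lora_distributed \
  teacher_checkpointer.checkpoint_dir=/path/to/Meta-Llama-3.1-70B-Instruct
```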

lingq1 (Author) commented Dec 10, 2024

Awesome. Approximately how much VRAM was consumed in total, and how was the performance?

joecummings added the "discussion" label on Dec 10, 2024
lindawangg (Contributor) commented

I don't remember the exact numbers. For reference, I used 8 × 80GB A100 GPUs and didn't have any issues with the default distributed recipe setup. It used roughly 60% of memory according to a recorded experiment, so maybe around 50GB per GPU?

lingq1 (Author) commented Dec 11, 2024

Thanks. How effective is the final distilled model, and is there anything special to consider when configuring the 70B distillation recipe?

lingq1 (Author) commented Dec 13, 2024

@lindawangg Could you provide more details on the scenarios where you used the 70B model as a teacher for distillation? I have fine-tuned a 7B model for a niche monitoring use case, but I'm short on GPU resources and would like to reduce GPU usage through distillation.

lindawangg (Contributor) commented Dec 13, 2024

I haven't done any extensive studies on the 70B model. We only ran some experiments to make sure we're able to run the distributed recipe. I did find the KD loss with the 70B teacher to be fairly high, so I had to lower the kd_ratio to 0.1. I didn't see any significant improvements, though. It could be interesting to try some of the other KD losses proposed in #2094, which may yield better results than the default KL divergence loss.
(two screenshots attached in the original comment)
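For anyone reproducing this, a hedged sketch of the overrides. In the KD recipes the total loss is roughly (1 - kd_ratio) * cross-entropy + kd_ratio * KD loss, so kd_ratio=0.1 keeps the student mostly on the hard-label loss. The forward-KL class path below is an assumption; check your config for the exact `_component_`:

```bash
# Lower the KD weight and (optionally) set the KD loss; field names assumed, verify in your config.
tune run --nproc_per_node 8 knowledge_distillation_distributed \
  --config my_70b_kd_config.yaml \
  kd_ratio=0.1 \
  kd_loss._component_=torchtune.modules.loss.ForwardKLWithChunkedOutputLoss
```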

lingq1 (Author) commented Dec 16, 2024

Thanks.

ebsmothers (Contributor) commented

Going to close this issue since it seems like the questions have been answered. @lingq1 feel free to reopen if you need any followup here
