Request: DeepSeek Coder V2 and Coder Lite V2 #63

bitbottrap · 2024-08-29T18:24:09Z

bitbottrap
Aug 29, 2024

I tried this with DeepSeek Lite V2 and the resource improvement was really great. Of course I tried the same optimization files on the Coder models but that failed. This looks really promising for the usability of MOE models.

I'd really like to request CPU, 1GPU, and 2GPU (24GB per GPU) versions of these that supports their full 128K context. Even with your examples I'm not sure I'd get this right.

Azure-Tang · 2024-08-30T03:12:42Z

Azure-Tang
Aug 30, 2024
Collaborator

I’m glad to hear that you like this project! For now 1m doesn't support multi-gpu orz. We’re considering adding multi-GPU support to 1m in the next release.

Regarding the issue with the coder, could you let us know how you ran DeepseekV2 Coder? Please share the run command and the optimised YAML file you used. Keep in mind that the lite version has only 28 layers, while the coder version has 60 layers, which is important when writing your optimised YAML.

6 replies

Azure-Tang Aug 30, 2024
Collaborator

I believe you can run Coderv2 with the optimization rules used in Chat, as they share the same model framework.
If you encounter any issues, please feel free to ask here.

16x3b Sep 1, 2024

Thank you for your help. I got it working, but wonder if I should be getting better performance from [AMD EPYC™ 7282] 128gb ram- 4x3090 with nvlink.

Chat: how fast can you get back to me?
I'm here to assist you in real-time! What can I help you with?
prompt eval count: 16 token(s)
prompt eval duration: 38.15813732147217s
prompt eval rate: 0.4193076791250121 tokens/s
eval count: 20 token(s)
eval duration: 68.91125845909119s
eval rate: 0.2902283378248403 tokens/s

If there's anything I can do to optimize please let me know.

python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-Coder-V2-Instruct --gguf_path ~/.cache/lm-studio/models/bartowski/DeepSeek-Coder-V2-Instruct-GGUF --optimize_rule_path ~/.cache/lm-studio/models/bartowski/DeepSeek-Coder-V2-Instruct-GGUF/DeepSeek-V2-Chat-multi-gpu-4.yaml

UnicornChan Sep 2, 2024
Maintainer

Thank you for your help. I got it working, but wonder if I should be getting better performance from [AMD EPYC™ 7282] 128gb ram- 4x3090 with nvlink.

Chat: how fast can you get back to me? I'm here to assist you in real-time! What can I help you with? prompt eval count: 16 token(s) prompt eval duration: 38.15813732147217s prompt eval rate: 0.4193076791250121 tokens/s eval count: 20 token(s) eval duration: 68.91125845909119s eval rate: 0.2902283378248403 tokens/s

If there's anything I can do to optimize please let me know.

python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-Coder-V2-Instruct --gguf_path ~/.cache/lm-studio/models/bartowski/DeepSeek-Coder-V2-Instruct-GGUF --optimize_rule_path ~/.cache/lm-studio/models/bartowski/DeepSeek-Coder-V2-Instruct-GGUF/DeepSeek-V2-Chat-multi-gpu-4.yaml

I guess it should be due to insufficient memory. We used 136GB of DRAM when running DeepSeek v2. A total of 128GB DRAM means you are using swap, which will cause a significant performance drop. You can try smaller models or Q3 quantization. Here is a new PR #72 that we have not tested yet. It uses IQ4_XS and runs on 128GB DRAM in #55 , might be helpful.

16x3b Sep 4, 2024

@UnicornChan I will get back to you with the results of adding more ram next week.

I haven't tried it yet, but I noticed each of the four 3090's still have a decent amount of ram free I suppose because really the build only takes 24gb. I thought is I wonder if its possible to host small models on the remaining space. Like smaller 7-13b models. That way you could run Deepseek and 4 other smaller models or maybe layer another MOE. Let me know if this is already possible with just Ktransformers or if I should try this with Ktransformers + another inferencing backend like Ollama.

Fingers crossed I see improved performance after adding the extra ram.

16x3b Sep 6, 2024

@UnicornChan RAM arrived faster than expected.

Much improved! Good stuff!

Chat: How fast can you get back to me
I'll do my best to respond as quickly as possible!
prompt eval count: 15 token(s)
prompt eval duration: 1.0287458896636963s
prompt eval rate: 14.580860201447413 tokens/s
eval count: 14 token(s)
eval duration: 1.5694901943206787s
eval rate: 8.92009395831849 tokens/s

bitbottrap · 2024-08-31T21:31:58Z

bitbottrap
Aug 31, 2024
Author

Thank you. That was an oversight by me. It definitely does work with the Coder models.

Works great. I'll be looking to see if I can modify the yaml to better optimize the larger DeepSeek model for my particular environment. After the base functionality that seems like a great addition if it's possible without human direction.

python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --gguf_path /mnt/models/sv-llm:deepseek-coder-v2:16b-lite-instruct-q8_0

1 reply

Azure-Tang Sep 2, 2024
Collaborator

LOL, human direction may be inevitable to tailor solutions to different environments. I’m aware that Nemo-megatron and Paddle have an auto-tune parameter function; we might consider implementing this later. But regardless, that will take some time.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Request: DeepSeek Coder V2 and Coder Lite V2 #63

{{title}}

Replies: 2 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Request: DeepSeek Coder V2 and Coder Lite V2 #63

bitbottrap Aug 29, 2024

Replies: 2 comments · 7 replies

Azure-Tang Aug 30, 2024 Collaborator

Azure-Tang Aug 30, 2024 Collaborator

16x3b Sep 1, 2024

UnicornChan Sep 2, 2024 Maintainer

16x3b Sep 4, 2024

16x3b Sep 6, 2024

bitbottrap Aug 31, 2024 Author

Azure-Tang Sep 2, 2024 Collaborator

bitbottrap
Aug 29, 2024

Replies: 2 comments 7 replies

Azure-Tang
Aug 30, 2024
Collaborator

Azure-Tang Aug 30, 2024
Collaborator

UnicornChan Sep 2, 2024
Maintainer

bitbottrap
Aug 31, 2024
Author

Azure-Tang Sep 2, 2024
Collaborator