
Slow CUDA inference speed #763

Open
ShelbyJenkins opened this issue Sep 8, 2024 · 2 comments
ShelbyJenkins commented Sep 8, 2024

This reports mistral.rs as being faster than llama.cpp: #612

But I'm seeing much slower speeds with the same prompt and settings:

Mistral.rs
Usage { completion_tokens: 501, prompt_tokens: 28, total_tokens: 529, avg_tok_per_sec: 16.980707, avg_prompt_tok_per_sec: 76.08695, avg_compl_tok_per_sec: 16.27416, total_time_sec: 31.153, total_prompt_time_sec: 0.368, total_completion_time_sec: 30.785 }

llama.cpp
timings: {"predicted_ms": 4007.64, "prompt_per_token_ms": 0.7041786, "predicted_per_token_ms": 8.01528, "prompt_ms": 19.717, "prompt_per_second": 1420.0944, "predicted_n": 500.0, "prompt_n": 28.0, "predicted_per_second": 124.7617}
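The two tools report timings in different units (mistral.rs in seconds, llama.cpp in milliseconds), so here is a quick sketch putting both logs above on a tokens-per-second footing, using only the numbers they report:

```python
# Normalize both decode-throughput reports to tokens/sec.
# mistral.rs: completion_tokens / total_completion_time_sec (seconds)
mistral_rs_tok_per_sec = 501 / 30.785

# llama.cpp: predicted_n / predicted_ms (milliseconds)
llama_cpp_tok_per_sec = 500 / (4007.64 / 1000)

print(f"mistral.rs: {mistral_rs_tok_per_sec:.2f} tok/s")  # ~16.27
print(f"llama.cpp:  {llama_cpp_tok_per_sec:.2f} tok/s")   # ~124.76
print(f"gap:        {llama_cpp_tok_per_sec / mistral_rs_tok_per_sec:.1f}x")
```

These match the `avg_compl_tok_per_sec` and `predicted_per_second` fields the tools themselves print, so the roughly 7.7x decode gap is consistent within each log, not a units mix-up.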

The code I'm using to init mistral.rs:
https://github.com/ShelbyJenkins/llm_client/blob/b1edca89bbdc34b884907fd39be6eedabf10d81b/src/llm_backends/mistral_rs/builder.rs#L110

I'm using the basic completion tests here:
https://github.com/ShelbyJenkins/llm_client/blob/b1edca89bbdc34b884907fd39be6eedabf10d81b/src/basic_completion.rs#L158

Testing on Ubuntu, inside an Ubuntu Docker container (FROM nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04). I've tried loading all layers onto a single GPU using the dummy device map, and splitting them across both GPUs using the device mapper. The GPUs are 3090s, and testing is done with Phi 3 mini.
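As a sanity check on the setup above (a minimal sketch, assuming the NVIDIA Container Toolkit is installed on the host), it's worth confirming that the same base image actually sees both 3090s and the expected driver/CUDA versions before blaming the inference code:

```shell
# Run nvidia-smi inside the same CUDA base image the Dockerfile uses;
# both 3090s and a CUDA 12.x-capable driver should be listed.
docker run --rm --gpus all \
  nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04 nvidia-smi
```

If only one GPU appears, or the driver's CUDA version is older than the image's toolkit, device mapping inside the container won't behave as expected.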

@ShelbyJenkins ShelbyJenkins added the bug Something isn't working label Sep 8, 2024
@EricLBuehler EricLBuehler added optimization and removed bug Something isn't working labels Sep 8, 2024
@ShelbyJenkins (Author) commented:

I need to test with the CUDA version specified in the Docker container, and if that doesn't work I will run the benchmark following the instructions from the announcement linked above.

@ShelbyJenkins (Author) commented:

Updated to the same Docker image (though not the same Dockerfile). No change in speed.
