
Is there a reason why the backend couldn't be selected at runtime? #891

Open
Please-just-dont opened this issue Jul 13, 2024 · 5 comments

We select the backend at build time by choosing CUDA, Vulkan, SYCL, etc. Wouldn't it be better to build with all the backends you want to support and then select the backend at runtime? It's literally just one runtime if statement, and that would make it much easier to compare the performance of the different backends.
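For illustration, a minimal sketch of what such a runtime switch could look like, assuming a single binary compiled with both the CPU and CUDA backends. `select_backend` is a hypothetical helper, not an existing ggml API, and the `GGML_USE_CUDA` guard is an assumption (the macro name has changed across versions):

```cpp
#include "ggml-backend.h"
#ifdef GGML_USE_CUDA   // assumed compile-time guard
#include "ggml-cuda.h"
#endif

// Hypothetical helper (not an existing ggml API): pick a backend at runtime
// from whatever was compiled into the binary.
static ggml_backend_t select_backend(bool use_gpu) {
#ifdef GGML_USE_CUDA
    if (use_gpu) {
        return ggml_backend_cuda_init(0);  // device 0
    }
#endif
    return ggml_backend_cpu_init();
}
```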

@slaren (Collaborator) commented Jul 15, 2024

Backends often need to link against shared libraries that may not be available on systems without the supported hardware drivers installed. E.g., you can't run the CUDA backend on a system without the CUDA driver. In the future I would like to move the backends to dynamic libraries that can be loaded at runtime, but that's a more complex change than an if statement.
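Purely as a sketch of the dynamic-loading idea described here — nothing like this existed in ggml at the time of this comment, and the library path and init symbol below are made up:

```cpp
// Illustrative only: loading a backend from a shared library at runtime.
#include <dlfcn.h>
#include "ggml-backend.h"

typedef ggml_backend_t (*ggml_backend_init_fn)(void);

static ggml_backend_t try_load_backend(const char * libpath, const char * init_symbol) {
    void * handle = dlopen(libpath, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        // e.g. the CUDA backend library refuses to load because the driver is missing
        return nullptr;
    }
    auto init = (ggml_backend_init_fn) dlsym(handle, init_symbol);
    return init != nullptr ? init() : nullptr;
}
```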

@Please-just-dont (Author)

You can easily have the host-side CPU inference path behind an if statement, right? It would be really convenient to switch it out and see the performance difference. For example, I found that my Vulkan implementation performs about the same as my CPU with 4 threads.

@ngxson (Contributor) commented Jul 27, 2024

Switching backends at runtime requires building all backends in the first place, which is complicated to set up, takes a lot of time, and produces a large binary. For the same reason, PyTorch offers different packages for CUDA/CPU/ROCm.

Out of the box, ggml comes with CPU + a backend of your choice. The ggml_backend_sched interface can be used to run a hybrid of CPU + backend at the same time. Furthermore, the RPC backend allows you to build one ggml "client" for each backend and use sched to mix and match them. IMO that's already a lot of flexibility.
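A rough sketch of what mixing backends through ggml_backend_sched can look like. The exact ggml_backend_sched_new signature has changed between ggml revisions, so treat this as an approximation rather than the definitive call:

```cpp
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

// Sketch: schedule a compute graph across the CUDA and CPU backends.
// Check ggml-backend.h in your tree for the exact ggml_backend_sched_new signature.
static ggml_backend_sched_t make_hybrid_sched(void) {
    ggml_backend_t cuda = ggml_backend_cuda_init(0);  // device 0
    ggml_backend_t cpu  = ggml_backend_cpu_init();

    // Higher-priority backends first, CPU last as the fallback.
    ggml_backend_t backends[] = { cuda, cpu };

    return ggml_backend_sched_new(
        backends, /*bufts=*/NULL, /*n_backends=*/2,
        /*graph_size=*/GGML_DEFAULT_GRAPH_SIZE, /*parallel=*/false);
}

// Build a ggml_cgraph * graph as usual, then run it with:
//   ggml_backend_sched_graph_compute(sched, graph);
```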

@slaren (Collaborator) commented Jul 27, 2024

There is nothing stopping you from building ggml with multiple backends and using all of them with ggml_backend_sched (other than maybe a broken build script). It's just not practical at the moment because, for some backends, the resulting binary will fail to run on computers without the corresponding drivers installed.

@WilliamTambellini (Contributor)

+1 for that feature.
At least an easy way to choose at runtime between the CUDA and CPU backends.
It still doesn't seem to be doable as of today with llama-cli.
Could you point to the API to call in order to use the CPU backend when building with the ggml-cuda lib?
Best
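Not an authoritative answer, but the usual way to keep inference on the CPU with a CUDA-enabled llama.cpp build is to offload zero layers: pass `-ngl 0` to llama-cli, or set n_gpu_layers to 0 in code. Depending on the version, the CUDA backend may still be used for some large batched matrix multiplications even then. A minimal sketch ("model.gguf" is a placeholder path); at the ggml level, the CPU backend itself is created with ggml_backend_cpu_init():

```cpp
#include "llama.h"

// Sketch: keep the model on the CPU even though the binary was built with ggml-cuda.
int main(void) {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0;  // equivalent to passing -ngl 0 to llama-cli

    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }
    // ... create a context and run inference as usual ...
    llama_free_model(model);
    return 0;
}
```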
