Whisper on webGPU? #100

Closed · sandorkonya opened this issue Apr 25, 2023 · 16 comments · Fixed by #545
Labels: question (Further information is requested)

Comments

@sandorkonya

Somewhat related to this thread.

Is it within scope to implement a WebGPU-accelerated version of Whisper?

Not sure if this helps, but there is a C port of Whisper with a CPU implementation, and, as mentioned in this discussion, the main thing that needs to be offloaded to the GPU is the GGML_OP_MUL_MAT operator.

@DK013

DK013 commented Apr 25, 2023

> Is it within scope to implement a WebGPU-accelerated version of Whisper?

As I understand it, it's simply a matter of changing the execution provider to JSEP now. The C++ port uses the GGML format for the model, and this repo uses ONNX models alongside onnxruntime to run inference. The two implementations are different. And with WebGPU support for onnxruntime (check this PR: [js/web] WebGPU backend via JSEP #14579), which was merged today, with an official release build coming soon enough, I believe we don't have to worry about CUDA or DirectML endpoints; JSEP does the work for us. It's only a matter of updating the onnxruntime dependency and using JSEP as the execution provider.

@xenova correct me if I'm wrong.
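
A minimal sketch of what that switch looks like with a JSEP-enabled onnxruntime-web build (the import subpath and model path here are assumptions based on later release builds, not the locally built main branch of the time):

```js
import * as ort from 'onnxruntime-web/webgpu'; // JSEP/WebGPU bundle

// Hypothetical ONNX export of the Whisper encoder; any ONNX model is
// loaded the same way.
const session = await ort.InferenceSession.create('./whisper-encoder.onnx', {
  // 'webgpu' instead of the default 'wasm'; listing 'wasm' second keeps a
  // CPU fallback if WebGPU is unavailable.
  executionProviders: ['webgpu', 'wasm'],
});
```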

@xenova
Collaborator

xenova commented Apr 25, 2023

> > Is it within scope to implement a WebGPU-accelerated version of Whisper?
>
> As I understand it, it's simply a matter of changing the execution provider to JSEP now. The C++ port uses the GGML format for the model, and this repo uses ONNX models alongside onnxruntime to run inference. The two implementations are different. And with WebGPU support for onnxruntime (check this PR: [js/web] WebGPU backend via JSEP #14579), which was merged today, with an official release build coming soon enough, I believe we don't have to worry about CUDA or DirectML endpoints; JSEP does the work for us. It's only a matter of updating the onnxruntime dependency and using JSEP as the execution provider.
>
> @xenova correct me if I'm wrong.

Yep, that's correct! It should be as simple as changing the execution provider to webgpu (vs. wasm).

Hopefully they will make the release soon, but in the meantime, I'll do some testing by building the main branch locally.

@sandorkonya
Author

@DK013 & @xenova thank you for the clarification!

I would like to find a way to utilize the GPUs on edge devices (Android mobile) for inference.

As far as I understand, WebGPU (as of now) works on Windows & macOS (my assumption, based on this blog post), so do we have to wait until WebGPU targets Android devices too?

Or am I simply wrong and onnxruntime won't be the way for edge devices?

best regards

@xenova
Collaborator

xenova commented Apr 25, 2023

Yes, you are correct. WebGPU would need to be available in your browser, as onnxruntime just uses the API provided by the browser.

That said, you might not have to wait very long. As stated in the blog post you linked: "This initial release of WebGPU is available on ChromeOS, macOS, and Windows. Support for other platforms is coming later this year." If you'd like to test while you develop (so you can be ready when it releases fully), you can use Chrome Canary. As demoed here, some users have already got WebGPU running on their Android devices with this browser (which is just an experimental version of Chrome).
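
As a quick check, WebGPU availability can be probed from the page itself with the standard browser API (a small sketch, independent of onnxruntime):

```js
// Feature-detect WebGPU before picking an execution provider.
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;
  try {
    // requestAdapter() resolves to null if no suitable GPU is exposed.
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}

console.log((await hasWebGPU()) ? 'WebGPU available' : 'WebGPU not available');
```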

@drcodecamp

@xenova how can we use GPU power when we use Node.js?

I'm trying to build a local server with Node; everything works, but it's very slow on an AMD 5950X. I would like to use my RTX 4070 Ti to transcribe, but I couldn't find any documentation that talks about it.

@Dolidodzik

@xenova, is there any news? Will we be able to use WebGPU with transformers.js any time soon?

@gabrielgrant

gabrielgrant commented Nov 28, 2023

AFAIU, ONNX Runtime's support for WebGPU is still pretty minimal/experimental, so it likely isn't able to run Whisper today.

Overview issue is here: microsoft/onnxruntime#15796

There doesn't seem to be much up-to-date, detailed documentation about the current status publicly available, but as of May, many operators had yet to be ported: microsoft/onnxruntime#15952

@guschmue
Contributor

guschmue commented Dec 7, 2023

ort-web on WebGPU now has good op coverage, and we can run most models that transformers.js supports. Whisper is fine; it is part of our test suite.
The reason we have not been more public about it is that we still have a performance issue with generative decoders that go one token at a time (i.e. the Whisper decoder, the t5 decoder).
We are debugging that; we don't know what the cause is, but we are sure it is not the shaders.
All encoders and vision models should see good perf gains.
Supported ops can be found here: https://github.com/microsoft/onnxruntime/blob/main/js/web/docs/webgpu-operators.md
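
One way to check whether a particular model's ops stay on the GPU or fall back to CPU is to turn up onnxruntime-web's logging (a sketch; the exact log output varies by release):

```js
import * as ort from 'onnxruntime-web/webgpu';

// Verbose logging reports, among other things, kernels that could not be
// assigned to the WebGPU EP and therefore fall back to the CPU/wasm EP.
ort.env.logLevel = 'verbose';
ort.env.debug = true;

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu'],
});
```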

@gabrielgrant

thanks for the update @guschmue !

Is there a GH issue for the problem you're describing? Is it this one: microsoft/onnxruntime#17373?

@guschmue
Contributor

guschmue commented Dec 8, 2023

That issue contains a couple of problems: missing ops resulted in cross-device copies, and missing io-bindings resulted in a lot of cross-device copies. I think we fixed most of those. But this decoder issue has been in there too, i.e. the io-bindings should have gained much more than they did.
Nasty issue: lots of GPU cycles available, kernel times look good, little cross-device copying, yet it's 2x slower than we want. Top of our list.
I can file a separate issue.

@guschmue
Contributor

guschmue commented Dec 8, 2023

microsoft/onnxruntime#18754

@gokaybiz

gokaybiz commented Dec 8, 2023

What about Node.js? Will WebGPU/GPU acceleration be available on the server/desktop side without a browser?

@tarekziade

@xenova I am curious to try. Do you have builds with WebGPU ?

I've built onnxruntime with the jsep option, but I am not entirely sure which spots to change in transformers.js. Is it as simple as passing executionProviders to ort.InferenceSession.create?
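
For reference, that call looks roughly like this (a sketch; the model path and input shape are hypothetical and depend on the ONNX export):

```js
import * as ort from 'onnxruntime-web/webgpu'; // the JSEP-enabled build

// Hypothetical local ONNX export of the Whisper encoder.
const session = await ort.InferenceSession.create('./encoder_model.onnx', {
  executionProviders: ['webgpu'],
});

// Dummy log-mel input just to exercise the session; real shapes depend on
// the exported model (Whisper encoders typically take [1, 80, 3000]).
const feeds = {
  input_features: new ort.Tensor(
    'float32',
    new Float32Array(1 * 80 * 3000),
    [1, 80, 3000],
  ),
};
const results = await session.run(feeds);
console.log(Object.keys(results));
```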

@DavidGOrtega
Contributor

Additionally, another optimization should be done: STFT.

@nmstoker

nmstoker commented Jun 9, 2024

For anyone coming here who didn't see it yet, there is WebGPU support now, thanks to Xenova's efforts described here.

Code in this branch: https://github.com/xenova/whisper-web/tree/experimental-webgpu
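
That branch builds on the then-experimental transformers.js v3 API; a minimal sketch of the WebGPU path (package name and model ID as in the later v3 releases, so treat them as assumptions for the experimental branch):

```js
import { pipeline } from '@huggingface/transformers';

// device: 'webgpu' selects the WebGPU execution provider when available.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'onnx-community/whisper-tiny.en',
  { device: 'webgpu' },
);

// Accepts a URL or path to an audio file.
const output = await transcriber('audio.wav');
console.log(output.text);
```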

@guschmue
Contributor

> What about Node.js? Will WebGPU/GPU acceleration be available on the server/desktop side without a browser?

There is an experimental code path in Dawn that one could use to make onnxruntime work with WebGPU on Node.js.
But we are not sure people would use that path, since onnxruntime-node already supports CUDA and DirectML, which are faster than WebGPU.
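
For Node.js today, the native EPs are the more direct route; a sketch with onnxruntime-node (assuming a CUDA-enabled build and a matching driver; the model path is hypothetical):

```js
const ort = require('onnxruntime-node');

async function main() {
  // 'cuda' requires the GPU-enabled onnxruntime-node binaries and a matching
  // CUDA driver; listing 'cpu' second provides a fallback.
  const session = await ort.InferenceSession.create('./model.onnx', {
    executionProviders: ['cuda', 'cpu'],
  });
  console.log('model inputs:', session.inputNames);
}

main();
```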
