WebNN should support NPU and QDQ operations #623
@wchao1115, thanks for this proposal that outlines key elements for NPU support. I'll schedule this important topic for discussion at our upcoming meeting. As you noted, these topics (NPU device type, support for quantized models) have been explored in the group before and have been awaiting implementation experience. The timing is now appropriate for the group to reinvigorate this topic, with NPU platforms more widely available and in the hands of consumers. Most importantly, the group can now validate proposed spec designs with implementation experience per our established work mode. I'm looking forward to this discussion. Meanwhile, questions and comments are welcome in this issue from everyone.
Hi, thanks for bringing this up! I'd like to highlight a couple of things based on my current implementation experience: ...
We've discussed 3 possible options for extending the MLContextOptions::MLDeviceType: ...

Other considerations

- Error handling: If a device type does not exist at all, like asking for an NPU on a machine without one or a GPU on a headless server, then ...
- Ultimate fallback: If ...
- Quantized operators: These are necessary for NPU but are also independent, as they are useful for GPU and CPU too.

Feedback

Feedback welcome below. I have my preferences, but want to hear from you, and whether any other options/considerations are missing.
Option 3 seems the best to me; it is also used in e.g. OpenVINO, and it allows future interpretation for split/combined execution on multiple accelerators.
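For concreteness, here is a purely hypothetical sketch of what an option-3-style API could look like; the deviceTypes ordered-preference list is invented for illustration and is not part of any spec text:

```js
// Hypothetical option-3-style shape (NOT in the WebNN spec): an ordered
// preference list, loosely analogous to OpenVINO's "AUTO"/"MULTI" device
// selection, which would let the implementation split or combine execution
// across accelerators.
const context = await navigator.ml.createContext({
  deviceTypes: ["npu", "gpu", "cpu"], // invented option name, illustration only
});
```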
I actually like the simplicity of option 1, as long as we make it clear in the spec that the system may decide to fall back to other devices. The benefit of option 3 is a use case like "I want to use anything except the GPU", to load balance or something. But I feel we can explore this option when we gain more concrete needs from developers. A 4th option to satisfy the same need is: ... For now, option 1 seems a good starting point?
At the 2024-05-02 meeting, the W3C group agreed to start with option 1 (reserving the option to potentially expand if implementation experience shows need). Next step: spec update.
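As a rough sketch of option 1 as resolved (note that the "npu" enum value is the proposal under discussion here, not yet in the published spec):

```js
// Option 1: a single new "npu" value in MLContextOptions.deviceType. The
// user agent may transparently fall back to another device (GPU/CPU) when
// no NPU is present, rather than rejecting.
const context = await navigator.ml.createContext({ deviceType: "npu" });
const builder = new MLGraphBuilder(context);
```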
There was feedback regarding the motivation for this. Why is MLDeviceType even necessary, and shouldn't it be a browser implementation decision to choose the most appropriate processor given the existing MLPowerPreference? Or do we really need this for an application with extensive WebNN <-> WebGPU interop, for instance? I read through the other related issues but couldn't find a good motivating factor.
I think that could be possible even with the current API shape, if we spec it correctly. The details would be important, so please share examples of how you would like these scenarios to happen.
There's been some past discussion on the use cases for an explicit device type in a few places. One that comes to mind is: ... where I ask a bunch of dumb questions about the need for ...
After reading over #322, @inexorabletash, I still don't fully understand the argument for the device type. Compared to WebGPU, which can be implemented purely in software without a physical GPU, why is WebNN specifying a physical device type rather than leaving this up to the implementation? Instead, it would seem desirable to specify that one wishes to interop with a WebGPU device; that WebGPU device may be created as a purely software device, in which case running the WebNN computations on the CPU as well is desirable. In any case, this seems best left up to the browser implementation: it is the browser implementation that ensures WebNN computations are consistent across whatever physical hardware they run on. In that scenario, MLDeviceType should be removed from the WebNN API.
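For reference, a minimal sketch of the interop path alluded to above, using the createContext(GPUDevice) overload that already exists in the spec and requires no device type:

```js
// Tie the MLContext to an explicit GPUDevice instead of naming a device
// type. If the adapter happens to be a software implementation, the WebNN
// computation can follow suit.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
const context = await navigator.ml.createContext(device);
```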
AFAIK, there are two use cases that need specifying a device type: ...
This scenario was discussed before as a "system" or "auto" device type. There were some relevant issues: webmachinelearning/model-loader#30 and #257. |
Summarizing the use cases, we seem to have the following set of constraints: ...

Strictly including the discussion above (and references), ...

Did I miss something?
Another common use case would be the desire to run workloads on the GPU and ANE (and possibly CPU) simultaneously.
Assuming the interpretation above, that could fall under ... If we'd like to make that explicit, together with the fallback option [and eventually, when we want an error], we'd need to use option 3 from here, but there are downsides/complications.
It would seem preferable to keep it implicit, because from the web it would be hard for the website to make the correct decision for a given scenario, especially in a privacy-preserving manner.
Note that websites won't be the only clients, as installable web apps can also run locally via WebNN, and they know their scenarios better. In any case, the device type is a hint, not a requirement. Apps can also leave it unspecified.
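A minimal sketch of that default path, using only options already in the spec:

```js
// Omitting deviceType defers the device decision entirely to the
// implementation; powerPreference remains available as a softer hint.
const context = await navigator.ml.createContext({ powerPreference: "low-power" });
```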
Related to issues #128 and #302, we've been talking about supporting the NPU for the last few years. Now that more commercial NPU platforms are becoming available (e.g. with the recent arrival of the Intel Core Ultra NPU), it is time to formally define NPU support in the WebNN spec. There are two key elements of this specification:

1. A new NPU device type.
2. The quantizeLinear and dequantizeLinear operators. These two will be enough to handle quantized models by pairing them up at the right places in the model graph, the so-called tensor-oriented QDQ format used in ONNX. Additionally, two more prominent quantized operators, one for convolution and another for matmul, will allow more quantized models not already expressed in the QDQ format to function, i.e. conv2dInt and matmulInt.
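A hedged sketch of the tensor-oriented QDQ pattern described above, assuming quantizeLinear/dequantizeLinear builder methods with ONNX-style (input, scale, zeroPoint) arguments; the exact signatures and quantization values are placeholders pending the actual spec work:

```js
const builder = new MLGraphBuilder(context);
const input = builder.input("input", { dataType: "float32", dimensions: [1, 3, 224, 224] });

// Per-tensor quantization parameters (placeholder values).
const scale = builder.constant({ dataType: "float32", dimensions: [1] }, new Float32Array([0.02]));
const zeroPoint = builder.constant({ dataType: "int8", dimensions: [1] }, new Int8Array([0]));

// Q -> DQ pair around an op: a backend that recognizes the QDQ format can
// fuse this chain into a genuinely int8 convolution on the NPU, while other
// backends can still execute it as plain float math.
const q = builder.quantizeLinear(input, scale, zeroPoint);
const dq = builder.dequantizeLinear(q, scale, zeroPoint);
const filter = builder.constant({ dataType: "float32", dimensions: [32, 3, 3, 3] },
                                new Float32Array(32 * 3 * 3 * 3)); // zero-filled placeholder weights
const output = builder.conv2d(dq, filter);
```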