Replies: 2 comments
-
Scope: https://onnxruntime.ai/docs/execution-providers/NNAPI-ExecutionProvider.html
-
Mobile/edge device capabilities are clearly different from a server-side scenario: on mobile you're not trying to maximize the usage of a machine handling concurrent requests. Performance needs to be acceptable for the scenario on the target devices, but things like memory usage and model size may actually be more critical.

The ORT flag will result in this being called: https://developer.android.com/ndk/reference/group/neural-networks#aneuralnetworksmodel_relaxcomputationfloat32tofloat16. I believe that allows NNAPI to use fp16 internally if/when it chooses. As it's up to NNAPI to make those choices, I don't think there's any implied guarantee of a performance improvement when the flag is set. That is also only part of the story, as it will only apply to nodes in the model that ORT's NNAPI EP knows how to convert to an NNAPI model.

You really need to check the node assignments for the model to know how many nodes would be using NNAPI. That can be done by setting log_severity_level in SessionOptions to VERBOSE (0), providing the session options when creating the InferenceSession, and looking for 'Node placements' and 'NnapiExecutionProvider::GetCapability' in the output.

Also note that NNAPI performance can vary significantly across devices, as it's highly dependent on the hardware vendor's implementation of the low-level NNAPI components. For example, NNAPI will fall back to a reference CPU implementation (the simplest way to perform the operation, with little to no optimization) if a hardware-specific implementation is not available. If that happens, using the ORT CPU EP is likely to provide better performance.
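For reference, here is a minimal sketch of that setup using the ORT Java/Android bindings. The class name and model path are placeholders, and the method names (`addNnapi`, `NNAPIFlags.USE_FP16`, `setSessionLogLevel`) come from the Java API, so double-check them against the version you're using:

```java
import java.util.EnumSet;

import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtLoggingLevel;
import ai.onnxruntime.OrtSession;
import ai.onnxruntime.providers.NNAPIFlags;

public class NnapiSessionSketch {
    public static OrtSession create(String modelPath) throws OrtException {
        // Verbose environment logging so 'Node placements' and
        // 'NnapiExecutionProvider::GetCapability' messages show up in logcat.
        OrtEnvironment env = OrtEnvironment.getEnvironment(
                OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE, "nnapi-check");

        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();

        // Register the NNAPI EP with the fp16 relaxation flag. This only allows
        // NNAPI to relax fp32 to fp16 where it chooses to; it's not a guarantee.
        opts.addNnapi(EnumSet.of(NNAPIFlags.USE_FP16));

        // Per-session log severity: VERBOSE (0) prints the node placement summary.
        opts.setSessionLogLevel(OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE);

        return env.createSession(modelPath, opts);
    }
}
```

Any node the NNAPI EP can't take will show up in that node placement output as assigned to the CPU EP, which is the quickest way to see how much of the model actually runs through NNAPI.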
-
Given that running inference on mobile is slow, knowing the tips, tricks, and pitfalls is critical for improving performance.
So I would like to open a discussion about NNAPI performance.
My first question is: I've been reading that using the FP16 flag improves performance, but I haven't seen any improvement when loading an ONNX file with full graph optimization. Does that mean that, to get the benefits of FP16, the ONNX file needs to be converted to ORT format using FP16? Or does setting the FP16 flag together with full graph optimization already tell the runtime to convert to FP16?
Relevant references:
- Converting onnx to ORT with nnapi support
- Issue using NNAPI on Android device