
Zmodel using the device placement approach #2534

Merged: 13 commits, Sep 28, 2023

Conversation

AlexandreEichenberger (Collaborator) commented Sep 27, 2023

Use the new infrastructure to move to the NNPA only those operations that are deemed profitable.

Currently optional; enable with the --enable-zhigh-perf-model flag.

Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger AlexandreEichenberger marked this pull request as draft September 27, 2023 02:16
return WalkResult::advance();
// Now we have an operation that can work on the NNPA; check if it's
// beneficial.
if (useZHighCostModel && !isOpFasterOnNNPA(op, &dimAnalysis)) {
AlexandreEichenberger (Collaborator Author):
@tungld Please let me know if this is how you envisioned it being used. I placed it there because, to the extent possible, I only want to invoke isOpFasterOnNNPA for operations that I know are candidates; otherwise there might be a lot of ops that are irrelevant to begin with. Open to suggestions if you had something else in mind.

tungld (Collaborator):

@AlexandreEichenberger I plan to do this earlier, at Line 74. The code here will respect what you add via the cost model (e.g. device = cpu). Something like this at Line 74:

 if (useZHighCostModel) {
   module.walk([&](Operation *op) {
     if (!isOpFasterOnNNPA(op, &dimAnalysis))
       op->setAttr(DEVICE_ATTRIBUTE, StringAttr::get(context, CPU_DEVICE));
   });
 }

tungld (Collaborator):

The legality check is aware of device=cpu in an operation if it is set, so the compiler will assign NNPA only to the remaining ops.

AlexandreEichenberger (Collaborator Author):

It's possible, but then it will see all of the ONNX ops, as opposed to only the ops that were deemed targets for NNPA.
Here it should run faster (it sees fewer ops) and also report only on ops that qualify. That helps with debug printing (only focusing on the important ops). So do you think it's worth it to move the code earlier?

tungld (Collaborator):

Yes, it is faster here. Actually, the code here annotates ops based on the result of analyzing the rewriting patterns, so it's a bit confusing if we add the cost model here. So, we will annotate the ops using the cost model first, and the rewrite patterns will internally respect the device attribute annotated by the cost model.

tungld (Collaborator):

@AlexandreEichenberger fyi, I am creating a PR to set the device using a JSON file: #2536. It works now but still needs some lit tests.

@@ -914,6 +914,11 @@ class PowToMulRewritePattern : public OpRewritePattern<ONNXPowOp> {
private:
// Check if a Pow can be simply rewritten as a sequence of multiply ops.
bool CanExpandPowOpToMul(ONNXPowOp op, int64_t &powVal) const {
#if 1 // hi alex
AlexandreEichenberger (Collaborator Author):

Will remove, there is still something I need to investigate.

Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger AlexandreEichenberger marked this pull request as ready for review September 27, 2023 16:04
AlexandreEichenberger (Collaborator Author):

@tungld I have a switch, #define BEFORE, that when set places the code where you suggested. Having it early makes tracking the performance of the model a bit harder, as it reports on all the ops, regardless of whether they actually run on NNPA or not. To give you an example, here is what I get with the placement before.

Test cost-benefit of CPU/NNPA for op %41 = "onnx.Mul"(%40, %15) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.truediv_6/truediv"} : (tensor<?x7xf32>, tensor<f32>) -> tensor<?x7xf32>
  Estimated times: nnpa 0.000002, cpu 0.000000
  Faster on CPU: Model estimates faster time on CPU. E2=32: dyn, assume full tile. For op:%41 = "onnx.Mul"(%40, %15) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.truediv_6/truediv"} : (tensor<?x7xf32>, tensor<f32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %42 = "onnx.Sin"(%41) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.sin_2/Sin"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%42 = "onnx.Sin"(%41) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.sin_2/Sin"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %43 = "onnx.Unsqueeze"(%42, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_6/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%43 = "onnx.Unsqueeze"(%42, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_6/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
Test cost-benefit of CPU/NNPA for op %44 = "onnx.Cos"(%41) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.cos_2/Cos"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%44 = "onnx.Cos"(%41) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.cos_2/Cos"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %45 = "onnx.Unsqueeze"(%44, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_7/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%45 = "onnx.Unsqueeze"(%44, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_7/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
Test cost-benefit of CPU/NNPA for op %46 = "onnx.Cast"(%arg7) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.cast_1/Cast", saturate = 1 : si64, to = f32} : (tensor<?x7xi32>) -> tensor<?x7xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%46 = "onnx.Cast"(%arg7) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.cast_1/Cast", saturate = 1 : si64, to = f32} : (tensor<?x7xi32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %47 = "onnx.Mul"(%46, %18) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.truediv_4/truediv"} : (tensor<?x7xf32>, tensor<f32>) -> tensor<?x7xf32>
  Estimated times: nnpa 0.000002, cpu 0.000000
  Faster on CPU: Model estimates faster time on CPU. E2=32: dyn, assume full tile. For op:%47 = "onnx.Mul"(%46, %18) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.truediv_4/truediv"} : (tensor<?x7xf32>, tensor<f32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %48 = "onnx.Sin"(%47) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.sin/Sin"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%48 = "onnx.Sin"(%47) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.sin/Sin"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %49 = "onnx.Unsqueeze"(%48, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_2/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%49 = "onnx.Unsqueeze"(%48, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_2/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>

Lots of ops that don't qualify. I could add all of the ops that might go to the NNPA and leave them without preference, and force to CPU all the ones I know don't work, but then we have a redundant system. The perf model then decides which ones are forced to CPU because we don't support them, and if it gets it wrong, the later legalization never sees them because they are already forced to CPU... Or I leave them undecided, and then the legalization also sees all of the ops.

Let me know if you still feel it's better to have it before.

tungld (Collaborator) commented Sep 28, 2023

Oh, now I understand that you would like to apply the cost model only to the NNPA operations, and put them back on CPU if beneficial. I thought that the cost model would eventually analyze the whole model and the relationships between the ops to give a better decision, so I proposed to do it separately.
So, given the current cost model (i.e. applying to a single op), I am OK with doing it in the current walk (no need for BEFORE).

Just a note to myself for the future: I am a bit concerned about inconsistency. The current walk annotates ops based on the result of applying the rewriting patterns. And the rewriting patterns are able to detect device=cpu in an op to force the op onto CPU.
In that sense, I am a bit hesitant to modify the result of applying the patterns, since it potentially causes inconsistency when we really apply the patterns to transform the code later. But, at this moment, I don't have an example that causes inconsistency.

Or I leave them undecided, and then the legalize also see all of the ops.

Yes, legalization can see all the ops.

tungld (Collaborator) left a comment:

LGTM. Thanks for the great contribution!

return WalkResult::advance();
});
}
#endif
tungld (Collaborator):

I understand the situation now. You can remove this.

* SPDX-License-Identifier: Apache-2.0
*/

//===----------------- Auto-Generated, do not change ---------------------===//
tungld (Collaborator):

Could you add comments on how to generate this? E.g., scripts, commands.

AlexandreEichenberger (Collaborator Author):

Oh, now I understand that you would like to apply the cost model only for the NNPA operations, and put them back to CPU if beneficial. I thought that the cost model would finally analyze the whole model and relationship between the ops to give better decision, so I proposed to do it separately.

That is my next step; for it, I will also need to know which ops are legal for the NNPA. We can chat about how best to do that.

Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger AlexandreEichenberger merged commit 06a10c0 into onnx:main Sep 28, 2023
@jenkins-droid

Jenkins Linux s390x Build #12857 [push] Zmodel using the device ... started at 12:30


Jenkins Linux ppc64le Build #11850 [push] Zmodel using the device ... started at 12:40


Jenkins Linux amd64 Build #12833 [push] Zmodel using the device ... started at 11:30


Jenkins Linux amd64 Build #12833 [push] Zmodel using the device ... failed after 1 hr 17 min


Jenkins Linux s390x Build #12857 [push] Zmodel using the device ... passed after 1 hr 22 min


Jenkins Linux ppc64le Build #11850 [push] Zmodel using the device ... passed after 1 hr 42 min
