
Zmodel using the device placement approach #2534

Merged: 13 commits, Sep 28, 2023

Conversation

AlexandreEichenberger (Collaborator) commented Sep 27, 2023

Use the new infrastructure to move to the NNPA only those operations that are deemed profitable.

Currently optional; enable with the --enable-zhigh-perf-model flag.

Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger AlexandreEichenberger marked this pull request as draft September 27, 2023 02:16
return WalkResult::advance();
// Now we have an operation that can work on the NNPA; check if it's
// beneficial.
if (useZHighCostModel && !isOpFasterOnNNPA(op, &dimAnalysis)) {
AlexandreEichenberger (Collaborator Author):
@tungld Please let me know if this is how you envisioned it being used. I placed it there because, to the extent possible, I only want to invoke isOpFasterOnNNPA for operations that I know are candidates; otherwise there might be a lot of ops that are irrelevant to begin with. Open to suggestions if you had something else in mind.

tungld (Collaborator):

@AlexandreEichenberger I plan to do this earlier, at Line 74. The code here will respect what you add via the cost model (e.g. device = cpu). Something like this at Line 74:

 if (useZHighCostModel) {
   module.walk([&](Operation *op) {
     if (!isOpFasterOnNNPA(op, &dimAnalysis))
       op->setAttr(DEVICE_ATTRIBUTE, StringAttr::get(context, CPU_DEVICE));
   });
 }

tungld (Collaborator):

The legality check is aware of device=cpu in an operation if it is set, so the compiler will assign NNPA only to the remaining ops.

AlexandreEichenberger (Collaborator Author):

It's possible, but then it will see all of the ONNX ops, as opposed to only the ops that were deemed targets for NNPA.
Here it should run faster (it sees fewer ops) and also report only on ops that qualify. That helps with debug printing (only focusing on the important ops). So do you think it's worth it to move the code earlier?

tungld (Collaborator):

Yes, it is faster here. Actually, the code here annotates ops based on the result of analyzing the rewriting patterns, so it's a bit confusing if we add the cost model here. So, we will annotate the ops using the cost model first, and the rewrite patterns will internally respect the device attribute annotated by the cost model.

tungld (Collaborator):

@AlexandreEichenberger fyi, I am creating a PR to set the device using a JSON file: #2536. It works now but still needs some lit tests.

@@ -914,6 +914,11 @@ class PowToMulRewritePattern : public OpRewritePattern<ONNXPowOp> {
private:
// Check if a Pow can be simply rewritten as a sequence of multiply ops.
bool CanExpandPowOpToMul(ONNXPowOp op, int64_t &powVal) const {
#if 1 // hi alex
AlexandreEichenberger (Collaborator Author):

Will remove, there is still something I need to investigate.

Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger AlexandreEichenberger marked this pull request as ready for review September 27, 2023 16:04
AlexandreEichenberger (Collaborator Author):

@tungld I have a switch, #define BEFORE, that when set places the code where you suggested. Having it early makes tracking the performance of the model a bit harder, as it reports on all the ops, regardless of whether they actually run on NNPA or not. To give you an example, here is what I get with the placement before.

Test cost-benefit of CPU/NNPA for op %41 = "onnx.Mul"(%40, %15) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.truediv_6/truediv"} : (tensor<?x7xf32>, tensor<f32>) -> tensor<?x7xf32>
  Estimated times: nnpa 0.000002, cpu 0.000000
  Faster on CPU: Model estimates faster time on CPU. E2=32: dyn, assume full tile. For op:%41 = "onnx.Mul"(%40, %15) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.truediv_6/truediv"} : (tensor<?x7xf32>, tensor<f32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %42 = "onnx.Sin"(%41) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.sin_2/Sin"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%42 = "onnx.Sin"(%41) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.sin_2/Sin"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %43 = "onnx.Unsqueeze"(%42, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_6/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%43 = "onnx.Unsqueeze"(%42, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_6/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
Test cost-benefit of CPU/NNPA for op %44 = "onnx.Cos"(%41) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.cos_2/Cos"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%44 = "onnx.Cos"(%41) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.cos_2/Cos"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %45 = "onnx.Unsqueeze"(%44, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_7/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%45 = "onnx.Unsqueeze"(%44, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_7/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
Test cost-benefit of CPU/NNPA for op %46 = "onnx.Cast"(%arg7) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.cast_1/Cast", saturate = 1 : si64, to = f32} : (tensor<?x7xi32>) -> tensor<?x7xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%46 = "onnx.Cast"(%arg7) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.cast_1/Cast", saturate = 1 : si64, to = f32} : (tensor<?x7xi32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %47 = "onnx.Mul"(%46, %18) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.truediv_4/truediv"} : (tensor<?x7xf32>, tensor<f32>) -> tensor<?x7xf32>
  Estimated times: nnpa 0.000002, cpu 0.000000
  Faster on CPU: Model estimates faster time on CPU. E2=32: dyn, assume full tile. For op:%47 = "onnx.Mul"(%46, %18) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.truediv_4/truediv"} : (tensor<?x7xf32>, tensor<f32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %48 = "onnx.Sin"(%47) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.sin/Sin"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%48 = "onnx.Sin"(%47) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.math.sin/Sin"} : (tensor<?x7xf32>) -> tensor<?x7xf32>
Test cost-benefit of CPU/NNPA for op %49 = "onnx.Unsqueeze"(%48, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_2/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>
  Faster on NNPA: Candidate for NNPA without model; please add. For op:%49 = "onnx.Unsqueeze"(%48, %6) {onnx_node_name = "StatefulPartitionedCall/model_2/tf.expand_dims_2/ExpandDims"} : (tensor<?x7xf32>, tensor<1xi64>) -> tensor<?x7x1xf32>

Lots of ops that don't qualify. I could add all of the ops that might go to the NNPA and leave them without preference, and force to CPU all the ones I know don't work, but then we have a redundant system. The perf model then decides which ones are forced to CPU because we don't support them, and if it gets it wrong, the later legalization never sees them because they are already forced to CPU... Or I leave them undecided, and then the legalization also sees all of the ops.

Let me know if you still feel it's better to have it before.

tungld (Collaborator) commented Sep 28, 2023

Oh, now I understand that you would like to apply the cost model only to the NNPA operations, and put them back on CPU if beneficial. I thought that the cost model would eventually analyze the whole model and the relationships between the ops to give a better decision, so I proposed to do it separately.
So, given the current cost model (i.e. applying to a single op), I am OK with doing it in the current walk (no need for BEFORE).

Just a note to myself for the future: I am a bit concerned about inconsistency. The current walk annotates ops based on the result of applying the rewriting patterns. And the rewriting patterns are able to detect device=cpu in an op to force the op onto CPU.
In that sense, I am a bit hesitant to modify the result of applying the patterns, since it potentially causes inconsistency when we really apply the patterns to transform the code later. But, at this moment, I don't have an example that causes inconsistency.

Or I leave them undecided, and then the legalize also see all of the ops.

Yes, legalization can see all the ops.

tungld (Collaborator) left a comment:

LGTM. Thanks for the great contribution!

return WalkResult::advance();
});
}
#endif
tungld (Collaborator):

I understand the situation now. You can remove this.

* SPDX-License-Identifier: Apache-2.0
*/

//===----------------- Auto-Generated, do not change ---------------------===//
tungld (Collaborator):

Could you add comments on how to generate this? E.g., scripts, commands.

AlexandreEichenberger (Collaborator Author):

Oh, now I understand that you would like to apply the cost model only for the NNPA operations, and put them back to CPU if beneficial. I thought that the cost model would finally analyze the whole model and relationship between the ops to give better decision, so I proposed to do it separately.

That is my next step; for it, I will also need to know which ops are legal for the NNPA. We can chat about how best to do that.

Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger AlexandreEichenberger merged commit 06a10c0 into onnx:main Sep 28, 2023
@jenkins-droid

Jenkins Linux s390x Build #12857 [push] Zmodel using the device ... started at 12:30


Jenkins Linux ppc64le Build #11850 [push] Zmodel using the device ... started at 12:40


Jenkins Linux amd64 Build #12833 [push] Zmodel using the device ... started at 11:30


Jenkins Linux amd64 Build #12833 [push] Zmodel using the device ... failed after 1 hr 17 min


Jenkins Linux s390x Build #12857 [push] Zmodel using the device ... passed after 1 hr 22 min


Jenkins Linux ppc64le Build #11850 [push] Zmodel using the device ... passed after 1 hr 42 min
