
[NNPA] Use device attribute to control device placement for ONNX operations #2510

Merged: 11 commits into onnx:main, Sep 21, 2023

Conversation

tungld (Collaborator) commented Sep 15, 2023

To set the device for an ONNX operation (say, an op in the output of --EmitONNXIR, where all onnx-to-onnx transformations have been applied), we set the device attribute on the ONNX operation, e.g.

%0 = "onnx.Add"(%arg0, %arg1) {device = "cpu", onnx_node_name = "test/add0"} : (tensor<10x10xf32>, tensor<10x10xf32>) -> tensor<*xf32>

will place onnx.Add on CPU.

%0 = "onnx.Add"(%arg0, %arg1) {device = "nnpa", onnx_node_name = "test/add0"} : (tensor<10x10xf32>, tensor<10x10xf32>) -> tensor<*xf32>

will place onnx.Add on NNPA.

If there is no device attribute, the compiler will make the decision.
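
For illustration, a placement check inside the compiler could read the attribute with standard MLIR APIs. This is a minimal sketch, assuming the attribute layout shown above; getAssignedDevice is a made-up helper name, not an onnx-mlir function:

#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Return "cpu" or "nnpa" when the op carries a device attribute, or an
// empty string when placement is left to the compiler.
StringRef getAssignedDevice(Operation *op) {
  if (auto device = op->getAttrOfType<StringAttr>("device"))
    return device.getValue();
  return StringRef();
}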

The device attribute will facilitate our next steps, in particular using the cost model proposed in #2507, or a user-provided configuration file, to specify where to place an ONNX op.

Next step: create a pass, e.g. device-placement, to place ONNX operations using a cost model or a configuration file.

With the device attribute in place, the current way of forcing an op to CPU via --execNodesOnCPU can be removed.
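
For instance, the effect of --execNodesOnCPU could be reproduced by a small walk that tags the listed nodes. This is a hypothetical sketch (forceNodesOnCpu and its signature are made up here), using the onnx_node_name attribute to identify ops:

#include <string>

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinOps.h"

using namespace mlir;

// Tag every op whose onnx_node_name is listed with device = "cpu",
// mirroring what --execNodesOnCPU used to do.
void forceNodesOnCpu(ModuleOp module, llvm::ArrayRef<std::string> nodeNames) {
  MLIRContext *ctx = module.getContext();
  module.walk([&](Operation *op) {
    auto name = op->getAttrOfType<StringAttr>("onnx_node_name");
    if (name && llvm::is_contained(nodeNames, name.getValue().str()))
      op->setAttr("device", StringAttr::get(ctx, "cpu"));
  });
}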

imaihal (Collaborator) commented Sep 16, 2023

How can users specify which operations should run on cpu or nnpa?

if (device && device.getValue().equals_insensitive(CPU_DEVICE))
  return true;
// If device is NNPA, force to run the op on NNPA.
if (device && device.getValue().equals_insensitive(NNPA_DEVICE))
  return false; // (the quoted diff cuts off here; the return value is assumed)
Collaborator (review comment):

I would suggest adding a legality check.

tungld (Author):

Do you mean something like isNNPA() && isLegality()? device=NNPA can mean either forcing the op to NNPA or maybe good for NNPA. I am OK going with either one.

Forcing to NNPA is convenient when we annotate an op with device=NNPA directly and we really want that op to go to NNPA regardless of compiler optimizations.

maybe good for NNPA is safe when we use a cost model, since the cost model may make a mistake when assigning an op to NNPA (e.g. the op is not actually suitable for NNPA).

tungld (Author):

Forcing to NNPA is also useful when we have dynamic shapes and we want an op to run on NNPA because the compiler is not able to tell whether it is suitable for CPU or NNPA.

Collaborator:

I should have been clearer:

assert(isLegal(xxx) && "trying to force an op to NNPA that is not perceived as legal for NNPA");

tungld (Author) commented Sep 19, 2023

How can users specify which operations should run on cpu or nnpa?

We will later provide a configuration file (e.g. a JSON file) so that users can specify which ops run on cpu or nnpa. In the configuration file, users can use operation types (e.g. ONNXConv) or onnx_node_name to identify an op.
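
Sketching how such a pass might match ops against configuration entries, assuming each entry carries an op type and/or an onnx_node_name (DeviceConfigEntry and matches are hypothetical names; the actual file format is not settled yet):

#include <string>

#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// One entry of a (hypothetical) user configuration; empty fields match anything.
struct DeviceConfigEntry {
  std::string opType;   // e.g. "onnx.Conv"
  std::string nodeName; // e.g. "test/add0"
  std::string device;   // "cpu" or "nnpa"
};

// True when the op is selected by the entry, by type and/or node name.
bool matches(const DeviceConfigEntry &entry, Operation *op) {
  if (!entry.opType.empty() && op->getName().getStringRef() != entry.opType)
    return false;
  if (!entry.nodeName.empty()) {
    auto name = op->getAttrOfType<StringAttr>("onnx_node_name");
    if (!name || name.getValue() != entry.nodeName)
      return false;
  }
  return true;
}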

imaihal (Collaborator) commented Sep 19, 2023

We will later provide a configuration file

I see. Since I sometimes use the execNodesOnCPU option, could you keep it for a while until the configuration file is added? It's OK if you plan to add it soon.

AlexandreEichenberger (Collaborator):

It would be good if we could have the following flow:

  1. compile and print the MLIR with annotations on (automatically generated by some cost/legality analysis)
  2. let the experimenter change the assignment
  3. restart compilation with that assignment

Conversely, if using the JSON file is easier, that could also be done. There is a certain simplicity in having the info directly in the MLIR.

I would venture that it would also be convenient if a user knew (either via JSON or another annotation on the ops) that an op is legal on NNPA (regardless of whether it is assigned there or not).

tungld (Author) commented Sep 20, 2023

@AlexandreEichenberger yes, I am going with the flow you have in mind. Will ping you when it's available.

Signed-off-by: Tung D. Le

AlexandreEichenberger (Collaborator) left a comment:

LGTM, I would just clarify the warning a bit:

Warning: though the following operation was specified to run on NNPA, the compiler found that NNPA did not support that operation. It's potentially that the compiler was not able to check broadcasting in case of dynamic shape, so that it thought the operation was not legal for NNPA.

This does not tell the user what the result is. I would change the wording to make sure the user understands that the op will still go to NNPA.

Maybe

Warning: the following operation will run on the NNPA device even though the compiler believes that it is not legal to do so for this operation. The compiler may not have the full information necessary to accurately determine legality, for example in the presence of dynamic shapes to exclude broadcasting, or for some other reasons. If the model does not work properly, you may want to double-check the validity of mapping that operation to the NNPA device.

tungld (Author) commented Sep 21, 2023

@AlexandreEichenberger now with this patch we can do the following things:

  1. --EmitONNXIR --maccel=nnpa will produce an IR where NNPA ops are annotated with the attribute device = "nnpa", saying which operations will potentially run on NNPA.
  2. Users may change the device placement in the IR, say, setting device = "cpu" to force an op to run on CPU.
  3. Continue compiling the edited IR with --EmitZHighIR --maccel=nnpa or --EmitLib --maccel=nnpa.

For example, this is the output of --EmitONNXIR --maccel=NNPA for the mnist model:

func.func @main_graph(%arg0: tensor<1x1x28x28xf32>) -> tensor<1x10xf32> attributes {input_names = ["Input3"], output_names = ["Plus214_Output_0"]} {
    %0 = onnx.Constant dense<[-0.0822488219, -0.108868778, -0.141039595, -0.204869166, -0.17913565, -0.215438381, -0.133805066, -0.195724562, -0.268250644, -0.258212209, -0.0761560649, 0.0132841459, -0.00444464432, -0.414740831, -0.17879115, -0.0386558883]> : tensor<16xf32>
    %1 = onnx.Constant dense<[-0.161539719, -0.433835655, 0.091641359, -0.0168522168, -0.0650264397, -0.131737873, 0.0204175506, -0.121110231]> : tensor<8xf32>
    %2 = onnx.Constant dense_resource<__elided__> : tensor<16x4x4x10xf32>
    %3 = onnx.Constant dense_resource<__elided__> : tensor<16x8x5x5xf32>
    %4 = onnx.Constant dense_resource<__elided__> : tensor<8x1x5x5xf32>
    %5 = onnx.Constant dense<[1, 256]> : tensor<2xi64>
    %6 = onnx.Constant dense<[256, 10]> : tensor<2xi64>
    %7 = onnx.Constant dense<[[-0.0448560268, 0.00779166119, 0.0681008175, 0.0299937408, -0.126409635, 0.14021875, -0.0552849025, -0.0493838154, 0.0843220502, -0.0545404144]]> : tensor<1x10xf32>
    %8 = "onnx.Reshape"(%2, %6) {allowzero = 0 : si64, onnx_node_name = "Times212_reshape1"} : (tensor<16x4x4x10xf32>, tensor<2xi64>) -> tensor<256x10xf32>
    %9 = "onnx.Conv"(%arg0, %4, %1) {auto_pad = "SAME_UPPER", device = "nnpa", dilations = [1, 1], group = 1 : si64, kernel_shape = [5, 5], strides = [1, 1]} : (tensor<1x1x28x28xf32>, tensor<8x1x5x5xf32>, tensor<8xf32>) -> tensor<1x8x28x28xf32>
    %10 = "onnx.Relu"(%9) {device = "nnpa", onnx_node_name = "ReLU32"} : (tensor<1x8x28x28xf32>) -> tensor<1x8x28x28xf32>
    %11 = "onnx.MaxPoolSingleOut"(%10) {auto_pad = "NOTSET", ceil_mode = 0 : si64, device = "nnpa", kernel_shape = [2, 2], onnx_node_name = "Pooling66", pads = [0, 0, 0, 0], storage_order = 0 : si64, strides = [2, 2]} : (tensor<1x8x28x28xf32>) -> tensor<1x8x14x14xf32>
    %12 = "onnx.Conv"(%11, %3, %0) {auto_pad = "SAME_UPPER", device = "nnpa", dilations = [1, 1], group = 1 : si64, kernel_shape = [5, 5], strides = [1, 1]} : (tensor<1x8x14x14xf32>, tensor<16x8x5x5xf32>, tensor<16xf32>) -> tensor<1x16x14x14xf32>
    %13 = "onnx.Relu"(%12) {device = "nnpa", onnx_node_name = "ReLU114"} : (tensor<1x16x14x14xf32>) -> tensor<1x16x14x14xf32>
    %14 = "onnx.MaxPoolSingleOut"(%13) {auto_pad = "NOTSET", ceil_mode = 0 : si64, device = "nnpa", kernel_shape = [3, 3], onnx_node_name = "Pooling160", pads = [0, 0, 0, 0], storage_order = 0 : si64, strides = [3, 3]} : (tensor<1x16x14x14xf32>) -> tensor<1x16x4x4xf32>
    %15 = "onnx.Reshape"(%14, %5) {allowzero = 0 : si64, onnx_node_name = "Times212_reshape0"} : (tensor<1x16x4x4xf32>, tensor<2xi64>) -> tensor<1x256xf32>
    %16 = "onnx.Gemm"(%15, %8, %7) {alpha = 1.000000e+00 : f32, beta = 1.000000e+00 : f32, device = "nnpa", transA = 0 : si64, transB = 0 : si64} : (tensor<1x256xf32>, tensor<256x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
    return %16 : tensor<1x10xf32>
  }

From this, we can see which operations will run on NNPA by looking at device = "nnpa".

tungld (Author) commented Sep 21, 2023

@imaihal I am sorry, but the mechanism for onnx-mlir users to set cpu ops will come in another PR. With the current patch, you can instead work directly with the output of --EmitONNXIR as I mentioned above.

AlexandreEichenberger (Collaborator):

@tungld outstanding. In the meanwhile, I ran tests on z16, generating CSV files of data for each op, which are then processed by a Python script that generates code like this:

#include <math.h> // for ceil

bool isFasterOnNnpa_Div_3ds(double e3, double e2, double e1) {
  // Operation has cross over at complexity = 2483
  // Regression for CPU with r2 = 0.9999989217222092
  double complexityCpu = e3 * e2 * e1;
  double estimatedCpuTime = 1.4517483410281062e-09 * complexityCpu + 4.819629870926124e-07;
  // Regression for NNPA with r2 = 0.993964642126504
  double complexityNnpa = e3 * (ceil(e2 / 32.0) * 32.0) * (ceil(e1 / 64.0) * 64.0);
  double estimatedNnpaTime = 1.0448506395503133e-10 * complexityNnpa + 3.8276566890878624e-06;
  return estimatedNnpaTime < estimatedCpuTime;
}

which I will then use to evaluate the benefit of NNPA vs CPU.
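
(The crossover quoted in the generated comment follows from equating the two fits when the dimensions are already aligned to the NNPA tiles, i.e. complexityNnpa = complexityCpu = c: solving 1.4517e-09 * c + 4.8196e-07 = 1.0449e-10 * c + 3.8277e-06 gives c ≈ 2483. Below that complexity the CPU's smaller fixed overhead wins; above it, NNPA's smaller per-element cost wins.)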

I was going to integrate this into your pass, with a flag that indicates whether to use the benefit analysis or not. That seems less redundant than writing a new pass that does the same thing but with benefits. Feel free to comment here if you prefer one over the other.

tungld (Author) commented Sep 21, 2023

Yes, I expect that we can use the device-placement pass for that purpose, with a flag for the benefit analysis. I put a comment in the pass to mark the position where the cost model can plug in. Basically, you just walk through all ops of interest and set device, e.g.

module.walk([&](Operation *op) {
  if (!isFasterOnNNPA(op))
    op->setAttr(DEVICE_ATTRIBUTE, StringAttr::get(context, CPU_DEVICE));
});

tungld merged commit ee11bc5 into onnx:main on Sep 21, 2023. 7 checks passed.
jenkins-droid:

Jenkins Linux amd64 Build #12756 [push] [NNPA] Use device attrib... started at 09:52, failed after 1 hr 8 min
Jenkins Linux s390x Build #12779 [push] [NNPA] Use device attrib... started at 10:52, passed after 1 hr 35 min
Jenkins Linux ppc64le Build #11772 [push] [NNPA] Use device attrib... started at 11:01, passed after 1 hr 50 min
