
[NNPA] Use device attribute to control device placement for ONNX operations #2510

Merged: 11 commits into onnx:main, Sep 21, 2023

Conversation

tungld (Collaborator) commented Sep 15, 2023

To set the device for an ONNX operation (say, an op in the output of --EmitONNXIR, where all onnx-to-onnx transformations have been applied), we set the device attribute on the ONNX operation, e.g.

%0 = "onnx.Add"(%arg0, %arg1) {device = "cpu", onnx_node_name = "test/add0"} : (tensor<10x10xf32>, tensor<10x10xf32>) -> tensor<*xf32>

will place onnx.Add on CPU.

%0 = "onnx.Add"(%arg0, %arg1) {device = "nnpa", onnx_node_name = "test/add0"} : (tensor<10x10xf32>, tensor<10x10xf32>) -> tensor<*xf32>

will place onnx.Add on NNPA.

If there is no device attribute, the compiler will make the decision.
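
For illustration, a placement check inside the compiler could read the attribute with standard MLIR APIs. This is a minimal sketch, assuming the attribute layout shown above; getAssignedDevice is a made-up helper name, not an onnx-mlir function:

#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Return "cpu" or "nnpa" when the op carries a device attribute, or an
// empty string when placement is left to the compiler.
StringRef getAssignedDevice(Operation *op) {
  if (auto device = op->getAttrOfType<StringAttr>("device"))
    return device.getValue();
  return StringRef();
}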

The device attribute will facilitate our next steps, in particular using the cost model proposed in #2507, or a user-provided configuration file, to specify where to place an ONNX op.

Next step: create a pass, e.g. device-placement, to place ONNX operations using a cost model or a configuration file.

With the device attribute in place, the current way of forcing an op to CPU via --execNodesOnCPU can be removed.
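
For instance, the effect of --execNodesOnCPU could be reproduced by a small walk that tags the listed nodes. This is a hypothetical sketch (forceNodesOnCpu and its signature are made up here), using the onnx_node_name attribute to identify ops:

#include <string>

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinOps.h"

using namespace mlir;

// Tag every op whose onnx_node_name is listed with device = "cpu",
// mirroring what --execNodesOnCPU used to do.
void forceNodesOnCpu(ModuleOp module, llvm::ArrayRef<std::string> nodeNames) {
  MLIRContext *ctx = module.getContext();
  module.walk([&](Operation *op) {
    auto name = op->getAttrOfType<StringAttr>("onnx_node_name");
    if (name && llvm::is_contained(nodeNames, name.getValue().str()))
      op->setAttr("device", StringAttr::get(ctx, "cpu"));
  });
}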

imaihal (Collaborator) commented Sep 16, 2023

How can users specify which operations should run on cpu or nnpa?

if (device && device.getValue().equals_insensitive(CPU_DEVICE))
  return true;
// If device is NNPA, force to run the op on NNPA.
if (device && device.getValue().equals_insensitive(NNPA_DEVICE))
  return false; // (the quoted diff cuts off here; the return value is assumed)
Collaborator (review comment):

I would suggest adding a legality check.

tungld (Author):

Do you mean something like isNNPA() && isLegality()? device=NNPA can mean either forcing the op to NNPA or maybe good for NNPA. I am OK going with either one.

Forcing to NNPA is convenient when we annotate an op with device=NNPA directly and we really want that op to go to NNPA regardless of compiler optimizations.

maybe good for NNPA is safe when we use a cost model, since the cost model may make a mistake when assigning an op to NNPA (e.g. the op is not actually suitable for NNPA).

tungld (Author):

Forcing to NNPA is also useful when we have dynamic shapes and we want an op to run on NNPA because the compiler is not able to tell whether it is suitable for CPU or NNPA.

Collaborator:

I should have been clearer:

assert(isLegal(xxx) && "trying to force an op to NNPA that is not perceived as legal for NNPA");

tungld (Author) commented Sep 19, 2023

How can users specify which operations should run on cpu or nnpa?

We will later provide a configuration file (e.g. a JSON file) so that users can specify which ops run on cpu or nnpa. In the configuration file, users can use operation types (e.g. ONNXConv) or onnx_node_name to identify an op.
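
Sketching how such a pass might match ops against configuration entries, assuming each entry carries an op type and/or an onnx_node_name (DeviceConfigEntry and matches are hypothetical names; the actual file format is not settled yet):

#include <string>

#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// One entry of a (hypothetical) user configuration; empty fields match anything.
struct DeviceConfigEntry {
  std::string opType;   // e.g. "onnx.Conv"
  std::string nodeName; // e.g. "test/add0"
  std::string device;   // "cpu" or "nnpa"
};

// True when the op is selected by the entry, by type and/or node name.
bool matches(const DeviceConfigEntry &entry, Operation *op) {
  if (!entry.opType.empty() && op->getName().getStringRef() != entry.opType)
    return false;
  if (!entry.nodeName.empty()) {
    auto name = op->getAttrOfType<StringAttr>("onnx_node_name");
    if (!name || name.getValue() != entry.nodeName)
      return false;
  }
  return true;
}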

imaihal (Collaborator) commented Sep 19, 2023

We will later provide a configuration file

I see. Since I sometimes use the execNodesOnCPU option, could you keep it for a while until the configuration file is added? It's OK if you plan to add it soon.

AlexandreEichenberger (Collaborator):

It would be good if we could have the following flow:

  1. compile and print the MLIR with annotations on (automatically generated by some cost/legality analysis)
  2. let the experimenter change the assignment
  3. restart compilation with that assignment

Conversely, if using the JSON file is easier, that could also be done. There is a certain simplicity in having the info directly in the MLIR.

I would venture that it would also be convenient if a user knew (either via JSON or another annotation on the ops) that an op is legal on NNPA (regardless of whether it is assigned there or not).

tungld (Author) commented Sep 20, 2023

@AlexandreEichenberger yes, I am going with the flow you have in mind. Will ping you when it's available.

Signed-off-by: Tung D. Le

AlexandreEichenberger (Collaborator) left a comment:

LGTM, I would just clarify the warning a bit:

Warning: though the following operation was specified to run on NNPA, the compiler found that NNPA did not support that operation. It's potentially that the compiler was not able to check broadcasting in case of dynamic shape, so that it thought the operation was not legal for NNPA.

This does not tell the user what the result is. I would change the wording to make sure the user understands that the op will still go to NNPA.

Maybe

Warning: the following operation will run on the NNPA device even though the compiler believes that it is not legal to do so for this operation. The compiler may not have the full information necessary to accurately determine legality, for example in the presence of dynamic shapes to exclude broadcasting, or for some other reasons. If the model does not work properly, you may want to double-check the validity of mapping that operation to the NNPA device.

tungld (Author) commented Sep 21, 2023

@AlexandreEichenberger now with this patch we can do the following things:

  1. --EmitONNXIR --maccel=nnpa will produce an IR where NNPA ops are annotated with the attribute device = "nnpa", saying which operations will potentially run on NNPA.
  2. Users may change the device placement in the IR, say, setting device = "cpu" to force an op to run on CPU.
  3. Continue compiling the edited IR with --EmitZHighIR --maccel=nnpa or --EmitLib --maccel=nnpa.

For example, this is the output of --EmitONNXIR --maccel=NNPA for the mnist model:

func.func @main_graph(%arg0: tensor<1x1x28x28xf32>) -> tensor<1x10xf32> attributes {input_names = ["Input3"], output_names = ["Plus214_Output_0"]} {
    %0 = onnx.Constant dense<[-0.0822488219, -0.108868778, -0.141039595, -0.204869166, -0.17913565, -0.215438381, -0.133805066, -0.195724562, -0.268250644, -0.258212209, -0.0761560649, 0.0132841459, -0.00444464432, -0.414740831, -0.17879115, -0.0386558883]> : tensor<16xf32>
    %1 = onnx.Constant dense<[-0.161539719, -0.433835655, 0.091641359, -0.0168522168, -0.0650264397, -0.131737873, 0.0204175506, -0.121110231]> : tensor<8xf32>
    %2 = onnx.Constant dense_resource<__elided__> : tensor<16x4x4x10xf32>
    %3 = onnx.Constant dense_resource<__elided__> : tensor<16x8x5x5xf32>
    %4 = onnx.Constant dense_resource<__elided__> : tensor<8x1x5x5xf32>
    %5 = onnx.Constant dense<[1, 256]> : tensor<2xi64>
    %6 = onnx.Constant dense<[256, 10]> : tensor<2xi64>
    %7 = onnx.Constant dense<[[-0.0448560268, 0.00779166119, 0.0681008175, 0.0299937408, -0.126409635, 0.14021875, -0.0552849025, -0.0493838154, 0.0843220502, -0.0545404144]]> : tensor<1x10xf32>
    %8 = "onnx.Reshape"(%2, %6) {allowzero = 0 : si64, onnx_node_name = "Times212_reshape1"} : (tensor<16x4x4x10xf32>, tensor<2xi64>) -> tensor<256x10xf32>
    %9 = "onnx.Conv"(%arg0, %4, %1) {auto_pad = "SAME_UPPER", device = "nnpa", dilations = [1, 1], group = 1 : si64, kernel_shape = [5, 5], strides = [1, 1]} : (tensor<1x1x28x28xf32>, tensor<8x1x5x5xf32>, tensor<8xf32>) -> tensor<1x8x28x28xf32>
    %10 = "onnx.Relu"(%9) {device = "nnpa", onnx_node_name = "ReLU32"} : (tensor<1x8x28x28xf32>) -> tensor<1x8x28x28xf32>
    %11 = "onnx.MaxPoolSingleOut"(%10) {auto_pad = "NOTSET", ceil_mode = 0 : si64, device = "nnpa", kernel_shape = [2, 2], onnx_node_name = "Pooling66", pads = [0, 0, 0, 0], storage_order = 0 : si64, strides = [2, 2]} : (tensor<1x8x28x28xf32>) -> tensor<1x8x14x14xf32>
    %12 = "onnx.Conv"(%11, %3, %0) {auto_pad = "SAME_UPPER", device = "nnpa", dilations = [1, 1], group = 1 : si64, kernel_shape = [5, 5], strides = [1, 1]} : (tensor<1x8x14x14xf32>, tensor<16x8x5x5xf32>, tensor<16xf32>) -> tensor<1x16x14x14xf32>
    %13 = "onnx.Relu"(%12) {device = "nnpa", onnx_node_name = "ReLU114"} : (tensor<1x16x14x14xf32>) -> tensor<1x16x14x14xf32>
    %14 = "onnx.MaxPoolSingleOut"(%13) {auto_pad = "NOTSET", ceil_mode = 0 : si64, device = "nnpa", kernel_shape = [3, 3], onnx_node_name = "Pooling160", pads = [0, 0, 0, 0], storage_order = 0 : si64, strides = [3, 3]} : (tensor<1x16x14x14xf32>) -> tensor<1x16x4x4xf32>
    %15 = "onnx.Reshape"(%14, %5) {allowzero = 0 : si64, onnx_node_name = "Times212_reshape0"} : (tensor<1x16x4x4xf32>, tensor<2xi64>) -> tensor<1x256xf32>
    %16 = "onnx.Gemm"(%15, %8, %7) {alpha = 1.000000e+00 : f32, beta = 1.000000e+00 : f32, device = "nnpa", transA = 0 : si64, transB = 0 : si64} : (tensor<1x256xf32>, tensor<256x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
    return %16 : tensor<1x10xf32>
  }

From this, we can see which operations will run on NNPA by looking at device = "nnpa".

tungld (Author) commented Sep 21, 2023

@imaihal I am sorry, but the mechanism for onnx-mlir users to set cpu ops will come in another PR. With the current patch, you can instead work directly with the output of --EmitONNXIR as I mentioned above.

AlexandreEichenberger (Collaborator):

@tungld outstanding. In the meanwhile, I ran tests on z16, generating CSV files of data for each op, which are then processed by a Python script that generates code like this:

#include <math.h> // for ceil

bool isFasterOnNnpa_Div_3ds(double e3, double e2, double e1) {
  // Operation has cross over at complexity = 2483
  // Regression for CPU with r2 = 0.9999989217222092
  double complexityCpu = e3 * e2 * e1;
  double estimatedCpuTime = 1.4517483410281062e-09 * complexityCpu + 4.819629870926124e-07;
  // Regression for NNPA with r2 = 0.993964642126504
  double complexityNnpa = e3 * (ceil(e2 / 32.0) * 32.0) * (ceil(e1 / 64.0) * 64.0);
  double estimatedNnpaTime = 1.0448506395503133e-10 * complexityNnpa + 3.8276566890878624e-06;
  return estimatedNnpaTime < estimatedCpuTime;
}

which I will then use to evaluate the benefit of NNPA vs CPU.
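
(The crossover quoted in the generated comment follows from equating the two fits when the dimensions are already aligned to the NNPA tiles, i.e. complexityNnpa = complexityCpu = c: solving 1.4517e-09 * c + 4.8196e-07 = 1.0449e-10 * c + 3.8277e-06 gives c ≈ 2483. Below that complexity the CPU's smaller fixed overhead wins; above it, NNPA's smaller per-element cost wins.)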

I was going to integrate this into your pass, with a flag that indicates whether to use the benefit analysis or not. That seems less redundant than writing a new pass that does the same thing but with benefits. Feel free to comment here if you prefer one over the other.

tungld (Author) commented Sep 21, 2023

Yes, I expect that we can use the device-placement pass for that purpose, with a flag for the benefit analysis. I put a comment in the pass to mark the position where the cost model can plug in. Basically, you just walk through all ops of interest and set device, e.g.

module.walk([&](Operation *op) {
  if (!isFasterOnNNPA(op))
    op->setAttr(DEVICE_ATTRIBUTE, StringAttr::get(context, CPU_DEVICE));
});

tungld merged commit ee11bc5 into onnx:main on Sep 21, 2023. 7 checks passed.
jenkins-droid:

Jenkins Linux amd64 Build #12756 [push] [NNPA] Use device attrib... started at 09:52, failed after 1 hr 8 min
Jenkins Linux s390x Build #12779 [push] [NNPA] Use device attrib... started at 10:52, passed after 1 hr 35 min
Jenkins Linux ppc64le Build #11772 [push] [NNPA] Use device attrib... started at 11:01, passed after 1 hr 50 min
