[compilation][cpu]: failed to legalize operation onnx.Multinomial #18268

Closed · pdhirajkumarprasad opened this issue Aug 19, 2024 · 18 comments
Labels: bug 🐞 Something isn't working · hal/cpu Runtime Host/CPU-based HAL backend · integrations/onnx ONNX integration work

@pdhirajkumarprasad commented Aug 19, 2024:

What happened?

For the given IR:

module {
  func.func @"torch-jit-export"( %arg6: !torch.vtensor<[?,4],f32>) -> (!torch.vtensor<[?,1],si64>) attributes {torch.onnx_meta.ir_version = 6 : si64, torch.onnx_meta.opset_version = 21 : si64, torch.onnx_meta.producer_name = "pytorch", torch.onnx_meta.producer_version = "1.7"} {
    %82 = torch.operator "onnx.Multinomial"(%arg6) {torch.onnx.dtype = 7 : si64, torch.onnx.sample_size = 1 : si64} : (!torch.vtensor<[?,4],f32>) -> !torch.vtensor<[?,1],si64> 
    return %82: !torch.vtensor<[?,1],si64>
  }
}

I get the following error:

t1.mlir:3:11: error: 'arith.addf' op requires the same type for all operands and results
    %82 = torch.operator "onnx.Multinomial"(%arg6) {torch.onnx.dtype = 7 : si64, torch.onnx.sample_size = 1 : si64} : (!torch.vtensor<[?,4],f32>) -> !torch.vtensor<[?,1],si64> 
          ^
// -----// IR Dump After ConvertTorchToLinalg Failed (convert-torch-to-linalg) //----- //
"func.func"() <{function_type = (!torch.vtensor<[?,4],f32>) -> !torch.vtensor<[?,1],si64>, sym_name = "torch-jit-export"}> ({
^bb0(%arg0: !torch.vtensor<[?,4],f32> loc("t1.mlir":2:34)):
  %0 = "torch_c.to_builtin_tensor"(%arg0) : (!torch.vtensor<[?,4],f32>) -> tensor<?x4xf32> loc("t1.mlir":3:11)
  %1 = "torch.constant.int"() <{value = 1 : i64}> : () -> !torch.int loc("t1.mlir":3:11)
  %2 = "arith.constant"() <{value = 1 : i64}> : () -> i64 loc("t1.mlir":3:11)
  %3 = "torch.constant.none"() : () -> !torch.none loc("t1.mlir":3:11)
  %4 = "torch.constant.bool"() <{value = true}> : () -> !torch.bool loc("t1.mlir":3:11)
  %5 = "arith.constant"() <{value = 0 : i64}> : () -> i64 loc("t1.mlir":3:11)
  %6 = "arith.constant"() <{value = 1 : i64}> : () -> i64 loc("t1.mlir":3:11)
  %7 = "arith.constant"() <{value = 0 : index}> : () -> index loc("t1.mlir":3:11)
  %8 = "arith.constant"() <{value = 1 : index}> : () -> index loc("t1.mlir":3:11)
  %9 = "arith.index_cast"(%2) : (i64) -> index loc("t1.mlir":3:11)
  %10 = "tensor.dim"(%0, %7) : (tensor<?x4xf32>, index) -> index loc("t1.mlir":3:11)
  %11 = "tensor.dim"(%0, %8) : (tensor<?x4xf32>, index) -> index loc("t1.mlir":3:11)

IREE Version:
IREE compiler version 20240819.990 @ aeda149
LLVM version 20.0.0git

Steps to reproduce your issue

Command to reproduce the issue:

iree-compile --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu model.torch_onnx.mlir

What component(s) does this issue relate to?

Compiler

Version information

No response

Additional context

No response

pdhirajkumarprasad added the bug 🐞 Something isn't working label Aug 19, 2024
@nirvedhmeshram (Contributor) commented:

@pdhirajkumarprasad For this and all the other bugs you filed recently, could you please add the --mlir-print-ir-after-all flag to the iree-compile command, copy the dump to a file, and share a link on the issue? It will help us understand which layer the issue is in and who to assign it to. cc @MaheshRavishankar
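
For reference, a command along these lines should produce the dump (this just combines the flag named above with the repro command from this issue; MLIR's IR printing goes to stderr, so redirect that to a file):

iree-compile --mlir-print-ir-after-all --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu model.torch_onnx.mlir 2> ir_dump.txt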

lialan added the hal/cpu Runtime Host/CPU-based HAL backend label Aug 19, 2024
ScottTodd added the integrations/onnx ONNX integration work label Aug 19, 2024
@pdhirajkumarprasad (Author) commented:

> @pdhirajkumarprasad For this and all the other bugs you filed recently, could you please add the --mlir-print-ir-after-all flag to the iree-compile command, copy the dump to a file, and share a link on the issue? It will help us understand which layer the issue is in and who to assign it to. cc @MaheshRavishankar

@nirvedhmeshram I have updated all my issues with the stage where they fail, along with the version used.

@pdhirajkumarprasad (Author) commented:

Also, this may be related to llvm/torch-mlir#3630.

@lialan (Contributor) commented Aug 20, 2024:

@zjgarvey @rsuderman Can you provide some context on this issue? As @pdhirajkumarprasad mentioned, it is possible that it is caused by llvm/torch-mlir#3630.

@zjgarvey (Contributor) commented Aug 20, 2024:

Yeah, even with the PR linked above, this op will still fail to compile. I'll post a linalg reproducer for the next compilation issue here today. This is another op that compiles successfully through our ref-backend pipeline in torch-mlir e2e testing, but not through iree-compile.

@pashu123 (Contributor) commented:

@zjgarvey The error showing here is orthogonal to the IREE issue: the op is not being converted from torch to linalg. Maybe the lowering was added recently and the torch-mlir submodule hasn't been bumped in IREE.

@zjgarvey (Contributor) commented:

@pashu123 The PR llvm/torch-mlir#3630 resolves the torch-mlir error.

@zjgarvey (Contributor) commented:

If someone reviews that PR, we can get it merged and work on the next compilation issue for this op.

@lialan (Contributor) commented Aug 20, 2024:

@zjgarvey So @pashu123 has approved the PR, and it looks clear. Please proceed.

@zjgarvey (Contributor) commented:

Ah, right. Here is the IR generated from an e2e test in torch-mlir: "MultinomialModule2D_basic".

This compiles successfully through the ref-backend pipeline in torch-mlir, but fails to compile with IREE for llvm-cpu.

#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0)>
#map2 = affine_map<(d0, d1) -> ()>
#map3 = affine_map<() -> ()>
module attributes {torch.debug_module_name = "MultinomialModule2D"} {
  ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
  func.func @forward(%arg0: tensor<?x?xf64>) -> tensor<f64> {
    %c1048576_i64 = arith.constant 1048576 : i64
    %c0_i64 = arith.constant 0 : i64
    %c1_i64 = arith.constant 1 : i64
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %cst = arith.constant 0.000000e+00 : f64
    %c32_i64 = arith.constant 32 : i64
    %c2_i64 = arith.constant 2 : i64
    %c1048576 = arith.constant 1048576 : index
    %cst_0 = arith.constant 5.4210107999999998E-20 : f64
    %c6364136223846793005_i64 = arith.constant 6364136223846793005 : i64
    %c1442695040888963407_i64 = arith.constant 1442695040888963407 : i64
    %dim = tensor.dim %arg0, %c0 : tensor<?x?xf64>
    %dim_1 = tensor.dim %arg0, %c1 : tensor<?x?xf64>
    %0 = arith.index_cast %dim : index to i64
    %1 = arith.index_cast %dim_1 : index to i64
    %2 = tensor.empty(%dim) : tensor<?x1048576xi64>
    %3 = tensor.empty(%dim) : tensor<?xf64>
    %4 = linalg.fill ins(%cst : f64) outs(%3 : tensor<?xf64>) -> tensor<?xf64>
    %5 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%arg0 : tensor<?x?xf64>) outs(%4 : tensor<?xf64>) {
    ^bb0(%in: f64, %out: f64):
      %13 = arith.addf %in, %out : f64
      linalg.yield %13 : f64
    } -> tensor<?xf64>
    %6 = scf.for %arg1 = %c0_i64 to %0 step %c1_i64 iter_args(%arg2 = %2) -> (tensor<?x1048576xi64>)  : i64 {
      %13 = arith.index_cast %arg1 : i64 to index
      %extracted = tensor.extract %5[%13] : tensor<?xf64>
      %14 = tensor.empty(%dim_1) : tensor<?xf64>
      %15 = scf.for %arg3 = %c0_i64 to %1 step %c1_i64 iter_args(%arg4 = %14) -> (tensor<?xf64>)  : i64 {
        %19 = arith.cmpi sgt, %arg3, %c0_i64 : i64
        %20 = arith.index_cast %arg3 : i64 to index
        %extracted_4 = tensor.extract %arg0[%13, %20] : tensor<?x?xf64>
        %21 = arith.divf %extracted_4, %extracted : f64
        %22 = scf.if %19 -> (f64) {
          %23 = arith.subi %arg3, %c1_i64 : i64
          %24 = arith.index_cast %23 : i64 to index
          %extracted_6 = tensor.extract %arg4[%24] : tensor<?xf64>
          %25 = arith.addf %21, %extracted_6 : f64
          scf.yield %25 : f64
        } else {
          scf.yield %21 : f64
        }
        %inserted_5 = tensor.insert %22 into %arg4[%20] : tensor<?xf64>
        scf.yield %inserted_5 : tensor<?xf64>
      }
      %global_seed = ml_program.global_load @global_seed : tensor<i64>
      %extracted_3 = tensor.extract %global_seed[] : tensor<i64>
      %16 = arith.muli %extracted_3, %c6364136223846793005_i64 : i64
      %17 = arith.addi %16, %c1442695040888963407_i64 : i64
      %inserted = tensor.insert %17 into %global_seed[] : tensor<i64>
      ml_program.global_store @global_seed = %inserted : tensor<i64>
      %18 = scf.for %arg3 = %c0_i64 to %c1048576_i64 step %c1_i64 iter_args(%arg4 = %arg2) -> (tensor<?x1048576xi64>)  : i64 {
        %19 = arith.muli %arg3, %17 : i64
        %20 = arith.addi %19, %17 : i64
        %21 = arith.muli %19, %19 : i64
        %22 = arith.addi %21, %19 : i64
        %23 = arith.shli %22, %c32_i64 : i64
        %24 = arith.shrui %22, %c32_i64 : i64
        %25 = arith.ori %23, %24 : i64
        %26 = arith.muli %25, %25 : i64
        %27 = arith.addi %26, %20 : i64
        %28 = arith.shli %27, %c32_i64 : i64
        %29 = arith.shrui %27, %c32_i64 : i64
        %30 = arith.ori %28, %29 : i64
        %31 = arith.muli %30, %30 : i64
        %32 = arith.addi %31, %19 : i64
        %33 = arith.shli %32, %c32_i64 : i64
        %34 = arith.shrui %32, %c32_i64 : i64
        %35 = arith.ori %33, %34 : i64
        %36 = arith.muli %35, %35 : i64
        %37 = arith.addi %36, %20 : i64
        %38 = arith.shli %37, %c32_i64 : i64
        %39 = arith.shrui %37, %c32_i64 : i64
        %40 = arith.ori %38, %39 : i64
        %41 = arith.muli %40, %40 : i64
        %42 = arith.addi %41, %19 : i64
        %43 = arith.shrui %42, %c32_i64 : i64
        %44 = arith.xori %37, %43 : i64
        %45 = arith.uitofp %44 : i64 to f64
        %46 = arith.mulf %45, %cst_0 : f64
        %47 = arith.addf %46, %cst : f64
        %48:2 = scf.while (%arg5 = %c0_i64, %arg6 = %1) : (i64, i64) -> (i64, i64) {
          %50 = arith.cmpi sgt, %arg6, %arg5 : i64
          scf.condition(%50) %arg5, %arg6 : i64, i64
        } do {
        ^bb0(%arg5: i64, %arg6: i64):
          %50 = arith.subi %arg6, %arg5 : i64
          %51 = arith.divsi %50, %c2_i64 : i64
          %52 = arith.addi %arg5, %51 : i64
          %53 = arith.index_cast %52 : i64 to index
          %extracted_5 = tensor.extract %15[%53] : tensor<?xf64>
          %54 = arith.cmpf olt, %extracted_5, %47 : f64
          %55 = arith.select %54, %arg6, %52 : i64
          %56 = scf.if %54 -> (i64) {
            %57 = arith.addi %52, %c1_i64 : i64
            scf.yield %57 : i64
          } else {
            scf.yield %arg5 : i64
          }
          scf.yield %56, %55 : i64, i64
        }
        %49 = arith.index_cast %arg3 : i64 to index
        %inserted_4 = tensor.insert %48#0 into %arg4[%13, %49] : tensor<?x1048576xi64>
        scf.yield %inserted_4 : tensor<?x1048576xi64>
      }
      scf.yield %18 : tensor<?x1048576xi64>
    }
    %7 = tensor.empty() : tensor<f64>
    %8 = linalg.fill ins(%cst : f64) outs(%7 : tensor<f64>) -> tensor<f64>
    %9 = linalg.generic {indexing_maps = [#map, #map2], iterator_types = ["reduction", "reduction"]} ins(%6 : tensor<?x1048576xi64>) outs(%8 : tensor<f64>) {
    ^bb0(%in: i64, %out: f64):
      %13 = arith.sitofp %in : i64 to f64
      %14 = arith.addf %13, %out : f64
      linalg.yield %14 : f64
    } -> tensor<f64>
    %dim_2 = tensor.dim %6, %c0 : tensor<?x1048576xi64>
    %10 = arith.muli %dim_2, %c1048576 : index
    %11 = arith.index_cast %10 : index to i64
    %12 = linalg.generic {indexing_maps = [#map3, #map3], iterator_types = []} ins(%9 : tensor<f64>) outs(%7 : tensor<f64>) {
    ^bb0(%in: f64, %out: f64):
      %13 = arith.sitofp %11 : i64 to f64
      %14 = arith.divf %in, %13 : f64
      linalg.yield %14 : f64
    } -> tensor<f64>
    return %12 : tensor<f64>
  }
}

repro:

iree-compile --iree-hal-target-backends=llvm-cpu repro.mlir -o repro.vmfb

hits an assertion:

iree-compile: iree/compiler/src/iree/compiler/Dialect/Util/Transforms/HoistIntoGlobals.cpp:105: auto mlir::iree_compiler::IREE::Util::(anonymous namespace)::HoistIntoGlobalsPass::runOnOperation()::(anonymous class)::operator()(mlir::Operation *) const: Assertion `resultInfo && "must have const-expr info"' failed.

@MaheshRavishankar (Contributor) commented:

Ok... we haven't looked at programs like this. I have no idea what happens when we compile this. Might take some effort. Would be good to get a signal on the priority of this.

@MaheshRavishankar (Contributor) commented:

Assigned it to me for now. It will take me a bit to look at this, but I can reprioritize it if needed.

@zjgarvey (Contributor) commented:

> Ok... we haven't looked at programs like this. I have no idea what happens when we compile this. Might take some effort. Would be good to get a signal on the priority of this.

It's a failure in one of the MIGraphX models, so I imagine it's moderate to high priority.

@MaheshRavishankar (Contributor) commented:

Ok, but this seems like it is pretty much lowered to scalar code from the outset... I am not sure what the computation is and whether we can lower it to something like linalg. All the scf.if and scf.for at this level is not what you want coming into IREE. There isn't anything really good we can do with such code (and yes, we should be able to compile it and get shitty performance, but that will take cycles).

@benvanik (Collaborator) commented:

Yeah, that loop is going to be a problem (it's going to run entirely on the host in the VM interpreter) - tensorizing may help a bit, but it really just shouldn't be scalarized like that.

@zjgarvey (Contributor) commented:

For context:

The outermost scf loop is essentially over batches. Inside that, the first scf loop computes a cumulative sum of the input distribution for the given batch; maybe this could be converted to a TM tensor scan op instead. This step generates a cumulative distribution function (CDF) from the input distribution.

Afterwards, for each desired sample, it computes a "random" probability value in [0, 1], then performs a binary search through the CDF outputs to find the last event whose cumulative probability is less than or equal to the randomly selected probability value.
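
As an illustration of that scheme, here is a minimal NumPy sketch (a hypothetical reference implementation with made-up names, not the actual torch-mlir lowering; it uses the common "first CDF entry >= r" convention for the binary search):

import numpy as np

def multinomial_sample(probs, num_samples, rng=None):
    """Per batch row: normalize, build a CDF via cumulative sum,
    draw a uniform value in [0, 1), and binary-search the CDF
    for the sampled event index."""
    if rng is None:
        rng = np.random.default_rng()
    batch = probs.shape[0]
    out = np.empty((batch, num_samples), dtype=np.int64)
    for b in range(batch):
        # Cumulative distribution function of the normalized row.
        cdf = np.cumsum(probs[b] / probs[b].sum())
        for s in range(num_samples):
            r = rng.random()
            # Binary search: first index whose cumulative
            # probability reaches r.
            out[b, s] = np.searchsorted(cdf, r, side="left")
    return out

# Example: one sample from each of two 4-event distributions,
# matching the [?,4] input / [?,1] output of the original repro.
print(multinomial_sample(np.array([[0.1, 0.2, 0.3, 0.4],
                                   [0.25, 0.25, 0.25, 0.25]]), 1))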

@benvanik (Collaborator) commented:

I'd start by looking at converting the ops if there are equivalents - it's absolutely required that those loops end up inside a dispatch region, and doing that with linalg ops (or things that lower to them, or something adjacent) is pretty much the only way. Basic loops around tensor operations can be ok (though we have some things to do to make those better), but doing global stores inside the loop, plus any tensor.extract/tensor.insert_slice, is in "it'd be faster to do this work with pen and paper" territory and a big red flag when looking at models. You can get in the habit of grepping for tensor.extract, and if you find any of them, know that your performance is going to be somewhere between bad and indistinguishable-from-hangs :P
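
For instance, a command like this over an IR dump would surface them (a hypothetical invocation; adjust the file name to wherever your dump lives):

grep -n 'tensor\.extract' ir_dump.txt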

@pdhirajkumarprasad (Author) commented:

Closing this one; nod-ai/SHARK-ModelDev#854 is open to track the front-end issue.
