[compilation][cpu]: failed to legalize operation onnx.Multinomial #18268
Comments
@pdhirajkumarprasad For this and all the other bugs you filed recently, could you please add this flag to …
@nirvedhmeshram I have updated all my issues with the stage where it's failing, along with the version used.
Also, this may be related to llvm/torch-mlir#3630.
@zjgarvey @rsuderman Can you provide some context or input on this issue? As @pdhirajkumarprasad mentioned, it is possible that it is caused by that torch-mlir issue.
Yeah, with the PR linked above, this op will still fail to compile. I'll post a linalg reproducer for the next compilation issue here today. This is another op that successfully compiles through our ref-backend pipeline in torch-mlir e2e testing, but not through iree-compile.
@zjgarvey The error showing here is orthogonal to the IREE issue: it's not converting from torch to linalg. Maybe it was recently added and the torch-mlir submodule hasn't been bumped in IREE.
@pashu123 The PR llvm/torch-mlir#3630 resolves the torch-mlir error.
If someone wants to review that PR, we can get it merged and then work on the next issue for this op not compiling.
Ah, right. Here is the IR generated from an e2e test in torch-mlir: "MultinomialModule2D_basic". This successfully compiles through the ref-backend pipeline in torch-mlir, but fails to compile with IREE for llvm-cpu.

#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0)>
#map2 = affine_map<(d0, d1) -> ()>
#map3 = affine_map<() -> ()>
module attributes {torch.debug_module_name = "MultinomialModule2D"} {
ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
func.func @forward(%arg0: tensor<?x?xf64>) -> tensor<f64> {
%c1048576_i64 = arith.constant 1048576 : i64
%c0_i64 = arith.constant 0 : i64
%c1_i64 = arith.constant 1 : i64
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%cst = arith.constant 0.000000e+00 : f64
%c32_i64 = arith.constant 32 : i64
%c2_i64 = arith.constant 2 : i64
%c1048576 = arith.constant 1048576 : index
%cst_0 = arith.constant 5.4210107999999998E-20 : f64
%c6364136223846793005_i64 = arith.constant 6364136223846793005 : i64
%c1442695040888963407_i64 = arith.constant 1442695040888963407 : i64
%dim = tensor.dim %arg0, %c0 : tensor<?x?xf64>
%dim_1 = tensor.dim %arg0, %c1 : tensor<?x?xf64>
%0 = arith.index_cast %dim : index to i64
%1 = arith.index_cast %dim_1 : index to i64
%2 = tensor.empty(%dim) : tensor<?x1048576xi64>
%3 = tensor.empty(%dim) : tensor<?xf64>
%4 = linalg.fill ins(%cst : f64) outs(%3 : tensor<?xf64>) -> tensor<?xf64>
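// Sum each row of the input distribution (one normalization total per batch row).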
%5 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%arg0 : tensor<?x?xf64>) outs(%4 : tensor<?xf64>) {
^bb0(%in: f64, %out: f64):
%13 = arith.addf %in, %out : f64
linalg.yield %13 : f64
} -> tensor<?xf64>
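// Outer loop over batch rows; each iteration fills one row of the ?x1048576 tensor of sampled indices.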
%6 = scf.for %arg1 = %c0_i64 to %0 step %c1_i64 iter_args(%arg2 = %2) -> (tensor<?x1048576xi64>) : i64 {
%13 = arith.index_cast %arg1 : i64 to index
%extracted = tensor.extract %5[%13] : tensor<?xf64>
%14 = tensor.empty(%dim_1) : tensor<?xf64>
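// Running cumulative sum of the normalized row, i.e. the CDF used for sampling below.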
%15 = scf.for %arg3 = %c0_i64 to %1 step %c1_i64 iter_args(%arg4 = %14) -> (tensor<?xf64>) : i64 {
%19 = arith.cmpi sgt, %arg3, %c0_i64 : i64
%20 = arith.index_cast %arg3 : i64 to index
%extracted_4 = tensor.extract %arg0[%13, %20] : tensor<?x?xf64>
%21 = arith.divf %extracted_4, %extracted : f64
%22 = scf.if %19 -> (f64) {
%23 = arith.subi %arg3, %c1_i64 : i64
%24 = arith.index_cast %23 : i64 to index
%extracted_6 = tensor.extract %arg4[%24] : tensor<?xf64>
%25 = arith.addf %21, %extracted_6 : f64
scf.yield %25 : f64
} else {
scf.yield %21 : f64
}
%inserted_5 = tensor.insert %22 into %arg4[%20] : tensor<?xf64>
scf.yield %inserted_5 : tensor<?xf64>
}
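// Advance the global RNG seed with one LCG step and store it back.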
%global_seed = ml_program.global_load @global_seed : tensor<i64>
%extracted_3 = tensor.extract %global_seed[] : tensor<i64>
%16 = arith.muli %extracted_3, %c6364136223846793005_i64 : i64
%17 = arith.addi %16, %c1442695040888963407_i64 : i64
%inserted = tensor.insert %17 into %global_seed[] : tensor<i64>
ml_program.global_store @global_seed = %inserted : tensor<i64>
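// Per-sample loop: hash the sample index together with the updated seed into a pseudo-random i64, scale it to a uniform f64 in [0, 1), then binary-search the CDF.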
%18 = scf.for %arg3 = %c0_i64 to %c1048576_i64 step %c1_i64 iter_args(%arg4 = %arg2) -> (tensor<?x1048576xi64>) : i64 {
%19 = arith.muli %arg3, %17 : i64
%20 = arith.addi %19, %17 : i64
%21 = arith.muli %19, %19 : i64
%22 = arith.addi %21, %19 : i64
%23 = arith.shli %22, %c32_i64 : i64
%24 = arith.shrui %22, %c32_i64 : i64
%25 = arith.ori %23, %24 : i64
%26 = arith.muli %25, %25 : i64
%27 = arith.addi %26, %20 : i64
%28 = arith.shli %27, %c32_i64 : i64
%29 = arith.shrui %27, %c32_i64 : i64
%30 = arith.ori %28, %29 : i64
%31 = arith.muli %30, %30 : i64
%32 = arith.addi %31, %19 : i64
%33 = arith.shli %32, %c32_i64 : i64
%34 = arith.shrui %32, %c32_i64 : i64
%35 = arith.ori %33, %34 : i64
%36 = arith.muli %35, %35 : i64
%37 = arith.addi %36, %20 : i64
%38 = arith.shli %37, %c32_i64 : i64
%39 = arith.shrui %37, %c32_i64 : i64
%40 = arith.ori %38, %39 : i64
%41 = arith.muli %40, %40 : i64
%42 = arith.addi %41, %19 : i64
%43 = arith.shrui %42, %c32_i64 : i64
%44 = arith.xori %37, %43 : i64
%45 = arith.uitofp %44 : i64 to f64
%46 = arith.mulf %45, %cst_0 : f64
%47 = arith.addf %46, %cst : f64
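// Binary search (lower bound) over the CDF: find the first entry >= the uniform draw.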
%48:2 = scf.while (%arg5 = %c0_i64, %arg6 = %1) : (i64, i64) -> (i64, i64) {
%50 = arith.cmpi sgt, %arg6, %arg5 : i64
scf.condition(%50) %arg5, %arg6 : i64, i64
} do {
^bb0(%arg5: i64, %arg6: i64):
%50 = arith.subi %arg6, %arg5 : i64
%51 = arith.divsi %50, %c2_i64 : i64
%52 = arith.addi %arg5, %51 : i64
%53 = arith.index_cast %52 : i64 to index
%extracted_5 = tensor.extract %15[%53] : tensor<?xf64>
%54 = arith.cmpf olt, %extracted_5, %47 : f64
%55 = arith.select %54, %arg6, %52 : i64
%56 = scf.if %54 -> (i64) {
%57 = arith.addi %52, %c1_i64 : i64
scf.yield %57 : i64
} else {
scf.yield %arg5 : i64
}
scf.yield %56, %55 : i64, i64
}
%49 = arith.index_cast %arg3 : i64 to index
%inserted_4 = tensor.insert %48#0 into %arg4[%13, %49] : tensor<?x1048576xi64>
scf.yield %inserted_4 : tensor<?x1048576xi64>
}
scf.yield %18 : tensor<?x1048576xi64>
}
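// Reduce all sampled indices to their mean, returned as a scalar tensor<f64>.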
%7 = tensor.empty() : tensor<f64>
%8 = linalg.fill ins(%cst : f64) outs(%7 : tensor<f64>) -> tensor<f64>
%9 = linalg.generic {indexing_maps = [#map, #map2], iterator_types = ["reduction", "reduction"]} ins(%6 : tensor<?x1048576xi64>) outs(%8 : tensor<f64>) {
^bb0(%in: i64, %out: f64):
%13 = arith.sitofp %in : i64 to f64
%14 = arith.addf %13, %out : f64
linalg.yield %14 : f64
} -> tensor<f64>
%dim_2 = tensor.dim %6, %c0 : tensor<?x1048576xi64>
%10 = arith.muli %dim_2, %c1048576 : index
%11 = arith.index_cast %10 : index to i64
%12 = linalg.generic {indexing_maps = [#map3, #map3], iterator_types = []} ins(%9 : tensor<f64>) outs(%7 : tensor<f64>) {
^bb0(%in: f64, %out: f64):
%13 = arith.sitofp %11 : i64 to f64
%14 = arith.divf %in, %13 : f64
linalg.yield %14 : f64
} -> tensor<f64>
return %12 : tensor<f64>
}
}

repro:
…

hits an assertion:
…
Ok... we haven't looked at programs like this, and I have no idea what happens when we compile this. It might take some effort. It would be good to get a signal on the priority of this.
Assigned it to me for now. It will take me a bit to look at this, but I can reprioritize it if needed.
It's a failure in one of the MIGraphX models, so I imagine it's moderate to high priority.
Ok, but this seems like it is pretty much lowered to scalar code from the outset... I am not sure what the computation is, or whether we can lower it to something like linalg. All the …
Yeah, that loop is going to be a problem (it's going to run entirely on the host in the VM interpreter). Tensorizing may help a bit, but it really just shouldn't be scalarized like that.
For context: the outermost scf loop is essentially over batches. Inside that, the first scf loop does a cumulative sum of the input distribution for the given batch; maybe this could be converted to a TM tensor scan op instead. This step produces a cumulative distribution function (CDF) from the input distribution. Afterwards, for each desired sample, it computes a "random" probability value in [0, 1), then performs a binary search through the CDF values to find the first event whose cumulative probability is greater than or equal to the randomly selected probability value.
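For readers less familiar with the op, here is a rough NumPy sketch of the computation the IR above performs; the function name and shapes are illustrative, not anything IREE or torch-mlir exposes.

import numpy as np

def multinomial_sample(probs, num_samples, rng):
    # probs: (batch, num_events) unnormalized distribution.
    # Returns a (batch, num_samples) array of sampled event indices.
    batch = probs.shape[0]
    out = np.empty((batch, num_samples), dtype=np.int64)
    for b in range(batch):
        # First inner loop in the IR: running cumulative sum -> CDF for this row.
        cdf = np.cumsum(probs[b] / probs[b].sum())
        for s in range(num_samples):
            # The IR derives this uniform value in [0, 1) from a hashed counter and the global seed.
            r = rng.random()
            # Binary search (lower bound): the first CDF entry >= r is the sampled event.
            out[b, s] = np.searchsorted(cdf, r, side="left")
    return out

samples = multinomial_sample(np.array([[1.0, 2.0, 3.0, 4.0]]), 8, np.random.default_rng(0))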
I'd start by looking at converting the ops if there are equivalents. It's absolutely required that those loops end up inside a dispatch region, and doing that with linalg ops (or things that lower to them, or something adjacent) is pretty much the only way. Basic loops around tensor operations can be ok (though we have some things to do to make those better), but doing global stores inside the loop and any …
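As a rough illustration of that direction, here is a NumPy sketch of one possible tensorized formulation (not an existing IREE or torch-mlir lowering): a scan produces the CDF, and a broadcasted compare plus reduction replaces the per-sample binary search, so every step is an elementwise, scan, or reduction op that could live inside a dispatch region.

import numpy as np

def multinomial_tensorized(probs, uniforms):
    # probs:    (batch, num_events) unnormalized distribution.
    # uniforms: (batch, num_samples) pre-drawn uniform values in [0, 1).
    # CDF per row via a scan (e.g. a TM tensor scan op after lowering).
    cdf = np.cumsum(probs / probs.sum(axis=-1, keepdims=True), axis=-1)
    # The sampled index equals the count of CDF entries strictly below the draw:
    # an elementwise compare followed by a reduction over the event axis.
    return (cdf[:, None, :] < uniforms[:, :, None]).sum(axis=-1)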
Closing this one; opened nod-ai/SHARK-ModelDev#854 to track the front-end issue.
What happened?
For the given IR, I am getting the following error:
IREE Version:
IREE compiler version 20240819.990 @ aeda149
LLVM version 20.0.0git
Steps to reproduce your issue
Command to reproduce the issue:
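A typical llvm-cpu compile command for a reproducer like the IR above would look roughly like this (the file name is a placeholder, and this may not match the exact command from the original report):

iree-compile model.mlir --iree-hal-target-backends=llvm-cpu -o model.vmfb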
What component(s) does this issue relate to?
Compiler
Version information
No response
Additional context
No response