
Enable Constant Propagation for ReduceLogSum Backend tests #2517

Closed · wants to merge 9 commits

Conversation

hamptonm1 (Collaborator) commented Sep 20, 2023

I re-enabled two of the ReduceLogSum backend tests that were failing because constant propagation needed to be turned on.

hamptonm1 changed the title from "Testing backend test" to "Add Enable Constant Propagation for ReduceLogSum Backend tests" on Sep 21, 2023
hamptonm1 changed the title from "Add Enable Constant Propagation for ReduceLogSum Backend tests" to "Enable Constant Propagation for ReduceLogSum Backend tests" on Sep 21, 2023
hamptonm1 self-assigned this on Sep 21, 2023
hamptonm1 marked this pull request as ready for review on September 21, 2023 01:05
tungld (Collaborator) commented Sep 21, 2023

Let @gongsu832 make the final decision, as the changes are related to the Docker files.

hamptonm1 requested review from gongsu832 and removed the request for chentong319 on September 21, 2023 15:23
hamptonm1 (Collaborator Author)

Let @gongsu832 make the final decision, as the changes are related to the Docker files.

@gongsu832 Would you be able to look over this PR please? Thanks

docker/Dockerfile.onnx-mlir (review comment on an outdated diff)
# Enable Constant Propagation
&& TEST_CONSTANT_PROP=${TEST_CONSTANT_PROP:-$([ "$(uname -m)" = "s390x" ] && echo true || \
([ "$(uname -m)" = "x86_64" ] && echo true || \
([ "$(uname -m)" = "ppc64le" ] && echo true || echo true)))} \
Collaborator

Same comment as above.

hamptonm1 (Collaborator Author)

Gotcha... makes sense

AlexandreEichenberger (Collaborator)

I re-enabled two of the ReduceLogSum backend tests that were failing because constant propagation needed to be turned on.

What does that mean? When we run at -O3, your PR makes no changes, so that should not be an issue.

Why, then, is a test failing at -O0? Can you elaborate on the failure we are trying to avoid?

hamptonm1 (Collaborator Author) commented Sep 21, 2023

I re-enabled two of the ReduceLogSum backend tests that were failing because constant propagation needed to be turned on.

What does that mean? When we run at -O3, your PR makes no changes, so that should not be an issue.

Why, then, is a test failing at -O0? Can you elaborate on the failure we are trying to avoid?

@AlexandreEichenberger The default is false, meaning constant propagation is disabled unless I enable it. If we look at the lit tests from the PR I just merged, I had to set the enable-constant-prop flag to true or else those tests failed. The same behavior applies to these two backend tests; I purposely commented them out because they were failing. For individual tests, it appears that the flag needs to be set manually for this to work, which is consistent with the other flags used by the inference_backend.py script.

Also, when running at -O0, which is the default behavior, constant propagation is now disabled, whereas previously the default ran with constant propagation enabled everywhere. So these tests depend on constant propagation being re-enabled. I asked Gong which optimization level we run JNI at, to get at the root of the problem. That should resolve everything.

hamptonm1 (Collaborator Author)

Below is the failure, which only occurred once I made my code updates:

----------------------------- Captured stderr call -----------------------------
['java', '-cp', '/workdir/onnx-mlir/build/test/backend/Debug/check-onnx-backend-constant-jni/test_reduce_log_sum_default/test_reduce_log_sum_default.jar:/usr/share/java/jsoniter-0.9.23.jar', 'com.ibm.onnxmlir.OMRunner']
munmap_chunk(): invalid pointer
=========================== short test summary info ============================
 Debug/test.py::OnnxBackendNodeModelTest::test_reduce_log_sum_default_cpu - j...
1 failed, 472 passed, 2165 skipped in 679.74s (0:11:19)
make[3]: *** [test/backend/CMakeFiles/check-onnx-backend-constant-jni.dir/build.make:71: test/backend/CMakeFiles/check-onnx-backend-constant-jni] Error 1
make[2]: *** [CMakeFiles/Makefile2:19511: test/backend/CMakeFiles/check-onnx-backend-constant-jni.dir/all] Error 2
10/10 Test  #1: TestConv .........................   Passed  931.22 sec

100% tests passed, 0 tests failed out of 10

Label Time Summary:
numerical    = 1817.38 sec*proc (10 tests)

Total Test time (real) = 931.22 sec

['java', '-cp', '/workdir/onnx-mlir/build/test/backend/Debug/check-onnx-backend-constant-jni/test_reduce_log_sum_default/test_reduce_log_sum_default.jar:/usr/share/java/jsoniter-0.9.23.jar', 'com.ibm.onnxmlir.OMRunner']
munmap_chunk(): invalid pointer

Just to clarify, @gongsu832: was the JNI check-onnx-backend-constant built with O0, O1, O2, or O3?

AlexandreEichenberger (Collaborator) commented Sep 21, 2023

I would like to know why that benchmark fails if we don't do constant propagation, because this may be the telltale sign that we have a problem.

Also, why was this not caught in our regular CIs? Your original patch could only have gone through with successful CIs on our key machines.

hamptonm1 (Collaborator Author) commented Sep 21, 2023

I would like to know why that benchmark fails if we don't do constant propagation, because this may be the telltale sign that we have a problem.

Also, why was this not caught in our regular CIs? Your original patch could only have gone through with successful CIs on our key machines.

I commented the tests out. When the backend tests were enabled before this PR, they failed in the CI, but only in the Jenkins job. When I built on my Mac, the tests had no issues, and they passed there too. As I mentioned earlier, my assumption is that JNI depends on constant propagation.

The only thing I am certain of is that these two tests worked before my code changes... I saw the failure once I created the constant propagation flag. The same applies to the 5 or 6 lit tests that failed.

gongsu832 (Collaborator)

Just to clarify, @gongsu832: was the JNI check-onnx-backend-constant built with O0, O1, O2, or O3?

JNI itself doesn't have the notion of -O0 through -O3. The native model.so is built according to whatever -O level is set for the C/C++ tests. However, we run tests with -O0 for the dev image and with -O3 for the user image. If constant propagation requires -O3, you need to turn it off when building the dev image.

tungld (Collaborator) commented Sep 22, 2023

Why, then, is a test failing at -O0? Can you elaborate on the failure we are trying to avoid?

I want to know too. Any test should pass with all combinations of -O{0,1,2,3} and constant propagation on/off. Otherwise, we need to find the bug and fix it.

@hamptonm1 could you run the failing test test_reduce_log_sum_default_cpu alone (not via check-onnx-backend-constant-jni), e.g. using RunONNXModel.py, and see why it only passes with -O3? We need it to run with -O{0,1,2} too.

hamptonm1 (Collaborator Author)

Why, then, is a test failing at -O0? Can you elaborate on the failure we are trying to avoid?

I want to know too. Any test should pass with all combinations of -O{0,1,2,3} and constant propagation on/off. Otherwise, we need to find the bug and fix it.

@hamptonm1 could you run the failing test test_reduce_log_sum_default_cpu alone (not via check-onnx-backend-constant-jni), e.g. using RunONNXModel.py, and see why it only passes with -O3? We need it to run with -O{0,1,2} too.

Okay let me test that out now and I will post results.

AlexandreEichenberger (Collaborator)

Thanks @hamptonm1, I know that disabling is quicker, but getting to the bottom of this error now is much easier than chasing the same error later on a very large model. So this is a big help. Thanks for finding the issue and helping us understand what might be wrong here; much appreciated.

hamptonm1 (Collaborator Author) commented Sep 22, 2023

Thanks @hamptonm1, I know that disabling is quicker, but getting to the bottom of this error now is much easier than chasing the same error later on a very large model. So this is a big help. Thanks for finding the issue and helping us understand what might be wrong here; much appreciated.

@AlexandreEichenberger I re-enabled the tests in this PR; that was always the plan, so they are no longer disabled. The only thing I did was create a flag to enable constant propagation for the backend tests, since they pass once I include that flag. However, I am fine with running the tests via the Python script to see if I can collect any other data.

hamptonm1 (Collaborator Author) commented Sep 22, 2023

@AlexandreEichenberger @tungld Okay, here are the results for test_reduce_log_sum_default_cpu (please let me know if you need me to test with any other flags or parameters):

meganhampton@Megans-MacBook-Pro-2 test_reduce_log_sum_default % ONNX_MLIR_HOME=/Users/meganhampton/zDLC/onnx-mlir/build/Debug /Users/meganhampton/zDLC/onnx-mlir/utils/RunONNXModel.py --model test_reduce_log_sum_default.onnx
Temporary directory has been created at /var/folders/9x/_g7t3dzn3h1011649yt57r740000gn/T/tmpox7_qr9l
Compiling the model ...
  took  0.17607170902192593  seconds.

Loading the compiled model ...
  took  0.11666995892301202  seconds.

Generating random inputs using seed 42 ...
  - 1st input's shape (3, 4, 5), element type float32. Value ranges [-0.1, 0.1]
The shape of the 2nd input is unknown. Use --shape-info to set.
 - The input signature:  {'type': 'i64', 'dims': [-1], 'name': 'axes'}

What stands out to me is this message: "The shape of the 2nd input is unknown. Use --shape-info to set."

I also tested test_reduce_log_sum_negative_axes_cpu, which worked all along; here are its results for comparison:

meganhampton@Megans-MacBook-Pro-2 test_reduce_log_sum_negative_axes % ONNX_MLIR_HOME=/Users/meganhampton/zDLC/onnx-mlir/build/Debug /Users/meganhampton/zDLC/onnx-mlir/utils/RunONNXModel.py --model test_reduce_log_sum_negative_axes.onnx 
Temporary directory has been created at /var/folders/9x/_g7t3dzn3h1011649yt57r740000gn/T/tmpilim2vad
Compiling the model ...
  took  0.331482709152624  seconds.

Loading the compiled model ...
  took  0.12550391699187458  seconds.

Generating random inputs using seed 42 ...
  - 1st input's shape (3, 4, 5), element type float32. Value ranges [-0.1, 0.1]
  - 2nd input's shape (1,), element type int64. Value ranges [-10, 10]
  done.

Running inference ...
  1st iteration: 4.8625050112605095e-05 seconds

hamptonm1 (Collaborator Author)

I am going to close this PR out; hopefully PR #2537 will solve our issues. Thanks!

hamptonm1 closed this on Sep 28, 2023