Enable Constant Propagation for ReduceLogSum Backend tests #2517

Closed
wants to merge 9 commits into from

hamptonm1
Collaborator

@hamptonm1 hamptonm1 commented Sep 20, 2023

I re-enabled two of the ReduceLogSum backend tests that were failing because constant propagation needed to be turned on.

@hamptonm1 hamptonm1 changed the title Testing backend test Add Enable Constant Propagation for ReduceLogSum Backend tests Sep 21, 2023
@hamptonm1 hamptonm1 changed the title Add Enable Constant Propagation for ReduceLogSum Backend tests Enable Constant Propagation for ReduceLogSum Backend tests Sep 21, 2023
@hamptonm1 hamptonm1 self-assigned this Sep 21, 2023
@hamptonm1 hamptonm1 marked this pull request as ready for review September 21, 2023 01:05
@tungld
Collaborator

tungld commented Sep 21, 2023

Let @gongsu832 make the final decision, as the changes are related to Docker.

@hamptonm1 hamptonm1 requested review from gongsu832 and removed request for chentong319 September 21, 2023 15:23
@hamptonm1
Collaborator Author

> Let @gongsu832 make the final decision as the changes are related to dockers.

@gongsu832 Would you be able to look over this PR please? Thanks

docker/Dockerfile.onnx-mlir Outdated
# Enable Constant Propagation
&& TEST_CONSTANT_PROP=${TEST_CONSTANT_PROP:-$([ "$(uname -m)" = "s390x" ] && echo true || \
([ "$(uname -m)" = "x86_64" ] && echo true || \
([ "$(uname -m)" = "ppc64le" ] && echo true || echo true)))} \
Collaborator

Same comment as above.

Collaborator Author

Gotcha... makes sense
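
A side note on the Dockerfile snippet above: every branch of the nested architecture check ends in `echo true`, so the default is `true` whatever `uname -m` reports. A minimal sketch of that observation, assuming the intent is simply "default to true unless the caller overrides it". The variable names `nested` and `simple` are illustrative only, and the reviewer's actual comment is not shown in this thread, so this may not be what was meant:

```shell
#!/bin/sh
# Start from a clean state so the default branch is exercised.
unset TEST_CONSTANT_PROP

# Form taken from the Dockerfile snippet: each architecture branch echoes "true".
nested=${TEST_CONSTANT_PROP:-$([ "$(uname -m)" = "s390x" ] && echo true || \
  ([ "$(uname -m)" = "x86_64" ] && echo true || \
  ([ "$(uname -m)" = "ppc64le" ] && echo true || echo true)))}

# Shorter form (assumption: same intent), default to "true" unless overridden.
simple=${TEST_CONSTANT_PROP:-true}

echo "nested=$nested simple=$simple"   # prints: nested=true simple=true
```

The nested form only pays off if some architecture should default to a different value.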

@AlexandreEichenberger
Collaborator

> I reenabled two of the ReduceLogSum that were failing because constant propagation needed to be turned on.

What does that mean? When we run at -O3, no changes were done by your PR, so that should not be an issue.

Why is then a test failing at -O0? Can you elaborate on the failure we are trying to avoid?

@hamptonm1
Collaborator Author

hamptonm1 commented Sep 21, 2023

> > I reenabled two of the ReduceLogSum that were failing because constant propagation needed to be turned on.
>
> What does that mean? When we run at -O3, no changes were done by your PR, so that should not be an issue.
>
> Why is then a test failing at -O0? Can you elaborate on the failure we are trying to avoid?

@AlexandreEichenberger The default is false, meaning constant propagation is disabled unless I enable it explicitly. Take the lit tests from the PR I just merged: I had to set the enable-constant-prop flag to true, or else those tests failed. The same behavior applies to these two backend tests, which I had purposely commented out because they were failing. For individual tests, it appears that the flag needs to be set manually for this to work, which is consistent with the other flags used by the inference_backend.py script.

Also, if we are running at -O0, which is the default, constant propagation is now disabled, whereas previously the default ran with constant propagation enabled everywhere. So these tests depend on constant propagation being re-enabled. I asked Gong which optimization level JNI runs at, to find the root problem. That should resolve everything.
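
To illustrate the mechanism described above (an environment toggle that turns into a compiler flag), here is a minimal sketch. The helper `build_compile_args` is hypothetical and not part of onnx-mlir or inference_backend.py. The spelling `--enable-constant-prop=true` is an assumption based on the flag name mentioned in this comment:

```shell
#!/bin/sh
# Hypothetical helper: map the TEST_CONSTANT_PROP toggle to compiler arguments.
# $1 = optimization level (0-3), defaulting to 0.
build_compile_args() {
  opt_level=${1:-0}
  args="-O${opt_level}"
  # Constant propagation stays off unless the toggle is exactly "true".
  if [ "${TEST_CONSTANT_PROP:-false}" = "true" ]; then
    args="$args --enable-constant-prop=true"   # assumed flag spelling
  fi
  echo "$args"
}

# Subshells keep the toggle from leaking into the calling shell.
(TEST_CONSTANT_PROP=true; build_compile_args 0)   # prints: -O0 --enable-constant-prop=true
(unset TEST_CONSTANT_PROP; build_compile_args 3)  # prints: -O3
```

With the toggle unset, nothing is added, which matches the "disabled unless I enable it" default described above.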

@hamptonm1 hamptonm1 requested a review from gongsu832 September 21, 2023 18:03
@hamptonm1
Collaborator Author

Below is the failure, which only occurred once I made my code updates:

----------------------------- Captured stderr call -----------------------------
['java', '-cp', '/workdir/onnx-mlir/build/test/backend/Debug/check-onnx-backend-constant-jni/test_reduce_log_sum_default/test_reduce_log_sum_default.jar:/usr/share/java/jsoniter-0.9.23.jar', 'com.ibm.onnxmlir.OMRunner']
munmap_chunk(): invalid pointer
=========================== short test summary info ============================
 Debug/test.py::OnnxBackendNodeModelTest::test_reduce_log_sum_default_cpu - j...
1 failed, 472 passed, 2165 skipped in 679.74s (0:11:19)
make[3]: *** [test/backend/CMakeFiles/check-onnx-backend-constant-jni.dir/build.make:71: test/backend/CMakeFiles/check-onnx-backend-constant-jni] Error 1
make[2]: *** [CMakeFiles/Makefile2:19511: test/backend/CMakeFiles/check-onnx-backend-constant-jni.dir/all] Error 2
10/10 Test  #1: TestConv .........................   Passed  931.22 sec

100% tests passed, 0 tests failed out of 10

Label Time Summary:
numerical    = 1817.38 sec*proc (10 tests)

Total Test time (real) = 931.22 sec

['java', '-cp', '/workdir/onnx-mlir/build/test/backend/Debug/check-onnx-backend-constant-jni/test_reduce_log_sum_default/test_reduce_log_sum_default.jar:/usr/share/java/jsoniter-0.9.23.jar', 'com.ibm.onnxmlir.OMRunner']
munmap_chunk(): invalid pointer

Just to clarify, @gongsu832: was the JNI check-onnx-backend-constant built with O0, O1, O2, or O3?

@AlexandreEichenberger
Collaborator

AlexandreEichenberger commented Sep 21, 2023

I would like to know why that benchmark fails if we don't do constant propagation, because this may be a telltale sign that we have a problem.

Also, why was this not caught in our regular CIs? Your original patch could only have gone through with successful CIs on our key machines.

@hamptonm1
Collaborator Author

hamptonm1 commented Sep 21, 2023

> I would like to know why that benchmark fails if we don't do constant propagation. Because this may be the tell tale that we have a problem.
>
> Also, why was this not caught in our regular CIs? Since your original patch could only have gone through with successful CIs on our key machines.

I commented the tests out. When the backend tests were enabled before this PR, they failed in the CI, but only for the Jenkins job. When I built on my Mac, the tests had no issues, and they passed here too. As I mentioned earlier, my assumption is that JNI is dependent on constant propagation.

The only thing I am certain of is that these two tests worked before my code changes. I saw the failure only after creating the constant propagation flag. The same applies to about 5 or 6 lit tests that failed.

@gongsu832
Collaborator

> Just to clarify @gongsu832 is JNI check-onnx-backend-constant was built with O0, O1, O2, or O3?

JNI itself doesn't have the notion of O0-3. The native model.so is built according to whatever O level is set for the C/C++ tests. However, we run tests with -O0 for the dev image and with -O3 for the user image. If constant propagation requires -O3, you need to turn it off when building the dev image.

@tungld
Collaborator

tungld commented Sep 22, 2023

> Why is then a test failing at -O0? Can you elaborate on the failure we are trying to avoid?

Want to know too. Any test should pass with all options/combinations of -O{0,1,2,3} and constant propagation on/off. Otherwise, we need to find the bug and fix it.

@hamptonm1 could you run the failed test test_reduce_log_sum_default_cpu alone (not via check-onnx-backend-constant-jni), e.g. using RunONNXModel.py and see why it only passed with -O3? We need it runnable with -O{0,1,2} too.
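
The combinations asked for here form a small matrix: four optimization levels times constant propagation on or off. A minimal sketch that enumerates them, assuming the flag spelling `--enable-constant-prop=<bool>` (inferred from the flag name used earlier in this thread). `list_configs` is a hypothetical helper, not an existing script:

```shell
#!/bin/sh
# Hypothetical helper: print one compiler-argument line per configuration,
# covering -O0..-O3 with constant propagation on and off (8 lines total).
list_configs() {
  for opt in 0 1 2 3; do
    for cp in true false; do
      echo "-O${opt} --enable-constant-prop=${cp}"   # assumed flag spelling
    done
  done
}

list_configs
```

Each printed line could then drive one single-test run. How RunONNXModel.py accepts extra compiler arguments is not shown in this thread, so that wiring is left out.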

@hamptonm1
Collaborator Author

> > Why is then a test failing at -O0? Can you elaborate on the failure we are trying to avoid?
>
> Want to know too. Any test should be passed with all options/combinations of -O{0,1,2,3} and constant propagation on/off. Otherwise, we need to find out the bug to fix.
>
> @hamptonm1 could you run the failed test test_reduce_log_sum_default_cpu alone (not via check-onnx-backend-constant-jni), e.g. using RunONNXModel.py and see why it only passed with -O3? We need it runnable with -O{0,1,2} too.

Okay let me test that out now and I will post results.

@AlexandreEichenberger
Collaborator

Thanks @hamptonm1, I know that disabling is quicker, but getting to the bottom of this error now is much easier than having to chase the same error on a very large model. So this is a big help. Thanks for finding the issue and helping understand what might be wrong here, much appreciated.

@hamptonm1
Collaborator Author

hamptonm1 commented Sep 22, 2023

> Thanks @hamptonm1 , I know that disabling is quicker but getting to the bottom of this error now is much easier than having to go on chasing the same error on a very large model. So this is a big help. Thanks for finding the issue and helping understand what might be wrong here, much appreciated.

@AlexandreEichenberger I re-enabled the tests in the PR, which was always the plan, so they are no longer disabled. The only thing I did was create a flag to enable constant propagation for the backend tests, since they pass once I include that flag. However, I am fine with running the tests via the Python script to see if I can collect any other data.

@hamptonm1
Collaborator Author

hamptonm1 commented Sep 22, 2023

@AlexandreEichenberger @tungld Okay, here are the results for test_reduce_log_sum_default_cpu (please let me know if you need me to test with any other flags/parameters):

meganhampton@Megans-MacBook-Pro-2 test_reduce_log_sum_default % ONNX_MLIR_HOME=/Users/meganhampton/zDLC/onnx-mlir/build/Debug /Users/meganhampton/zDLC/onnx-mlir/utils/RunONNXModel.py --model test_reduce_log_sum_default.onnx
Temporary directory has been created at /var/folders/9x/_g7t3dzn3h1011649yt57r740000gn/T/tmpox7_qr9l
Compiling the model ...
  took  0.17607170902192593  seconds.

Loading the compiled model ...
  took  0.11666995892301202  seconds.

Generating random inputs using seed 42 ...
  - 1st input's shape (3, 4, 5), element type float32. Value ranges [-0.1, 0.1]
The shape of the 2nd input is unknown. Use --shape-info to set.
 - The input signature:  {'type': 'i64', 'dims': [-1], 'name': 'axes'}

What stands out to me is this message: "The shape of the 2nd input is unknown. Use --shape-info to set."

I also tested test_reduce_log_sum_negative_axes_cpu, which worked all along; here are the results for comparison:

meganhampton@Megans-MacBook-Pro-2 test_reduce_log_sum_negative_axes % ONNX_MLIR_HOME=/Users/meganhampton/zDLC/onnx-mlir/build/Debug /Users/meganhampton/zDLC/onnx-mlir/utils/RunONNXModel.py --model test_reduce_log_sum_negative_axes.onnx 
Temporary directory has been created at /var/folders/9x/_g7t3dzn3h1011649yt57r740000gn/T/tmpilim2vad
Compiling the model ...
  took  0.331482709152624  seconds.

Loading the compiled model ...
  took  0.12550391699187458  seconds.

Generating random inputs using seed 42 ...
  - 1st input's shape (3, 4, 5), element type float32. Value ranges [-0.1, 0.1]
  - 2nd input's shape (1,), element type int64. Value ranges [-10, 10]
  done.

Running inference ...
  1st iteration: 4.8625050112605095e-05 seconds
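
The message in the first run points at `--shape-info`. A minimal sketch of building that argument, assuming the value has the form `index:dims` with a zero-based input index and dimensions joined by `x`. Both this format and the helper `shape_info_arg` are assumptions, not confirmed in this thread, so check the script's `--help` before relying on them:

```shell
#!/bin/sh
# Hypothetical helper: format a --shape-info argument for one input.
# $1 = zero-based input index; remaining arguments = dimension sizes.
shape_info_arg() {
  idx=$1
  shift
  # "$*" joins the dimensions with spaces; tr turns them into 3x4x5 style.
  dims=$(echo "$*" | tr ' ' 'x')
  echo "--shape-info ${idx}:${dims}"   # assumed index:dims format
}

shape_info_arg 1 1       # prints: --shape-info 1:1
shape_info_arg 0 3 4 5   # prints: --shape-info 0:3x4x5
```

In the passing run above, the 2nd input had shape (1,), so `shape_info_arg 1 1` gives one value to try for the `axes` input of the failing model.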

@hamptonm1
Collaborator Author

I am going to close this PR out; hopefully this PR will solve our issues: #2537. Thanks!

@hamptonm1 hamptonm1 closed this Sep 28, 2023
5 participants