WIP: Debugging jenkins build of PR 4950 #5097

Closed
wants to merge 19 commits into from

Conversation

hakonhagland
Contributor

Refer to #4950. I am not able to reproduce the jenkins failure on my own laptop, so I am creating a new PR here to debug that Jenkins build by bisection (removing and adding code from #4950 until the jenkins build succeeds).

@hakonhagland
Contributor Author

jenkins build this please

@bska
Member

bska commented Jan 9, 2024

I am creating a new PR here to debug that Jenkins build by bisection

Would you mind putting the PR in "draft" mode then please, to prevent inadvertent merging?

@hakonhagland hakonhagland marked this pull request as draft January 9, 2024 09:13
@hakonhagland
Contributor Author

hakonhagland commented Jan 9, 2024

Would you mind putting the PR in "draft" mode

@bska Done.

@bska
Member

bska commented Jan 9, 2024

Would you mind putting the PR in "draft" mode

@bska Done.

Thanks–much appreciated!

@hakonhagland
Contributor Author

jenkins build this please

@hakonhagland
Contributor Author

jenkins build this please

@hakonhagland
Contributor Author

jenkins build this please

@hakonhagland
Contributor Author

jenkins build this please

@hakonhagland
Contributor Author

@akva and @bska Is it possible to run a specific test case with verbose output on the jenkins server? For example, I would like to run

ctest --output-on-failure -V -R python_fluid

The -V option gives verbose output, so that I can add print statements like:

std::cout << "getFluidStateVariable: " << name << std::endl;

to the test case source code and try to determine where the segfault occurs by looking at the console output from the jenkins build.
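
A minimal sketch of this kind of tracing, assuming hypothetical helper names (`formatTrace`, `trace`, `DEBUG_TRACE`) that are not part of OPM. The point it illustrates is flushing: when output is captured by ctest rather than written to a terminal, `std::cout` is typically block-buffered, so a message printed just before a segfault can be lost unless the stream is flushed. `std::endl` in the statement above already flushes; the helper only makes that explicit and adds the source location.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of OPM): build a "file:line: message" string.
inline std::string formatTrace(const std::string& file, int line, const std::string& msg)
{
    assert(line > 0);
    std::ostringstream os;
    os << file << ":" << line << ": " << msg;
    return os.str();
}

// Write the trace line and flush immediately, so that the message is not
// left in a buffer if the process crashes right after this call.
inline void trace(std::ostream& out, const std::string& file, int line, const std::string& msg)
{
    out << formatTrace(file, line, msg) << '\n' << std::flush;
}

// Convenience macro (hypothetical name) that fills in the current location.
#define DEBUG_TRACE(msg) trace(std::cerr, __FILE__, __LINE__, (msg))
```

With this, a call such as `DEBUG_TRACE("getFluidStateVariable: " + name);` placed before and after a suspect statement brackets the crash between two log lines.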

@akva2
Member

akva2 commented Jan 10, 2024

You can see the verbose output on jenkins for any failed test so nothing has to be added https://ci.opm-project.org/job/opm-simulators-PR-builder/5750/testReport/junit/(root)/mpi/python_fluidstate_variables/

@bska
Member

bska commented Jan 10, 2024

You can see the verbose output on jenkins for any failed test

True, with one small caveat: If the transcript exceeds some upper size threshold, then it will be truncated to the first "threshold" bytes, and then it becomes a bit of a guessing game. We've sometimes run into that problem when regression test models with a large number of report steps fail, because our comparison tool is rather chatty.

@hakonhagland
Contributor Author

jenkins build this please

@akva2
Member

akva2 commented Jan 10, 2024

As for the segfault it looks like attached.
gdb.txt

I had to rebase on master as there is a bug in your base that triggers a problem earlier.

@hakonhagland
Contributor Author

hakonhagland commented Jan 10, 2024

As for the segfault it looks like attached.

@akva2 Great! This is a large step forward in solving this.

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

If the transcript exceeds some upper size threshold, then it will be truncated to the first "threshold" bytes

@bska Yes, that seems to be what happened in the last build: https://ci.opm-project.org/job/opm-simulators-PR-builder/5761/testReport/junit/(root)/mpi/python_fluidstate_variables/
The limit seems to be 307200 bytes.
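
To illustrate why this matters for print-based debugging, here is a rough model of the truncation, assuming it simply keeps the first 307200 bytes (the function names are hypothetical, and the real Jenkins behaviour may differ in detail). The trace line printed just before a crash is the last thing in the transcript, so it is the first thing to disappear when earlier output is too chatty.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Limit observed in the Jenkins test report discussed above.
constexpr std::size_t transcriptLimit = 307200;

// Hypothetical model of the truncation: keep only the first `limit` bytes.
inline std::string truncateTranscript(const std::string& transcript,
                                      std::size_t limit = transcriptLimit)
{
    return transcript.substr(0, limit);
}

// True if `marker` is still visible in the transcript after truncation.
inline bool markerSurvives(const std::string& transcript,
                           const std::string& marker,
                           std::size_t limit = transcriptLimit)
{
    assert(!marker.empty());
    return truncateTranscript(transcript, limit).find(marker) != std::string::npos;
}
```

Under this model, keeping the debug output short (or removing earlier print statements once they are no longer needed) is what keeps the final marker inside the visible part of the log.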

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

@akva2 The last jenkins build https://ci.opm-project.org/job/opm-simulators-PR-builder/5766/ reports "No changes" in the Status tab, even though I updated the opm-models PR OPM/opm-models#861 before starting the build. Do I also need to update the current PR in order for it to rerun the tests?

@akva2
Member

akva2 commented Jan 12, 2024

no, but you do need to fix the build failure...

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

jenkins build this opm-models=861 please

@hakonhagland
Contributor Author

hakonhagland commented Jan 21, 2024

As for the segfault it looks like attached. gdb.txt

@akva2 From the backtrace it looks like _M_erase_at_end() is called:

#0  std::vector<Opm::FvBaseElementContext<Opm::Properties::TTag::EclFlowProblemTPFA>::DofStore_, Opm::aligned_allocator<Opm::FvBaseElementContext<Opm::Properties::TTag::EclFlowProblemTPFA>::DofStore_, 8ul> >::_M_erase_at_end (
    __pos=0x970, this=0x7fffffffa260) at /usr/include/c++/11/bits/stl_vector.h:1796

This looks strange, since resize() would only erase elements at the end if the size of dofVars_

https://github.com/OPM/opm-models/blob/6ac779703599c7464b41e8c32892126f70cc1194/opm/models/discretization/common/fvbaseelementcontext.hh#L158

would be greater than stencil_.numDof(). The size of the stencil should always be equal to 1 since we are using stencil_.updatePrimaryTopology(elem) on line 156

https://github.com/OPM/opm-models/blob/6ac779703599c7464b41e8c32892126f70cc1194/opm/models/discretization/common/fvbaseelementcontext.hh#L156

and then numDof() should return the number of sub control volumes (which is 1), see:

https://github.com/OPM/opm-models/blob/6ac779703599c7464b41e8c32892126f70cc1194/opm/models/discretization/ecfv/ecfvstencil.hh#L331

And the size of dofVars_ should always be 0 (in the case the element context has just been created) or 1 after updatePrimaryStencil() has been called (which resizes dofVars_ to 1).

More information on the std::vector resize() method can be found here:

https://github.com/gcc-mirror/gcc/blob/5c3e2e134ba8e692f317f21aea10b70bfe14cfc1/libstdc%2B%2B-v3/include/bits/stl_vector.h#L1015
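
The branch logic of `resize()` referred to above can be sketched as follows. This is a simplified model of the comparison in libstdc++'s `std::vector::resize(size_type)`, not the library code itself; the function names `resizeBranch` and `resizeBranchFor` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Which branch would resize(newSize) take for a vector of the given size?
// Simplified model of the comparison made in libstdc++'s vector::resize().
inline std::string resizeBranch(std::size_t currentSize, std::size_t newSize)
{
    if (newSize > currentSize)
        return "append";        // libstdc++: _M_default_append(newSize - size())
    if (newSize < currentSize)
        return "erase_at_end";  // libstdc++: _M_erase_at_end(start + newSize)
    return "no-op";
}

// The same question asked of an actual vector.
template <class T>
std::string resizeBranchFor(const std::vector<T>& v, std::size_t newSize)
{
    return resizeBranch(v.size(), newSize);
}
```

For a freshly created `dofVars_` (size 0) resized to 1, this model gives "append"; for a vector that already has size 1 it gives "no-op". Reaching `_M_erase_at_end()` would require a size greater than 1, which is why the backtrace is surprising.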

@akva2 It would be great if you could run gdb and check why _M_erase_at_end() was called. I wonder what the size of dofVars_ was when the segmentation fault happened.

@akva2
Member

akva2 commented Jan 22, 2024

size is 0, wants to resize to 1. It all looks very funky in the debugger: it's apparently in the wrong code path, because size() returns 0 and __new_size is 1, yet the erase_at_end conditional seems to evaluate to true.
The valgrind output is just as confusing:
foo.log

@hakonhagland
Contributor Author

size is 0, wants to resize to 1

@akva2 Yes that is confusing. I have now resized it to 1 when the element context is constructed, see line 106 in fvbaseelementcontext.hh in OPM/opm-models#861:

https://github.com/OPM/opm-models/blob/fcb03d3af389c92a6ee051b3f8a3596b0ce30632/opm/models/discretization/common/fvbaseelementcontext.hh#L106

This allows the Jenkins run to run a little bit further:

https://ci.opm-project.org/job/opm-common-PR-builder/6688/testReport/(root)/mpi/python_fluidstate_variables/

Based on the output from the Jenkins build:

i=0
u
1-1
x
r
a
b
s
j=0
Segmentation fault

it appears that the segfault now occurs after the line j=0 is printed, that is, after line 143 in PyFluidState_impl.hpp:

std::cout << "j=" << dof_idx << std::endl;

and before line 468 in fvbaseelementcontext.hh, see:

https://github.com/OPM/opm-models/blob/fcb03d3af389c92a6ee051b3f8a3596b0ce30632/opm/models/discretization/common/fvbaseelementcontext.hh#L468

which should have output iq if the execution had reached that point. It would be great if you could rerun the current code (that is, OPM/opm-common#3881, OPM/opm-grid#708, OPM/opm-models#861, and #5097) and get a backtrace. I wonder what causes the segfault this time.

@akva2
Member

akva2 commented Feb 7, 2024

gdb.txt
new bt

@hakonhagland
Contributor Author

hakonhagland commented Feb 7, 2024

new bt

@akva2 Thanks, but this backtrace does not look like it uses the current version of #5097. For example, frame #3 in the backtrace:

#3  Opm::Pybind::PyFluidState<Opm::Properties::TTag::EclFlowProblemTPFA>::getFluidStateVariable (
    this=<optimized out>, name="po")
    at /home/akva/kode/opm/opm-simulators/opm/simulators/flow/python/PyFluidState_impl.hpp:129

refers to line 129 in PyFluidState_impl.hpp, but this line is commented out in the current version:

//const ElementIterator& elem_end_itr = grid_view.template end</*codim=*/0>();
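
As a small aid for this kind of check, the line number can be pulled out of the `at <file>:<line>` part of a backtrace frame and compared with the source actually checked out. The helper below is a hypothetical sketch (it is not part of gdb or OPM) and assumes the location is the last thing on the line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: extract the line number from the trailing
// "<file>:<line>" of a gdb backtrace frame. Returns -1 if the text after
// the last ':' is missing or is not a plain decimal number.
inline int frameLine(const std::string& frame)
{
    const std::size_t colon = frame.rfind(':');
    if (colon == std::string::npos || colon + 1 >= frame.size())
        return -1;
    const std::string digits = frame.substr(colon + 1);
    if (digits.find_first_not_of("0123456789") != std::string::npos)
        return -1;
    return std::stoi(digits);
}
```

If the statement at the reported line in the current source cannot be the one executing (for example because it is commented out), the binary was most likely built from a different revision.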

Added methods to Python module opm.simulators.BlackOilSimulator to
access primary variables and fluid state variables.
Return vectors by value instead of unique pointers to arrays.
Removed most of the test cases from test_fluidstate_variables.py
Removed also the initialization of the Blackoil simulator
.. and a print statement
Just commit a comment so we can rerun jenkins