Pipeline fixes: non-blocking d'tor #423
Conversation
bors try
tryBuild failed:
bors try
tryBuild failed:
bors try
About cloning
tryBuild succeeded:
About the deadlock and the temporary workaround with hpx::threads::get_thread_manager().wait(): when running the tests on the Piz Daint GPU partition, which run on a single node, with 6 ranks we get a deadlock. Let's see what was creating the deadlock in the tests. And here is the problem:
And here is the deadlock; this is what we found out. The workaround at the moment is to wait for all the tasks before scheduling a collective blocking MPI call. When the algorithm exits, it has already scheduled all its tasks in HPX, so we are making the algorithm blocking again. Note: the barrier we add is hpx::threads::get_thread_manager().wait().
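To make the workaround concrete, here is a minimal sketch (illustrative only, not the actual DLA-Future code; the function name is made up and header paths may differ across HPX versions): every HPX task already scheduled by the algorithm is waited for before the blocking collective is posted, so no rank can reach the collective while another rank is still stuck in an unfinished task.

```cpp
// Illustrative sketch only, not DLA-Future code. Assumes the HPX runtime is
// running; the convenience header path may differ across HPX versions.
#include <hpx/include/threadmanager.hpp>
#include <mpi.h>

void algorithm_made_blocking_again(MPI_Comm comm) {
  // ... the algorithm has already scheduled all of its work as HPX tasks ...

  // Temporary workaround: wait for every scheduled HPX task to finish.
  hpx::threads::get_thread_manager().wait();

  // Only now is the blocking collective MPI call scheduled: no rank can still
  // be blocked in an unfinished task that the collective would deadlock with.
  MPI_Barrier(comm);
}
```

Waiting like this makes the algorithm blocking again, which is why it is only a temporary workaround.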
LGTM so far!

- The solution in Pipeline fixes: non-blocking d'tor #423 (comment) seems reasonable.
- Regarding the temporary workaround hpx::threads::get_thread_manager().wait(): was there a discussion for a permanent solution?
bors try
bors try-
Force-pushed from 67a5c1b to d3a067a.
bors try
bors try-
Force-pushed from d3a067a to a0f9199.
bors try
tryBuild succeeded:
Force-pushed from 41af38c to f01747b.
bors try
tryBuild succeeded:
I did the rebase and fixed the newly merged trsm-LLT. If any of you can give it a quick look to check that I didn't introduce anything bad with the rebase, then it can be merged. 😉
Close #417
(I write this here for future reference; it is what we found out together with @rasolca.)
Removing the blocking call from the Pipeline d'tor exposed two main problems:

1. Resource management with PromiseGuard. For instance, PromiseGuard<Communicator> used with asynchronous communications suffered from the same problem as Tiles, i.e. it has to be kept alive until the completion of the operation and not just for the posting of the operation. EDIT: this is not true; the Communicator has to be kept alive just for the posting of the operation, not until its completion. Otherwise we end up serializing communications, completely missing the point of asynchronous MPI (see the sketch after this list).

2. Reuse of the same CommunicatorGrid by consecutive test configurations, e.g. in test_gen_to_std (DLA-Future/test/unit/eigensolver/test_gen_to_std.cpp, lines 135 to 142 in 41ded01). The problem showed up when one of the ranks was able to start the 2nd configuration before the other ranks had finished the 1st one (e.g. a rank that does not have any tile to work on), so communications for the 2nd configuration were posted mixed up with those of the 1st one.

Both these problems will need better management, and we can think of redesigning a bit how it works; this is somewhat a temporary solution.
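A minimal sketch of the lifetime point in problem 1, using hypothetical names (GuardLike, post_bcast) rather than the real PromiseGuard<Communicator> / DLA-Future API: the guard that grants access to the communicator only needs to stay alive while the non-blocking operation is posted; the MPI request outlives it.

```cpp
// Hypothetical names (GuardLike, post_bcast): a sketch of the idea, not the
// real PromiseGuard<Communicator> / DLA-Future API.
#include <mpi.h>

template <class T>
struct GuardLike {
  // Stand-in for PromiseGuard<T>: while it is alive it grants access to the
  // resource; destroying it hands the resource back to the Pipeline.
  T& ref() { return resource_; }
  T resource_;
};

MPI_Request post_bcast(GuardLike<MPI_Comm> guard, void* data, int count, int root) {
  MPI_Request req;
  // The guard must be alive here, while the operation is posted...
  MPI_Ibcast(data, count, MPI_BYTE, root, guard.ref(), &req);
  return req;
  // ...but it is destroyed as soon as this function returns (it is taken by
  // value). Keeping it until MPI_Wait on `req` would serialize every
  // communication that shares this communicator.
}
```

Releasing the guard right after posting lets other communications on the same communicator be posted concurrently, which is the whole point of asynchronous MPI.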
CHANGES:

- Extend matrix::unwrapExtendTiles to extend the lifetime also of PromiseGuard (and adapt the asynchronous communication kernels accordingly)
- Clone CommunicatorGrids inside each algorithm (see Pipeline fixes: non-blocking d'tor #423 (comment))
- Remove recvBcastAlloc (and adapt test_comm_matrix, which was using it)

Even if this solution is temporary, IMHO it may be worth improving the naming of unwrapExtendTiles, since at this point we are extending the concept of lifetime extension not just to tiles (see the sketch below).
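As a sketch of that lifetime-extension concept, here is a generic helper with a made-up name (extend_lifetime) using std::async; the real code uses HPX futures and matrix::unwrapExtendTiles, whose exact signature is not shown here.

```cpp
// Sketch of the lifetime-extension idea behind unwrapExtendTiles, with a
// made-up helper name and std::async instead of the HPX machinery.
#include <future>
#include <tuple>
#include <utility>

template <class F, class... Ts>
auto extend_lifetime(F&& f, Ts&&... args) {
  // Move the callable and its arguments (tiles, guards, ...) into the task,
  // so they stay alive until the task completes, not just until it is posted.
  return std::async(std::launch::async,
                    [f = std::forward<F>(f),
                     tup = std::make_tuple(std::forward<Ts>(args)...)]() mutable {
                      return std::apply(std::move(f), std::move(tup));
                    });
}
```

Whether a given resource really needs to live until completion (e.g. a Communicator that is only needed to post a non-blocking call) is exactly the point discussed in the EDIT above.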