Pipeline fixes: non-blocking d'tor #423

Merged
merged 21 commits into from
Dec 2, 2021
Conversation

@albestro commented Sep 23, 2021

Close #417

(I'm writing down here, for future reference, what we found out with @rasolca.)

Removing the blocking call from the Pipeline d'tor exposed two main problems:

  • Resource management with PromiseGuard. For instance, PromiseGuard<Communicator> used for asynchronous communications suffered from the same problem as Tiles, i.e. it has to be kept alive until the completion of the operation and not just for the posting of the operation.
    EDIT: this is not true; the Communicator actually has to be kept alive just for the posting of the operation, not until its completion. Otherwise we end up serializing communications, completely missing the point of asynchronous MPI.
  • Shared communicators: this is something we already knew, but it showed up in a subtle way. Using the same communicator for multiple algorithms may result in mixed/crossed communications, leading to undefined/strange behaviours (e.g. in the case I was debugging it was an "MPI Invalid Communicator"). This happened in test_gen_to_std
    for (const auto& comm_grid : this->commGrids()) {
      for (auto uplo : blas_uplos) {
        for (auto sz : sizes) {
          std::tie(m, mb) = sz;
          testGenToStdEigensolver<TypeParam, Backend::MC, Device::CPU>(comm_grid, uplo, m, mb);
        }
      }
    }
    where the same algorithm is executed sequentially on exactly the same CommunicatorGrid. The problem arose when one of the ranks managed to start the 2nd configuration before the other ranks had finished the 1st one (e.g. a rank that does not have any tile to work on), so communications for the 2nd configuration were posted mixed up with those of the 1st.

Both these problems will need better management and we can think about redesigning a bit how this works; what follows is somewhat of a temporary solution.
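As a minimal illustration of the second problem and of the clone-based change listed in the CHANGES below, here is a plain-MPI sketch (run_algorithm and run_all_configs are hypothetical placeholders, and clone() is assumed to amount to an MPI_Comm_dup, consistent with the MPI_Comm_dup discussion later in this thread):

```cpp
#include <mpi.h>

// Placeholder for one full run of an algorithm (e.g. genToStd on a grid);
// everything it posts uses only the communicator it is given.
void run_algorithm(MPI_Comm algo_comm) {
  // ... post all the non-blocking communications of this run on algo_comm ...
  (void)algo_comm;
}

// Each configuration gets its own duplicated communicator, i.e. a private
// communication context: messages of run N can never match receives of run
// N+1, even if some rank starts run N+1 before others have finished run N.
void run_all_configs(MPI_Comm grid_comm, int num_configs) {
  for (int cfg = 0; cfg < num_configs; ++cfg) {
    MPI_Comm algo_comm;
    MPI_Comm_dup(grid_comm, &algo_comm);  // roughly what clone() amounts to
    run_algorithm(algo_comm);
    MPI_Comm_free(&algo_comm);  // MPI deallocates it only once pending operations on it complete
  }
}
```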

CHANGES:

  • Pipeline d'tor does not block anymore
  • Due to the previous point, the algorithm call is not "blocking" anymore. This resulted in a deadlock condition in an edge case, which has been worked around by waiting for all HPX tasks (see #423 (comment))
  • Extend matrix::unwrapExtendTiles to extend the lifetime also of PromiseGuard (and adapt the asynchronous communication kernels accordingly)
  • Clone the communicators obtained from CommunicatorGrids inside each algorithm (see #423 (comment))
  • Remove unused recvBcastAlloc (and adapt test_comm_matrix that was using it)

Even if this solution is temporary, IMHO it may be worth improving the name of unwrapExtendTiles, since at this point we are extending the concept of lifetime extension beyond just tiles.
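For reference, a hypothetical sketch of the lifetime rule stated in the EDIT of the first bullet above. This is not the actual matrix::unwrapExtendTiles; Guard and guard.ref() are placeholders for PromiseGuard<Communicator> and for however it exposes the wrapped communicator:

```cpp
#include <mpi.h>

// Sketch only: the guard giving exclusive access to the communicator is moved
// *into* the task, so it stays alive exactly until the non-blocking operation
// has been posted, and is released when the function returns. It does not wait
// for completion of the MPI request, which would serialize the communications.
template <class Guard>
MPI_Request post_isend(Guard guard, const void* data, int count, MPI_Datatype dtype,
                       int dest, int tag) {
  MPI_Request req;
  // guard.ref() is a placeholder for accessing the wrapped communicator.
  MPI_Isend(data, count, dtype, dest, tag, guard.ref(), &req);
  return req;
}  // `guard` is destroyed here: the Pipeline can hand the communicator to the next task
```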

@albestro

bors try

bors bot added a commit that referenced this pull request Sep 23, 2021
@albestro albestro self-assigned this Sep 23, 2021
@bors bot commented Sep 23, 2021

try

Build failed:

@albestro commented Nov 2, 2021

bors try

bors bot added a commit that referenced this pull request Nov 2, 2021
@bors bot commented Nov 2, 2021

try

Build failed:

@albestro commented Nov 3, 2021

bors try

bors bot added a commit that referenced this pull request Nov 3, 2021
@albestro commented Nov 3, 2021

About cloning Communicators/CommunicatorGrids inside the algorithms: it is a temporary solution we opted for, but the general idea is to change it so that Communicators will provide a Pipeline directly, e.g. they will internally have a fixed set of Pipelines that can be used by the different algorithms in a round-robin fashion.
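A minimal sketch of that round-robin idea (illustrative only: RoundRobinPipelines and next() do not exist in the library, and Pipeline/Communicator are template placeholders for the library's types, with clone() as mentioned in this PR):

```cpp
#include <cstddef>
#include <vector>

// The communicator/grid owns a fixed pool of pipelines, each built on its own
// cloned communicator, and hands them out round-robin so that different
// algorithms never share a communication context.
template <class Pipeline, class Communicator>
class RoundRobinPipelines {
public:
  RoundRobinPipelines(std::size_t n, const Communicator& comm) {
    pipelines_.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
      pipelines_.emplace_back(comm.clone());  // one independent communicator per pipeline
  }

  // Consecutive algorithm invocations get different pipelines.
  Pipeline& next() {
    Pipeline& p = pipelines_[current_];
    current_ = (current_ + 1) % pipelines_.size();
    return p;
  }

private:
  std::vector<Pipeline> pipelines_;
  std::size_t current_ = 0;
};
```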

@bors bot commented Nov 3, 2021

try

Build succeeded:

@albestro commented Nov 3, 2021

About the deadlock and the temporary workaround with hpx::threads::get_thread_manager().wait().

When running tests on the Piz Daint GPU partition, which run on a single node with 6 ranks, we get 12 cores / 6 ranks = 2 cores/rank. Considering that 1 thread is dedicated to the MPI pool for asynchronous communication (mpi pool), in this configuration we end up with a single core for the default pool.

On the default pool run both the computation tasks of the algorithms and the so-called hpx_main task, which we generally use just for scheduling things.

Let's see what was creating the deadlock in the tests.
Algorithm A0 returns, since the call is not blocking anymore, and afterwards the test waits for the local tiles to check the results. However, this does not imply that all algorithm tasks have finished: for instance, there may be some computation (e.g. on a panel) that is not relevant for the local tiles but is for other ranks, i.e. it still has to be computed and then communicated.
So, we have some algorithm tasks not yet completed/scheduled, but the local check can complete and move on to the next test config, which starts algorithm A1, whose first action is the MPI_Comm_dup in the hpx_main task.

And here is the problem: MPI_Comm_dup is a collective AND blocking call, which means it does not complete until all ranks make the same call. But we are in a condition where:

  • ranks which already completed A0 are stuck in MPI_Comm_dup and cannot do anything else until all other ranks reach the same point; they are stuck because the only core available in the default pool is blocked. This implies that any A0 tasks not yet completed cannot run, and in turn they cannot unlock the related communications that would allow other ranks to advance;
  • ranks which are still executing A0 are waiting for data that will never be communicated, and so they will never reach the MPI_Comm_dup.

And there is the deadlock. This is what we found with gen_to_std, but a similar scenario happens for reduction_to_band, where instead of the MPI_Comm_dup we have an MPI collective call (i.e. MPI_Allgather) in the check phase of the test, which creates exactly the same problem.

The workaround at the moment is to wait for all the tasks before scheduling a blocking collective MPI call. When the algorithm returns, it has already scheduled all its tasks in HPX, so by waiting for them we are effectively making the algorithm blocking again.

Note: the barrier we add with hpx::threads::get_thread_manager().wait() is stronger than the Pipeline's blocking d'tor. Indeed, the blocking d'tor only waited for the last MPI task to be posted, not even for it to finish. Moreover, with this kind of barrier we now wait for everything to complete before continuing (i.e. we also wait for computations, not just MPI).
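In code, the workaround boils down to something like the following test-side sketch (start_next_config is a hypothetical helper, and the HPX include path may differ between versions; hpx::threads::get_thread_manager().wait() is the call quoted above):

```cpp
#include <hpx/include/threadmanager.hpp>  // header path may vary with the HPX version
#include <mpi.h>

// Before the blocking collective that starts the next configuration, drain
// *all* HPX work (computation and MPI posting), so no rank can enter
// MPI_Comm_dup while others still depend on its unfinished A0 tasks.
void start_next_config(MPI_Comm world, MPI_Comm* next_comm) {
  hpx::threads::get_thread_manager().wait();  // barrier over every scheduled HPX task
  MPI_Comm_dup(world, next_comm);             // now safe: all ranks can reach this call
}
```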

@teonnik left a comment

LGTM so far!

  1. The solution in #423 (comment) seems reasonable.
  2. Regarding the temporary workaround hpx::threads::get_thread_manager().wait(): was there a discussion about a permanent solution?

@albestro albestro marked this pull request as ready for review November 4, 2021 05:54
@albestro commented Nov 4, 2021

bors try

bors bot added a commit that referenced this pull request Nov 4, 2021
@albestro commented Nov 4, 2021

bors try-

@albestro commented Nov 4, 2021

bors try

bors bot added a commit that referenced this pull request Nov 4, 2021
@albestro commented Nov 4, 2021

bors try-

@albestro commented Nov 4, 2021

bors try
this should be the good one 🤞🏻 sorry for the spam

bors bot added a commit that referenced this pull request Nov 4, 2021
@bors bot commented Nov 4, 2021

try

Build succeeded:

@albestro commented Dec 1, 2021

bors try

bors bot added a commit that referenced this pull request Dec 1, 2021
@bors bot commented Dec 1, 2021

try

Build succeeded:

@albestro albestro marked this pull request as ready for review December 1, 2021 17:11
@albestro commented Dec 1, 2021

I did the rebase and fixed the newly merged trsm-LLT with clone().

If any of you could give it a quick look to check that I didn't introduce anything bad with the rebase, then it can be merged. 😉
