Pipeline fixes: non-blocking d'tor #423

Merged
merged 21 commits into from
Dec 2, 2021
Conversation

@albestro commented Sep 23, 2021

Close #417

(I'm writing down here, for future reference, what we found out with @rasolca.)

Removing the blocking call from the Pipeline d'tor exposed two main problems:

  • Resource management with PromiseGuard. For instance, PromiseGuard<Communicator> used for asynchronous communications suffered from the same problem as Tiles, i.e. it has to be kept alive until the completion of the operation and not just for the posting of the operation.
    EDIT: this is not true; the Communicator actually has to be kept alive just for the posting of the operation, not until its completion. Otherwise we end up serializing communications, completely missing the point of asynchronous MPI.
  • Shared communicators: this is something we already knew, but it showed up in a subtle way. Using the same communicator for multiple algorithms may result in mixed/crossed communications, leading to undefined/strange behaviours (e.g. in the case I was debugging it was an "MPI Invalid Communicator"). This happened in test_gen_to_std
    for (const auto& comm_grid : this->commGrids()) {
      for (auto uplo : blas_uplos) {
        for (auto sz : sizes) {
          std::tie(m, mb) = sz;
          testGenToStdEigensolver<TypeParam, Backend::MC, Device::CPU>(comm_grid, uplo, m, mb);
        }
      }
    }
    where the same algorithm is executed sequentially on exactly the same CommunicatorGrid. The problem arose when one of the ranks managed to start the 2nd configuration before the other ranks had finished the 1st one (e.g. a rank that does not have any tile to work on), so communications for the 2nd configuration were posted mixed up with those of the 1st.

Both these problems will need better management and we can think about redesigning a bit how this works; what follows is somewhat of a temporary solution.
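As a minimal illustration of the second problem and of the clone-based change listed in the CHANGES below, here is a plain-MPI sketch (run_algorithm and run_all_configs are hypothetical placeholders, and clone() is assumed to amount to an MPI_Comm_dup, consistent with the MPI_Comm_dup discussion later in this thread):

```cpp
#include <mpi.h>

// Placeholder for one full run of an algorithm (e.g. genToStd on a grid);
// everything it posts uses only the communicator it is given.
void run_algorithm(MPI_Comm algo_comm) {
  // ... post all the non-blocking communications of this run on algo_comm ...
  (void)algo_comm;
}

// Each configuration gets its own duplicated communicator, i.e. a private
// communication context: messages of run N can never match receives of run
// N+1, even if some rank starts run N+1 before others have finished run N.
void run_all_configs(MPI_Comm grid_comm, int num_configs) {
  for (int cfg = 0; cfg < num_configs; ++cfg) {
    MPI_Comm algo_comm;
    MPI_Comm_dup(grid_comm, &algo_comm);  // roughly what clone() amounts to
    run_algorithm(algo_comm);
    MPI_Comm_free(&algo_comm);  // MPI deallocates it only once pending operations on it complete
  }
}
```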

CHANGES:

  • Pipeline d'tor does not block anymore
  • Due to the previous point, the algorithm call is not "blocking" anymore. This resulted in a deadlock condition in an edge case, which has been worked around by waiting for all HPX tasks (see #423 (comment))
  • Extend matrix::unwrapExtendTiles to extend the lifetime also of PromiseGuard (and adapt the asynchronous communication kernels accordingly)
  • Clone the communicators obtained from CommunicatorGrids inside each algorithm (see #423 (comment))
  • Remove unused recvBcastAlloc (and adapt test_comm_matrix that was using it)

Even if this solution is temporary, IMHO it may be worth improving the name of unwrapExtendTiles, since at this point we are extending the concept of lifetime extension beyond just tiles.
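For reference, a hypothetical sketch of the lifetime rule stated in the EDIT of the first bullet above. This is not the actual matrix::unwrapExtendTiles; Guard and guard.ref() are placeholders for PromiseGuard<Communicator> and for however it exposes the wrapped communicator:

```cpp
#include <mpi.h>

// Sketch only: the guard giving exclusive access to the communicator is moved
// *into* the task, so it stays alive exactly until the non-blocking operation
// has been posted, and is released when the function returns. It does not wait
// for completion of the MPI request, which would serialize the communications.
template <class Guard>
MPI_Request post_isend(Guard guard, const void* data, int count, MPI_Datatype dtype,
                       int dest, int tag) {
  MPI_Request req;
  // guard.ref() is a placeholder for accessing the wrapped communicator.
  MPI_Isend(data, count, dtype, dest, tag, guard.ref(), &req);
  return req;
}  // `guard` is destroyed here: the Pipeline can hand the communicator to the next task
```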

@albestro

bors try

bors bot added a commit that referenced this pull request Sep 23, 2021
@albestro albestro self-assigned this Sep 23, 2021
@bors bot commented Sep 23, 2021

try

Build failed:

@albestro commented Nov 2, 2021

bors try

bors bot added a commit that referenced this pull request Nov 2, 2021
@bors bot commented Nov 2, 2021

try

Build failed:

@albestro commented Nov 3, 2021

bors try

bors bot added a commit that referenced this pull request Nov 3, 2021
@albestro commented Nov 3, 2021

About cloning Communicators/CommunicatorGrids inside the algorithms: it is a temporary solution we opted for, but the general idea is to change it so that Communicators will provide a Pipeline directly, e.g. they will internally have a fixed set of Pipelines that can be used by the different algorithms in a round-robin fashion.
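A minimal sketch of that round-robin idea (illustrative only: RoundRobinPipelines and next() do not exist in the library, and Pipeline/Communicator are template placeholders for the library's types, with clone() as mentioned in this PR):

```cpp
#include <cstddef>
#include <vector>

// The communicator/grid owns a fixed pool of pipelines, each built on its own
// cloned communicator, and hands them out round-robin so that different
// algorithms never share a communication context.
template <class Pipeline, class Communicator>
class RoundRobinPipelines {
public:
  RoundRobinPipelines(std::size_t n, const Communicator& comm) {
    pipelines_.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
      pipelines_.emplace_back(comm.clone());  // one independent communicator per pipeline
  }

  // Consecutive algorithm invocations get different pipelines.
  Pipeline& next() {
    Pipeline& p = pipelines_[current_];
    current_ = (current_ + 1) % pipelines_.size();
    return p;
  }

private:
  std::vector<Pipeline> pipelines_;
  std::size_t current_ = 0;
};
```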

@bors bot commented Nov 3, 2021

try

Build succeeded:

@albestro commented Nov 3, 2021

About the deadlock and the temporary workaround with hpx::threads::get_thread_manager().wait().

When running tests on the Piz Daint GPU partition, which run on a single node with 6 ranks, we get 12 cores / 6 ranks = 2 cores/rank. Considering that 1 thread is dedicated to the MPI pool for asynchronous communication (mpi pool), in this configuration we end up with a single core for the default pool.

On the default pool run both the computation tasks of the algorithms and the so-called hpx_main task, which we generally use just for scheduling things.

Let's see what was creating the deadlock in the tests.
Algorithm A0 returns, since the call is not blocking anymore, and afterwards the test waits for the local tiles to check the results. However, this does not imply that all algorithm tasks have finished: for instance, there may be some computation (e.g. on a panel) that is not relevant for the local tiles but is for other ranks, i.e. it still has to be computed and then communicated.
So, we have some algorithm tasks not yet completed/scheduled, but the local check can complete and move on to the next test config, which starts algorithm A1, whose first action is the MPI_Comm_dup in the hpx_main task.

And here is the problem: MPI_Comm_dup is a collective AND blocking call, which means it does not complete until all ranks make the same call. But we are in a condition where:

  • ranks which already completed A0 are stuck in MPI_Comm_dup and cannot do anything else until all other ranks reach the same point; they are stuck because the only core available in the default pool is blocked. This implies that any A0 tasks not yet completed cannot run, and in turn they cannot unlock the related communications that would allow other ranks to advance;
  • ranks which are still executing A0 are waiting for data that will never be communicated, and so they will never reach the MPI_Comm_dup.

And there is the deadlock. This is what we found with gen_to_std, but a similar scenario happens for reduction_to_band, where instead of the MPI_Comm_dup we have an MPI collective call (i.e. MPI_Allgather) in the check phase of the test, which creates exactly the same problem.

The workaround at the moment is to wait for all the tasks before scheduling a blocking collective MPI call. When the algorithm returns, it has already scheduled all its tasks in HPX, so by waiting for them we are effectively making the algorithm blocking again.

Note: the barrier we add with hpx::threads::get_thread_manager().wait() is stronger than the Pipeline's blocking d'tor. Indeed, the blocking d'tor only waited for the last MPI task to be posted, not even for it to finish. Moreover, with this kind of barrier we now wait for everything to complete before continuing (i.e. we also wait for computations, not just MPI).
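In code, the workaround boils down to something like the following test-side sketch (start_next_config is a hypothetical helper, and the HPX include path may differ between versions; hpx::threads::get_thread_manager().wait() is the call quoted above):

```cpp
#include <hpx/include/threadmanager.hpp>  // header path may vary with the HPX version
#include <mpi.h>

// Before the blocking collective that starts the next configuration, drain
// *all* HPX work (computation and MPI posting), so no rank can enter
// MPI_Comm_dup while others still depend on its unfinished A0 tasks.
void start_next_config(MPI_Comm world, MPI_Comm* next_comm) {
  hpx::threads::get_thread_manager().wait();  // barrier over every scheduled HPX task
  MPI_Comm_dup(world, next_comm);             // now safe: all ranks can reach this call
}
```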

@teonnik left a comment

LGTM so far!

  1. The solution in #423 (comment) seems reasonable.
  2. Regarding the temporary workaround hpx::threads::get_thread_manager().wait(): was there a discussion about a permanent solution?

@albestro albestro marked this pull request as ready for review November 4, 2021 05:54
@albestro commented Nov 4, 2021

bors try

bors bot added a commit that referenced this pull request Nov 4, 2021
@albestro commented Nov 4, 2021

bors try-

@albestro commented Nov 4, 2021

bors try

bors bot added a commit that referenced this pull request Nov 4, 2021
@albestro commented Nov 4, 2021

bors try-

@albestro commented Nov 4, 2021

bors try
this should be the good one 🤞🏻 sorry for the spam

bors bot added a commit that referenced this pull request Nov 4, 2021
@bors bot commented Nov 4, 2021

try

Build succeeded:

@albestro commented Dec 1, 2021

bors try

bors bot added a commit that referenced this pull request Dec 1, 2021
@bors bot commented Dec 1, 2021

try

Build succeeded:

@albestro albestro marked this pull request as ready for review December 1, 2021 17:11
@albestro commented Dec 1, 2021

I did the rebase and fixed the newly merged trsm-LLT with clone().

If any of you could give it a quick look to check that I didn't introduce anything bad with the rebase, then it can be merged. 😉
