Skip to content

Commit

Permalink
planner: simplify accounting during migration (#427)
Browse files Browse the repository at this point in the history
There was a bug in the accounting logic during a DIST_CHANGE request. At
this point, we need to release the slots of the to-be migrated function,
and claim them in the migrated-to hosts.

The current logic assumed that, given two decisions, they would always
start being similar, and then diverge (or even diverge all the way
through).

This is not the case. Take for example the following situation where A
has 5 slots:

```
Old, New
B, A
B, A
B, A
C, A
C, A
A, C
A, C
```

Using our accounting logic, for a while we would have 7 slots assigned
to A. This would break our assertions (did not catch it because it only
happens in Release builds), but was breaking the MPI port assignation.
  • Loading branch information
csegarragonz authored Apr 18, 2024
1 parent da5fee2 commit 952811a
Showing 1 changed file with 18 additions and 18 deletions.
36 changes: 18 additions & 18 deletions src/planner/Planner.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -742,32 +742,32 @@ Planner::callBatch(std::shared_ptr<BatchExecuteRequest> req)
std::set<std::string> evictedHosts(evictedHostsVec.begin(),
evictedHostsVec.end());

// 1. We only need to update the hosts where both decisions differ
// For the moment we print the old and new decisions when
// migrating. We print them separately to help catching bugs
// with the accounting when transfering slots and ports
SPDLOG_INFO("Decided to migrate app {}!", appId);
SPDLOG_INFO("Old decision:");
oldDec->print("info");

// 1. Clear the slots/ports in the old decision, and then claim
// them in the new one. Note that it needs to happen in this order
// as we may need all claimed slots in the old decision for the
// new one
assert(decision->hosts.size() == oldDec->hosts.size());
for (int i = 0; i < decision->hosts.size(); i++) {
// If decision is the same, we propagate the port
if (decision->hosts.at(i) == oldDec->hosts.at(i)) {
decision->mpiPorts.at(i) = oldDec->mpiPorts.at(i);
continue;
}

for (int i = 0; i < oldDec->hosts.size(); i++) {
auto oldHost = state.hostMap.at(oldDec->hosts.at(i));
auto newHost = state.hostMap.at(decision->hosts.at(i));

// Slots accounting
releaseHostSlots(oldHost);
claimHostSlots(newHost);

// MPI ports accounting
releaseHostMpiPort(oldHost, oldDec->mpiPorts.at(i));
}

for (int i = 0; i < decision->hosts.size(); i++) {
auto newHost = state.hostMap.at(decision->hosts.at(i));
claimHostSlots(newHost);
decision->mpiPorts.at(i) = claimHostMpiPort(newHost);
}

// 0. For the time being, when migrating we always print both
// decisions (old and new)
SPDLOG_INFO("Decided to migrate app {}!", appId);
SPDLOG_INFO("Old decision:");
oldDec->print("info");
// Print the new decision after accounting has been updated
SPDLOG_INFO("New decision:");
decision->print("info");
state.numMigrations += 1;
Expand Down

0 comments on commit 952811a

Please sign in to comment.