After the merging of #200 and #213, rebalancing a topology no longer does anything. This is because there are no offers from which slots can be made when a rebalance happens, unless other topologies also happen to need assignments at that moment.
This is a result of the way that Nimbus handles the TopologiesMissingAssignments component. A quick rundown of what now happens (a rough sketch of this flow follows the list):
1. storm-mesos does scheduling of topologies until no topologies need assignments.
2. Since no topologies need assignments, offers are suppressed.
3. storm-mesos doesn't do anything in MesosNimbus, because no topologies need assignments (and offers are already suppressed).
4. A rebalance command comes in and is registered by Nimbus; a :do-rebalance event is scheduled some number of seconds in the future.
5. That many seconds later, there is finally a topology that needs assignment (i.e., the one that was just rebalanced), but there are no offers buffered.
6. Since there are no offers buffered and there are topologies needing assignments, offers are revived.
7. allSlotsAvailableForScheduling returns right after reviving offers.
8. Nimbus wants slots immediately to place the rebalanced topology on, and there's no time for offers to come in and be used in the next allSlotsAvailableForScheduling call.
9. Since there are no slots available for the workers to be rescheduled onto, they don't get rescheduled, and the rebalance therefore does nothing.
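To make the ordering concrete, here is a minimal sketch of the flow above. It is not the actual storm-mesos code: the field and helper names (_driver, _offers, _offersSuppressed, slotsFromBufferedOffers) are illustrative, and only suppressOffers() and reviveOffers() are real SchedulerDriver calls.

```java
import java.util.*;
import org.apache.mesos.Protos.Offer;
import org.apache.mesos.SchedulerDriver;

// Hedged sketch of the flow in the rundown above; loosely mirrors
// INimbus.allSlotsAvailableForScheduling(...) but all names are illustrative.
class OfferFlowSketch {
  private final SchedulerDriver _driver;
  private final Map<String, Offer> _offers = new HashMap<>(); // buffered offers
  private boolean _offersSuppressed = false;

  OfferFlowSketch(SchedulerDriver driver) { _driver = driver; }

  Collection<String> allSlotsAvailableForScheduling(Set<String> topologiesMissingAssignments) {
    if (topologiesMissingAssignments.isEmpty()) {
      // Steps 1-3: nothing needs assignment, so offers get (and stay) suppressed.
      if (!_offersSuppressed) {
        _driver.suppressOffers();
        _offersSuppressed = true;
      }
      return Collections.emptyList();
    }
    // Steps 5-6: a topology (e.g. the one that just hit :do-rebalance) needs
    // assignment, but the offer buffer is empty, so offers are revived...
    if (_offers.isEmpty() && _offersSuppressed) {
      _driver.reviveOffers();
      _offersSuppressed = false;
    }
    // ...and we return immediately (step 7). Revived offers have not arrived
    // yet, so Nimbus sees no slots for the rebalanced topology (steps 8-9).
    return slotsFromBufferedOffers();
  }

  private Collection<String> slotsFromBufferedOffers() {
    // Placeholder: the real framework builds WorkerSlots from buffered offers.
    return new ArrayList<>(_offers.keySet());
  }
}
```

The crux is the last step: the same pass that revives offers also returns, so the rebalanced topology is scheduled against an empty buffer.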
Notably, if there are other topologies needing assignments at the same time as the :do-rebalance is executed, then the rebalance should work as expected.
Note also that this refers only to the Storm UI "Rebalance" action and its associated command. I have not tested this with the kind of rebalance described in the Storm documentation:
## Reconfigure the topology "mytopology" to use 5 worker processes,
## the spout "blue-spout" to use 3 executors and
## the bolt "yellow-bolt" to use 10 executors.
$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
However, I fully expect it hits the same logic in Nimbus, and that the same behavior (or something similar) happens that way too.
Option 1: Write logic that scrapes ZK state to see if there are any topologies in the REBALANCING state and, if there are, stop suppressing Offers (a rough sketch follows below).
Positive(s):
This would be able to accumulate Offers in anticipation of needing them to execute a :do-rebalance.
This limits the amount of perpetual offer collecting by this framework.
Negative(s):
This is complicated and requires a lot of ZK state parsing for little reward.
This is prone to bugs in Storm (like one that exists right now, where if Nimbus dies while a topology is in the REBALANCING state, the only way out is to resubmit the topology).
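For illustration only, here is roughly what that check could look like. I am using the Nimbus Thrift API as a stand-in for parsing raw ZK state (an assumption on my part, not what the option above literally proposes), and the package names assume a pre-1.0, backtype.storm build of Storm.

```java
import java.util.Map;
import backtype.storm.generated.ClusterSummary;
import backtype.storm.generated.TopologySummary;
import backtype.storm.utils.NimbusClient;

// Sketch: is any topology currently REBALANCING? If so, the framework would
// stop suppressing (or revive) Offers in anticipation of the :do-rebalance.
// Only the Storm client calls are real; the class itself is hypothetical.
public class RebalanceProbe {
  public static boolean anyTopologyRebalancing(Map stormConf) throws Exception {
    NimbusClient client = NimbusClient.getConfiguredClient(stormConf);
    try {
      ClusterSummary summary = client.getClient().getClusterInfo();
      for (TopologySummary topology : summary.get_topologies()) {
        if ("REBALANCING".equals(topology.get_status())) {
          return true;
        }
      }
      return false;
    } finally {
      client.close();
    }
  }
}
```

Doing the equivalent against raw ZK means locating and deserializing each topology's stored state just to read its status, which is exactly the parsing the negatives above complain about.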
Option 2: Write logic to hold on to some number of unused Offers so that rebalance does something (a rough sketch follows below).
Positive(s):
This would mean that there are always some Offers that we can leverage to make worker slots on whenever a rebalance command is triggered.
Negative(s):
This means that there are guaranteed to be wasted resources.
This is insufficient for large topologies that need many slots.
This discourages spreading workers across many hosts when rebalancing, since only the few hosts whose Offers were held will be available when the rebalance executes.
It will likely reproduce the same behavior if enough slots for the topology in question cannot be created from the held Offers.
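As a sketch of what holding a reserve could mean in practice (the reserve size, names, and structure here are made up, not the storm-mesos decline path), unused offers beyond a small reserve would be declined while the rest stay buffered:

```java
import java.util.List;
import org.apache.mesos.Protos.Offer;
import org.apache.mesos.SchedulerDriver;

// Illustration of option 2: keep a small reserve of unused offers buffered so
// a later :do-rebalance has something to build slots from. The reserve size
// and this class are hypothetical; declineOffer() is a real driver call.
class OfferReserveSketch {
  private static final int MIN_RESERVED_OFFERS = 3; // arbitrary example value

  void declineBeyondReserve(SchedulerDriver driver, List<Offer> unusedOffers) {
    for (int i = 0; i < unusedOffers.size(); i++) {
      if (i < MIN_RESERVED_OFFERS) {
        continue; // held back: guaranteed waste, but usable if a rebalance comes in
      }
      driver.declineOffer(unusedOffers.get(i).getId());
    }
  }
}
```

Whatever the reserve size, a topology that needs more slots than the reserve can provide hits the original problem again.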
Option 3: Identify a way to release the _offersLock in the first round of scheduling where we have topologies that need assignment, revive and collect Offers, and then use them (a rough sketch follows below).
Positive(s):
This is probably the right way to fix this problem.
This will not hold Offers when we don't need them.
This will enable worker spread across as many hosts as possible for topologies that are being rebalanced.
Negative(s):
This is decidedly challenging because of the locking situation.
We don't want Offers to become unavailable to us when we anticipate using them for scheduling (hence the lock).
It is likely not possible for us to hold the lock, let it go, wait a bit for revived Offers to come in, and then regain the lock with any guarantee of that ordering of events.
This is because there are asynchronous updates happening to the Offers map when Offers are received.
This also would likely require some implementation of a Finite State Machine with transitions for holding/suppressing offers and the various ways in which you can get to/from any given state.
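A very rough sketch of the revive-and-retry idea follows. It deliberately shows the weakness called out above (nothing guarantees offers arrive during the wait), and every name apart from the SchedulerDriver calls is hypothetical; the real _offersLock and offer bookkeeping in MesosNimbus look different.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import org.apache.mesos.SchedulerDriver;

// Illustration of option 3: drop the lock, revive offers, give asynchronous
// resourceOffers() callbacks a moment to repopulate the buffer, then retry.
class ReviveAndRetrySketch {
  private final ReentrantLock _offersLock = new ReentrantLock();
  private final SchedulerDriver _driver;

  ReviveAndRetrySketch(SchedulerDriver driver) { _driver = driver; }

  boolean tryScheduleWithRevive(long waitMillis) throws InterruptedException {
    _offersLock.lock();
    try {
      if (offersAvailable()) {
        return buildSlots(); // normal path: offers were already buffered
      }
    } finally {
      _offersLock.unlock();
    }

    // No buffered offers: revive while NOT holding the lock so that incoming
    // resourceOffers() callbacks can populate the offers map.
    _driver.reviveOffers();
    TimeUnit.MILLISECONDS.sleep(waitMillis); // best-effort wait, no ordering guarantee

    _offersLock.lock();
    try {
      // Offers may or may not have arrived by now -- exactly the race the
      // negatives above describe; a real fix likely needs the FSM mentioned.
      return offersAvailable() && buildSlots();
    } finally {
      _offersLock.unlock();
    }
  }

  private boolean offersAvailable() { return false; } // placeholder for checking the offers map
  private boolean buildSlots() { return false; }      // placeholder for creating worker slots
}
```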