Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to Force Worker Spreading Across Hosts #166

Open
JessicaLHartog opened this issue Aug 5, 2016 · 4 comments
Open

Ability to Force Worker Spreading Across Hosts #166

JessicaLHartog opened this issue Aug 5, 2016 · 4 comments

Comments

@JessicaLHartog
Copy link
Collaborator

Sometimes the approach of "let Storm do its own thing" when it comes to scheduling worker slots can lead to interesting scenarios:

(1) A topology may wish to enable running agents in the JVM which listen on predefined ports (such as JDWP debugging or JMX remote). In such a scenario if the port is statically defined somewhere, then two workers on the same host will stall the topology. For example, if we pass -Dcom.sun.management.jmxremote.port=9111 as one of the java childopts, we will see an exception like the one below when trying to bring up the second worker:

Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 9111; nested exception is: 
    java.net.BindException: Address already in use

(2) Two worker processes on the same machine for the same topology can lead to inconsistent behavior. In a very resource-hungry topology predictability about how workers are scheduled across various hosts may yield better results when it comes to verifying behavior. For example, suppose a topology writes to disk a lot. Having two workers on the same host will increase the I/O on the worker, and can cause an inability to write heartbeat messages, bringing the worker offline. When these workers are rescheduled, if they are scheduled onto two new hosts, then the topology will likely not cause a crashing of the worker.

(3) Tuning the configuration of a topology for various elements like the number of required workers as well as the number of executors per topology component can be made difficult if the first time a topology is submitted two workers are scheduled on the same host, and after some tweaking of configuration options the workers are spread across more hosts (and vice versa).

To make these behaviors (and others) more predictable it would be nice if there were an option like topology.mesos.scheduler that can define a TopologyScheduler. The purpose of the TopologyScheduler would be to define worker spreading across hosts. Some options that may be useful are:

  • Default - the current "let Storm do its own thing" behavior
  • OneWorkerPerHost - worker slots for the same topology may not use offers from the same host
  • MinimumNumberOfHosts - all worker slots are scheduled on the same host (or on as few hosts as possible), attempting to maximize the number of worker slots on each host

Additional options would also then be possible should a need for them arise.

This also relates to Issue 158: enhance scheduling to act more predictably

@dsKarthick
Copy link
Collaborator

dsKarthick commented Aug 7, 2016

@JessicaLHartog +1 for topology.mesos.scheduler - Thats the whole reason behind creating Scheduling Framework described in #93.

I would like to propose IsolationScheduler along with your multifarious schedulers - such a scheduler would dedicate set of hosts for a particular topology while also ensuring no other worker gets scheduled on to those dedicated hosts. This would be helpful for debugging nefarious topologies.

@erikdw
Copy link
Collaborator

erikdw commented Aug 7, 2016

@dsKarthick : nice, "multifarious", I didn't know that word! ✋ ✋ (high five). And then following it up with "nefarious". "(multi/ne)farious". I wonder what other words end with "farious"... ahh: http://wordinfo.info/unit/3613/ip:1/il:F

But back on topic: yes, totally, we should have that (IsolationScheduler) too.

@bigdata-user
Copy link

@erikdw I have a problem.
2

1
1

the last picture says "Failed to detect a master: Failed to parse data of unknown label 'json.info'" .please tell me how to solve the problem.

@erikdw
Copy link
Collaborator

erikdw commented Aug 16, 2016

Again, please stop posting these messages in random pull requests and issues. Just file a new issue.

  • Erik

On Aug 16, 2016, at 12:38 AM, zhanghangchina [email protected] wrote:

@erikdw I have a problem.

the last picture says "Failed to detect a master: Failed to parse data of unknown label 'json.info'" .please tell me how to solve the problem.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants