Roadmap #1
One problem @clasqui and @exaexa had with Distributed.jl and Extrae.jl is that they needed to trace the communications, which is doable with MPI thanks to dependency injection but impossible with Distributed. We had to create our own. So a hook system on different parts of Distributed, independent of the manager, would be super nice.
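To make the hook idea concrete, here is a minimal sketch in plain Julia. Everything in it (`HookRegistry`, `add_hook!`, `traced_send`) is hypothetical and exists in neither Distributed.jl nor DistributedNext.jl; it only illustrates hooks that receive the same positional and keyword arguments as the traced call.

```julia
# Hypothetical hook registry: none of these names exist in Distributed.jl
# or DistributedNext.jl.
struct HookRegistry
    hooks::Vector{Function}
end
HookRegistry() = HookRegistry(Function[])

# Register a hook and return the registry so calls can be chained.
add_hook!(reg::HookRegistry, f) = (push!(reg.hooks, f); reg)

# Stand-in for a communication primitive. Every hook sees the same
# positional and keyword arguments as the call itself.
function traced_send(reg::HookRegistry, dest::Int, payload; kwargs...)
    for hook in reg.hooks
        hook(dest, payload; kwargs...)
    end
    # A real implementation would hand `payload` to the transport here.
    return nothing
end

# Usage: a tracer that records destination, payload size in bytes, and tag.
trace_log = Tuple{Int,Int,Any}[]
registry = HookRegistry()
add_hook!(registry, (dest, payload; tag=nothing) -> push!(trace_log, (dest, sizeof(payload), tag)))
traced_send(registry, 2, zeros(UInt8, 16); tag=:halo)
```

Because the registry is independent of any cluster manager, a tracing package could attach to it without knowing how the workers were launched.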
Furthermore, a minor feature that could be nice to experiment with is custom worker ids, like multi-dimensional ids (it is quite common to represent your problem in a 2D / 3D lattice of workers). Also, some "worker grouping" functionality akin to MPI groups and communicators would be interesting, like "broadcast to a group". What would be killer is if a worker could be in multiple groups.
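A rough sketch of what overlapping worker groups could look like, in plain Julia. The names `WorkerGroups`, `add_to_group!`, `groups_of` and `broadcast_to_group` are made up for illustration, and the `send` argument stands in for whatever transport a real implementation would use.

```julia
# Hypothetical group table: group name => set of worker ids.
const WorkerGroups = Dict{Symbol,Set{Int}}

# Add a worker to a group, creating the group on first use.
add_to_group!(groups::WorkerGroups, name::Symbol, pid::Int) =
    push!(get!(groups, name, Set{Int}()), pid)

# All groups a worker belongs to; a worker may be in several.
groups_of(groups::WorkerGroups, pid::Int) =
    sort!([name for (name, members) in groups if pid in members])

# "Broadcast to a group": call `send(pid, msg)` for every member, in id order.
function broadcast_to_group(send, groups::WorkerGroups, name::Symbol, msg)
    for pid in sort!(collect(groups[name]))
        send(pid, msg)
    end
    return nothing
end

# Usage: worker 2 sits in both a row group and a column group of a lattice.
groups = WorkerGroups()
add_to_group!(groups, :row1, 2)
add_to_group!(groups, :row1, 3)
add_to_group!(groups, :col1, 2)

sent = Tuple{Int,String}[]
broadcast_to_group((pid, msg) -> push!(sent, (pid, msg)), groups, :row1, "hello")
```

Storing membership as sets keyed by group name keeps multi-membership trivial; a communicator-like object would need more than this, for example per-group ranks.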
Do I understand correctly that this means it will stop being a master-worker model? Or at least become multi-master? That would be nice.
@mofeing thx for the ping, this would indeed be relevant. I guess there should be some slightly more general mechanism to control the worker spawning; I recently had similar "fun" with just a custom JLL loaded.

@jpsamaroo one question for the "next" version: would there be any improved support for managing worker-local data? We previously did this to "just do it reliably in the simplest way": https://github.com/LCSB-BioCore/DistributedData.jl. In another iteration (with a simpler use case) we managed to simplify it to this: https://github.com/COBREXA/COBREXA.jl/blob/master/src/worker_data.jl, used with

+1 for the other question of @mofeing: having the worker locality somewhat more exposed (so that people can hopefully improve scheduling of stuff that depends on latency and volume) would be great.
From discussing it with Julian, I think the idea is more about better support for workers created under different cluster managers, e.g. some from SSH, some from Slurm, etc. Another possibility is having 'private clusters' that are not visible to
I think there are a few possibilities here. Doing multi-cluster at the level of a single Distributed logical cluster is the option I originally had in mind, but there is also the possibility of a "multi-master" cluster. We'd probably need to discuss the pros and cons of each approach, or find a more general framework that allows both to exist.
Just thinking out loud:
This was an implementation of what they needed at that moment, but I agree. We have since deleted that instrumentation because it wasn't sustainable. IMO hooks should be like in Cassette: they should expect the same args and kwargs to be passed.
Completely, but this should be known if we pass the args, right? As an example, Extrae is capable of measuring message sizes in MPI, and we detected that Serialization was getting confused by a data structure that has repeated refs to arrays and artificially increased the serialized size by a factor of 10-100x. This is impossible to detect with Distributed as it is right now.
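For anyone who wants to check payload sizes by hand in the meantime, here is a small sketch using only the `Serialization` standard library. The helper name `serialized_size` and the test data are made up for illustration, and it measures plain `serialize` output rather than what Distributed actually puts on the wire.

```julia
using Serialization

# Number of bytes `serialize` writes for `x` (hypothetical helper).
function serialized_size(x)
    io = IOBuffer()
    serialize(io, x)
    return length(take!(io))
end

# One array referenced three times versus three independent copies.
a = zeros(Float64, 1_000)
shared = [a, a, a]
copies = [copy(a) for _ in 1:3]

size_shared = serialized_size(shared)
size_copies = serialized_size(copies)
println("shared refs: ", size_shared, " bytes, copies: ", size_copies, " bytes")
```

Comparing these numbers against `Base.summarysize` of the same objects gives a quick hint of whether repeated references are being inflated, which is the kind of measurement a send/receive hook would make available automatically.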
That's amazing @JamesWrigley! Sure, I can try add

Correct me if I'm wrong, but in an Extrae+DistributedNext package extension, I just need to do sth like this, right?

```julia
function __init__()
    DistributedNext.add_worker_starting_callback(extrae_dist_starting_cb)
    DistributedNext.add_worker_started_callback(extrae_dist_start_cb)
    DistributedNext.add_worker_exiting_callback(extrae_dist_exiting_cb)
    DistributedNext.add_worker_exited_callback(extrae_dist_exited_cb)
end
```
Yep, that's it 👍
Okay, I added support here: bsc-quantic/Extrae.jl#22. I will give it a try these days.
Required:
- --worker flag to another addprocs setup mechanism (Make the package actually usable #2)

Enhancements (speculative, these are up for discussion):
- @always_everywhere and other pre-loading support

Bug Fixes:
- isempty (Implement Base.isempty(::RemoteChannel) #3)