
Seeking advice on improvement reliability of communication. #624

Open
samsja opened this issue Sep 4, 2024 · 5 comments

@samsja (Contributor) commented Sep 4, 2024

First, thanks for the work on Hivemind, it's a great library and we have been using it extensively in https://github.com/PrimeIntellect-ai/OpenDiloco.

There are two main issues that we have encountered and I am looking for tips / best practices on how to avoid them.

  • Peers don't always find each other during DHT initialization. For example, when starting 4 peers, two independent DHTs were sometimes created with 2 peers each, even though the same initial peers were passed to all of them. Once all peers have joined, there is rarely a desync, at least at the DHT level.

  • Lost peers during DecentralizedAverager.step(). We would randomly lose a peer during an all_reduce with a class that inherits from DecentralizedAverager, and there never seemed to be an obvious reason why the peer left.
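Lacking a built-in fix for the first issue, one workaround is to verify peer count after bootstrap and restart the node when it comes up partitioned. A minimal, library-agnostic sketch; the `start_node` and `count_peers` callables here are hypothetical stand-ins (with hivemind they might wrap something like `hivemind.DHT(initial_peers=..., start=True)` and a count of visible peers):

```python
import time


def bootstrap_with_quorum(start_node, count_peers, expected_peers,
                          max_attempts=5, settle_time=0.0):
    """Start a DHT node, retrying until it sees the expected number of peers.

    start_node:     callable that creates and returns a node (hypothetical;
                    e.g. a wrapper around hivemind.DHT)
    count_peers:    callable(node) -> int, peers currently visible to the node
    expected_peers: how many peers we expect to find after a healthy bootstrap
    """
    for attempt in range(1, max_attempts + 1):
        node = start_node()
        time.sleep(settle_time)  # give the routing table a moment to populate
        if count_peers(node) >= expected_peers:
            return node
        # Partitioned or incomplete bootstrap: tear down and try again.
        shutdown = getattr(node, "shutdown", None)
        if callable(shutdown):
            shutdown()
    raise RuntimeError(
        f"failed to reach {expected_peers} peers after {max_attempts} attempts")
```

This does not fix the underlying race, but it turns a silent 2+2 partition into a bounded retry loop instead of a stuck run.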

Both of these issues happen relatively often even in local experiments (communicating over localhost), and they naturally get worse on poorly connected machines. I suspect they are linked, and that solving them would make decentralized training with hivemind more reliable.
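For the second issue, one common mitigation (not hivemind-specific) is to treat a failed averaging round as skippable: keep purely local progress for that round and only raise after several consecutive failures. A hedged sketch, where `step_fn` is a hypothetical callable performing one all-reduce round (with hivemind it might wrap `DecentralizedAverager.step`) and is assumed to raise when a peer drops mid-round:

```python
class TolerantAverager:
    """Skip occasional failed averaging rounds instead of crashing the run.

    step_fn: hypothetical callable performing one all-reduce round; assumed
    to raise an exception when a peer is lost mid-round.
    """

    def __init__(self, step_fn, max_consecutive_failures=3):
        self.step_fn = step_fn
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def step(self):
        try:
            result = self.step_fn()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive_failures:
                raise  # persistent failure: surface it to the caller
            return None  # skip this round, keep local optimizer progress
        self.consecutive_failures = 0
        return result
```

The trade-off is that skipped rounds delay synchronization between workers, so the failure budget should stay small.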

My questions are:

  • Is there a set of DHT/hivemind parameters that would make it more reliable (timeouts, a retry mechanism)?
  • Is there a part of the networking code that could be at fault here and could be improved? (Happy to dig further if pointed to where to look.)

Thanks in advance 🙏

@samsja changed the title from "Seeking adivce on improvement reliability of communication." to "Seeking advice on improvement reliability of communication." on Sep 4, 2024
@Vectorrent (Contributor) commented

Bootstrapping in Hivemind is a nightmare. Even if I start the DHT locally and use just a single node, I am plagued by issues like this:
[screenshot of the hang]
The experts/DHT just hangs at initialization and never completes. This happens maybe 75% of the time; you have to keep restarting the training code.
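Until the root cause is found, one workaround for a hang like this is to put a deadline on initialization and retry automatically, rather than restarting the whole training script by hand. A stdlib-only sketch; in a real setup the worker should probably be a subprocess so a stuck attempt can actually be killed, since a hung thread can only be abandoned:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def init_with_timeout(init_fn, timeout, retries=3):
    """Run a possibly-hanging initialization with a deadline, retrying on timeout.

    init_fn: callable performing the initialization (e.g. starting a DHT node);
    assumed to hang occasionally rather than raise.
    """
    for _ in range(retries):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(init_fn)
        try:
            result = future.result(timeout=timeout)
            pool.shutdown(wait=True)
            return result
        except TimeoutError:
            # Abandon the stuck worker thread; do NOT wait for it to finish.
            pool.shutdown(wait=False)
    raise RuntimeError(f"initialization did not complete in {retries} attempts")
```

Note the deliberate `shutdown(wait=False)` on the timeout path: waiting would block on the very thread that is hanging.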

@samsja (Contributor, Author) commented Sep 6, 2024

> Bootstrapping in Hivemind is a nightmare. Even if I start the DHT locally and use just a single node, I am plagued by issues like this: [screenshot] The experts/DHT just hangs at initialization and never completes. This happens maybe 75% of the time; you have to keep restarting the training code.

I see it more like 25% of the time, but yes, same problem: I end up restarting the DHT multiple times to make it work.

@Vectorrent (Contributor) commented

[video attachment: error.mp4]

Here's a video that demonstrates 2 failed bootstraps, followed by a successful one, on a 100% local DHT.

@samsja (Contributor, Author) commented Sep 10, 2024

> [video attachment: error.mp4]
> Here's a video that demonstrates 2 failed bootstraps, followed by a successful one, on a 100% local DHT.

I have a similar problem. Honestly, this seems like something that could be fixed. Curious @mryab, do you have any hints about where in the codebase it could be coming from?

@Vectorrent (Contributor) commented

I created reproducible code here.
