fix: asyncio-friendly nccl operations #52

Merged
merged 1 commit into from
Jul 26, 2024


myungjin
Contributor

Description

An NCCL operation in PyTorch's distributed package must first set up an NCCL communicator so that ranks can talk to one another. Setting up the communicator requires consulting the c10d key-value store. That lookup is a blocking call, so it blocks asyncio's event loop and prevents the loop from scheduling other coroutines. The issue is mitigated by using run_in_executor().
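
For readers unfamiliar with the pattern, here is a minimal sketch of how run_in_executor() keeps the event loop responsive while a blocking call is in flight. It is not the code from this PR: `blocking_collective` is a hypothetical stand-in that simply sleeps in place of a real NCCL operation, and `heartbeat` is a made-up coroutine used only to show that other work keeps getting scheduled.

```python
import asyncio
import time


def blocking_collective(duration: float) -> str:
    # Hypothetical stand-in for a blocking call (e.g. an NCCL operation
    # waiting on the c10d key-value store); here it only sleeps.
    time.sleep(duration)
    return "done"


async def heartbeat(ticks: list, interval: float) -> None:
    # Illustrative coroutine that should keep running while the
    # blocking call is in progress.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks, 0.01))

    # Passing None selects the loop's default ThreadPoolExecutor, so the
    # blocking call runs in a worker thread instead of on the loop thread.
    result = await loop.run_in_executor(None, blocking_collective, 0.2)

    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return result, len(ticks)


result, tick_count = asyncio.run(main())
print(result, "heartbeat ticks while blocked:", tick_count)
```

If `blocking_collective` were called directly inside `main()`, the heartbeat would not tick at all until the call returned. Offloading only moves the wait to another thread; it does not shorten the wait or change any timeout on the blocking call itself.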

Note that this does not seem to be a permanent fix. Depending on timing, the blocking still occurs from time to time and leads to an exception such as "torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Socket Timeout".

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@myungjin myungjin merged commit ef048b1 into cisco-open:main Jul 26, 2024
1 check passed
@myungjin myungjin deleted the asyncio_friendly_ccl_ops branch July 26, 2024 00:35