
Raise an error in scatter when broadcast and AMM are incompatible #8796

Draft · rjzamora wants to merge 1 commit into main

Conversation

rjzamora (Member)
I spent a long time debugging a hang in XGBoost before I noticed this doc-string note about disabling AMM.

It turns out that xgboost.dask.predict(...) uses client.scatter(..., broadcast=True) to replicate the Booster object on all workers. In some cases, the replication process seems to conflict with the active memory manager's ReduceReplicas policy, resulting in a hang.

This PR proposes that Client.scatter raise a clear error when broadcast=True and AMM is enabled in the config. Producing a warning instead would also be fine, but I definitely think it makes sense to be "loud" when the user is likely to run into a problem like this.
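For anyone hitting the same hang in the meantime, here is a minimal sketch of the workaround hinted at by that doc-string note: prevent the AMM from starting before broadcast-scattering. It assumes a local cluster (so the scheduler picks up the config at creation time), and `booster` is an illustrative stand-in for whatever object is being replicated:

```python
import dask
from dask.distributed import Client

# Keep the active memory manager from starting, so ReduceReplicas cannot
# de-replicate the broadcast data. The config key is the same one this
# PR checks in Client.scatter.
with dask.config.set({"distributed.scheduler.active-memory-manager.start": False}):
    client = Client()

# Replicate the object on all workers; `booster` is illustrative (in the
# XGBoost case it is the Booster that xgboost.dask.predict broadcasts).
future = client.scatter(booster, broadcast=True)
```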

@rjzamora rjzamora requested a review from fjetter as a code owner July 23, 2024 18:15
@rjzamora rjzamora marked this pull request as draft July 23, 2024 18:24
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

29 files ±0 · 29 suites ±0 · 12h 5m 45s ⏱️ +2m 0s

|       | Total        | ✅ Passed     | 💤 Skipped  | ❌ Failed |
|-------|--------------|--------------|-------------|-----------|
| Tests | 4 092 (+1)   | 3 968 (−3)   | 112 (±0)    | 12 (+5)   |
| Runs  | 55 353 (+14) | 52 824 (−64) | 2 438 (±0)  | 91 (+79)  |

For more details on these failures, see this check.

Results for commit 88c8565. ± Comparison against base commit 7013e2e.

Comment on lines +2782 to +2784
```python
if broadcast and dask.config.get(
    "distributed.scheduler.active-memory-manager.start"
):
```
crusaderky (Collaborator)

This is problematic.

  1. You want to test specifically for ReduceReplicas. At the moment of writing, the AMM also serves the RetireWorker policy and could serve other, harmless user-defined policies.
  2. The AMM could also have been started by hand with client.amm.start(), in which case the config check would miss it. The API call client.amm.running() is there for this.

I would instead suggest adding a new RPC endpoint to

  • distributed.active_memory_manager.AMMClientProxy and to
  • distributed.active_memory_manager.ActiveMemoryManagerExtension.amm_handler

which returns the set of currently running policies, and testing that instead.
(Caveat: there can be more than one RetireWorker policy running when more than one worker is closing gracefully at the same time.)
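A rough sketch of what this suggestion could look like. The method name `running_policies` is hypothetical, not an existing API, and `amm_handler` would need to allow the new method name in its dispatch:

```python
# Sketch only: on distributed.active_memory_manager.ActiveMemoryManagerExtension,
# a hypothetical method exposed through the existing amm_handler dispatch.
def running_policies(self) -> set[str]:
    """Class names of the policies attached to the AMM, if it is running."""
    if not self.running:
        return set()
    return {type(policy).__name__ for policy in self.policies}

# Sketch only: inside the async Client._scatter, before broadcasting.
policies = await self.scheduler.amm_handler(method="running_policies")
if broadcast and "ReduceReplicas" in policies:
    raise RuntimeError(
        "scatter(..., broadcast=True) is incompatible with the active "
        "memory manager's ReduceReplicas policy"
    )
```

Testing the returned set rather than the config would also cover an AMM started by hand via client.amm.start().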

rjzamora (Member, Author)

Thanks @crusaderky - I agree that the config check is not sufficient, and I appreciate the suggestions!
