
Raise an error in scatter when broadcast and AMM are incompatible #8796

Draft · rjzamora wants to merge 1 commit into main

Conversation

rjzamora (Member)
I spent a long time debugging a hang in XGBoost before I noticed this doc-string note about disabling AMM.

It turns out that xgboost.dask.predict(...) uses client.scatter(..., broadcast=True) to replicate the Booster object on all workers. In some cases, the replication process seems to conflict with the active memory manager's ReduceReplicas policy, resulting in a hang.

This PR proposes that Client.scatter raise a clear error when broadcast=True and AMM is enabled in the config. Producing a warning instead would also be fine, but I definitely think it makes sense to be "loud" when the user is likely to run into a problem like this.
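For anyone hitting the same hang in the meantime, here is a minimal sketch of the workaround hinted at by that doc-string note: prevent the AMM from starting before broadcast-scattering. It assumes a local cluster (so the scheduler picks up the config at creation time), and `booster` is an illustrative stand-in for whatever object is being replicated:

```python
import dask
from dask.distributed import Client

# Keep the active memory manager from starting, so ReduceReplicas cannot
# de-replicate the broadcast data. The config key is the same one this
# PR checks in Client.scatter.
with dask.config.set({"distributed.scheduler.active-memory-manager.start": False}):
    client = Client()

# Replicate the object on all workers; `booster` is illustrative (in the
# XGBoost case it is the Booster that xgboost.dask.predict broadcasts).
future = client.scatter(booster, broadcast=True)
```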

@rjzamora rjzamora requested a review from fjetter as a code owner July 23, 2024 18:15
@rjzamora rjzamora marked this pull request as draft July 23, 2024 18:24
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

29 files ±0 · 29 suites ±0 · 12h 5m 45s ⏱️ +2m 0s

|       | Total        | ✅ Passed     | 💤 Skipped  | ❌ Failed |
|-------|--------------|--------------|-------------|-----------|
| Tests | 4 092 (+1)   | 3 968 (−3)   | 112 (±0)    | 12 (+5)   |
| Runs  | 55 353 (+14) | 52 824 (−64) | 2 438 (±0)  | 91 (+79)  |

For more details on these failures, see this check.

Results for commit 88c8565. ± Comparison against base commit 7013e2e.

Comment on lines +2782 to +2784
```python
if broadcast and dask.config.get(
    "distributed.scheduler.active-memory-manager.start"
):
```
crusaderky (Collaborator)

This is problematic.

  1. You want to test specifically for ReduceReplicas. At the moment of writing, the AMM also serves the RetireWorker policy and could serve other, harmless user-defined policies.
  2. The AMM could also have been started by hand with client.amm.start(), in which case the config check would miss it. The API call client.amm.running() is there for this.

I would instead suggest adding a new RPC endpoint to

  • distributed.active_memory_manager.AMMClientProxy and to
  • distributed.active_memory_manager.ActiveMemoryManagerExtension.amm_handler

which returns the set of currently running policies, and testing that instead.
(Caveat: there can be more than one RetireWorker policy running when more than one worker is closing gracefully at the same time.)
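A rough sketch of what this suggestion could look like. The method name `running_policies` is hypothetical, not an existing API, and `amm_handler` would need to allow the new method name in its dispatch:

```python
# Sketch only: on distributed.active_memory_manager.ActiveMemoryManagerExtension,
# a hypothetical method exposed through the existing amm_handler dispatch.
def running_policies(self) -> set[str]:
    """Class names of the policies attached to the AMM, if it is running."""
    if not self.running:
        return set()
    return {type(policy).__name__ for policy in self.policies}

# Sketch only: inside the async Client._scatter, before broadcasting.
policies = await self.scheduler.amm_handler(method="running_policies")
if broadcast and "ReduceReplicas" in policies:
    raise RuntimeError(
        "scatter(..., broadcast=True) is incompatible with the active "
        "memory manager's ReduceReplicas policy"
    )
```

Testing the returned set rather than the config would also cover an AMM started by hand via client.amm.start().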

rjzamora (Member, Author)

Thanks @crusaderky - I agree that the config check is not sufficient, and I appreciate the suggestions!
